[Scons-users] modifying/extending SCons to run jobs on Torque cluster

Fri Oct 3 05:57:46 EDT 2014

Hi Thomas,

I'd like to basically second what Bill said. On a technical level, you 
can certainly subclass/rewrite the Node/Taskmaster classes...and there 
have been requests for more info about it in the past. But it's an awful 
lot of work, and all the people who wanted to try anyway seem to have 
given up at some point.

(* switching to meta-level mode *)
My understanding of your problem/project is that you are trying to use 
SCons as a "driver" for your scheduling system. In a way, you want to 
"traffic-shape" the single build processes, to let them run on a 
multiprocessor machine/cluster (I tinkered with openPBS on a 48-core 
Linux cluster some years ago).

If your build process is based on files and their dependencies, the 
current Node class and the Taskmaster should provide all the information 
you need for deciding whether a single part of the project has to be 
rebuilt or not. The Taskmaster already prepares info packets for you, in 
the form of Job class instances, which then only have to be executed 
somewhere.

And this is probably the best place where your extension could come into 
play. You could try to derive from the "Job" class and extend it, such 
that it is also able to run a single build step via your scheduler 
system (you seem to have that ready in your custom Builder).

Quoting a part of your original email:

On 02.10.2014 21:24, Thomas Lippincott wrote:
> [...]
>
> I would like to do something like subclassing the TaskManager to be able
> to examine the dependency tree and choose subtrees to submit as Torque
> jobs.
This is where the real problem is: deciding which nodes to build via 
Torque (or another scheduler), and which not, is super hard. You're 
trying to implement a second scheduler...which requires you to "add more 
information" to the system. I don't think you'll be able to compute an 
efficient schedule (taking the actual cluster/machine where things are 
executed into account) just by traversing the dependency tree.

I'll go one step further and state that I wouldn't touch this problem 
with a ten foot pole.
(* meta-level mode off *)

Instead, I would stick to manually marking the nodes that are eligible 
for scheduling, within the SConscripts. You could write a 
wrapper/decorator method like:

   prog = TorqueJob(env.Program('main', Glob('*.cpp')),
                    required_mem="2GB",..., other keys)

for "tagging" the target node "main" in this case. The Node class 
already has the member "attributes" which you can use to store 
meta-information about it (this is what the Java builder does, for example).
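
A minimal sketch of such a wrapper, to make the tagging idea concrete. 
It only assumes that each node carries an "attributes" object that 
accepts new members. The attribute name "torque" and the FakeNode 
stand-in are made up for this example (FakeNode is only there so the 
snippet runs outside of SCons); in a SConscript you would pass the list 
returned by env.Program() instead.

```python
class FakeNode(object):
    """Stand-in for an SCons node, only for trying the sketch standalone."""

    class Attrs(object):
        pass

    def __init__(self, name):
        self.name = name
        self.attributes = self.Attrs()


def TorqueJob(targets, **resources):
    """Tag the given target nodes as eligible for the scheduler.

    The keyword arguments (e.g. required_mem="2GB") are stored as they
    are; 'torque' is a hypothetical attribute name, not an SCons one.
    """
    # Builders return a list of nodes, but accept a single node as well.
    if not isinstance(targets, (list, tuple)):
        targets = [targets]
    for node in targets:
        node.attributes.torque = dict(resources)
    # Hand the nodes back, so the call can wrap a builder call in place.
    return targets


# Usage: tag one node, then look up the meta-information later on.
prog = TorqueJob([FakeNode('main')], required_mem="2GB")
print(prog[0].attributes.torque)
```

Nodes that were never passed through the wrapper simply don't have the 
attribute, so a later getattr(node.attributes, 'torque', None) is enough 
to tell the two cases apart.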

In your custom Job class you can then take the final steps to check 
whether the current target should be built via Torque (or locally, if 
the overall load on the cluster is too high already), based on the 
meta-information given by the user. Then, set up the correct environment 
for this, before scheduling the actual command-line action.

Just keep in mind that the Taskmaster expects these single Job 
executions to be blocking. So you start a build step, and when it has 
finished executing it's either complete (target got built) or it failed. 
This info and behaviour is crucial to the currently implemented 
algorithm...
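
As a rough illustration of the "Torque or local" decision and the 
blocking behaviour described above, here is a sketch in plain Python. 
All names in it are hypothetical: run_step and its parameters are not 
part of SCons, and submit_and_wait stands for whatever code submits the 
command to your scheduler and waits for the job to end.

```python
import subprocess
import sys


def run_local(argv):
    # subprocess.call() blocks until the child process has exited and
    # returns its exit status.
    return subprocess.call(argv)


def run_step(argv, torque_opts, cluster_busy, submit_and_wait,
             local=run_local):
    """Run one build step and return only after it has finished.

    argv            -- command line of the build step, as a list
    torque_opts     -- meta-information from the SConscript, or None
                       if the target was not tagged
    cluster_busy    -- True if the cluster load is already too high
    submit_and_wait -- hypothetical callable(argv, opts) that submits
                       the step, waits for it and returns its status

    Returns True if the step succeeded, False if it failed.
    """
    if torque_opts is not None and not cluster_busy:
        status = submit_and_wait(argv, torque_opts)
    else:
        # Untagged target, or cluster overloaded: build locally.
        status = local(argv)
    return status == 0


def fake_submit(argv, opts):
    # Placeholder for the real scheduler call; always reports success.
    return 0


# An untagged step runs locally and blocks until the child exits.
ok = run_step([sys.executable, "-c", "pass"], None, False, fake_submit)
print(ok)
```

The important part is that run_step() has a definite result by the time 
it returns, no matter which of the two routes was taken.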

So much for my thoughts, I hope it gives you a few new ideas.

Best regards,

Dirk