[Scons-users] modifying/extending SCons to run jobs on Torque cluster
Dirk Bächle
tshortik at gmx.de
Fri Oct 3 05:57:46 EDT 2014
Hi Thomas,
I'd like to basically second what Bill said. On a technical level, you
can certainly subclass/rewrite the Node/Taskmaster classes...and there
have been requests for more info about it in the past. But it's an awful
lot of work, and everyone who wanted to try anyway seems to have given
up at some point.
(* switching to meta-level mode *)
My understanding of your problem/project is that you are trying to use
SCons as a "driver" for your scheduling system. In a way, you want to
"traffic-shape" the individual build processes so that they run on a
multiprocessor machine/cluster (I tinkered with openPBS on a
48-core Linux cluster some years ago).
If your build process is based on files and their dependencies, the
current Node class and the Taskmaster should provide all the information
you need, for deciding whether a single part of the project has to be
rebuilt or not. The Taskmaster already prepares info packets for you, in
the form of the Job class instances, that then only have to be executed
somewhere.
And this is probably the best place where your extension could come into
play. You could try to derive from the "Job" class and extend it, such
that it is also able to run a single build step via your scheduler
system (you seem to have that ready in your custom Builder).
Quoting a part of your original email
On 02.10.2014 21:24, Thomas Lippincott wrote:
> [...]
>
> I would like to do something like subclassing the TaskManager to be able
> to examine the dependency tree and choose subtrees to submit as Torque
> jobs.
This is where the real problem is: deciding which nodes to build via
Torque (or another scheduler), and which not, is super hard. You're
trying to implement a second scheduler...which requires you to "add more
information" to the system. I don't think you'll be able to compute an
efficient schedule (taking the actual cluster/machine where things are
executed into account) just by traversing the dependency tree.
I'll go one step further and state that I wouldn't touch this problem
with a ten foot pole.
(* meta-level mode off *)
Instead, I would stick to manually marking the nodes that are eligible
for scheduling, within the SConscripts. You could write a
wrapper/decorator method like:
    prog = TorqueJob(env.Program('main', Glob('*.cpp')),
                     required_mem="2GB",..., other keys)
for "tagging" the target node "main" in this case. The Node class
already has the member "attributes" which you can use to store
meta-information about it (this is what the Java builder does, for example).
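A minimal sketch of such a tagging wrapper, to make the idea concrete.
Everything named here is an assumption for illustration: "TorqueJob" is
the hypothetical wrapper from the snippet above, the attribute name
"torque" is made up, and "FakeNode" only stands in for a real SCons Node
so that the example runs without SCons. The only thing taken from SCons
itself is that a Node carries an "attributes" member.

```python
class _Attrs:
    """Stand-in for the empty attribute holder found on a SCons Node."""


class FakeNode:
    """Stand-in for a SCons Node, used only to demonstrate the tagging."""

    def __init__(self, name):
        self.name = name
        self.attributes = _Attrs()


def TorqueJob(targets, **resources):
    """Tag target nodes as eligible for Torque; return targets unchanged.

    Hypothetical helper: stores the user's resource hints (for example
    required_mem="2GB") under the made-up attribute name 'torque'.
    """
    try:
        nodes = list(targets)      # a list of nodes, as builders return
    except TypeError:
        nodes = [targets]          # a single node was passed in
    for node in nodes:
        node.attributes.torque = dict(resources)
    return targets


def torque_request(node):
    """Return the stored resource hints, or None for untagged nodes."""
    return getattr(node.attributes, "torque", None)


main = FakeNode("main")
TorqueJob([main], required_mem="2GB")
print(torque_request(main))
```

Returning the targets unchanged keeps the wrapper transparent, so the
result can still be passed on to Default(), Depends() and friends.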
In your custom Job class you can then take the final steps to check
whether the current target should be built via Torque (or locally, if
the overall load on the cluster is too high already), based on the
meta-info given by the user. Then set up the correct environment for
this, before scheduling the actual command-line action.
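The decision itself could look roughly like the sketch below. It is not
SCons API: "pick_backend", the attribute name "torque", the load value
and the 0.8 threshold are all assumptions for illustration, and how the
cluster load is actually measured is left open.

```python
from types import SimpleNamespace


def pick_backend(attributes, cluster_load, max_load=0.8):
    """Return 'torque' or 'local' for one target.

    attributes   -- the node's attribute holder; a 'torque' entry on it
                    is the (hypothetical) tag set in the SConscript
    cluster_load -- current load as a fraction between 0.0 and 1.0,
                    however you choose to measure it
    max_load     -- assumed threshold above which we build locally
    """
    request = getattr(attributes, "torque", None)
    if request is None:
        return "local"       # the user did not mark this target
    if cluster_load >= max_load:
        return "local"       # cluster too busy, fall back
    return "torque"


tagged = SimpleNamespace(torque={"required_mem": "2GB"})
untagged = SimpleNamespace()
print(pick_backend(tagged, cluster_load=0.3))
print(pick_backend(tagged, cluster_load=0.95))
print(pick_backend(untagged, cluster_load=0.3))
```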
Just keep in mind that the Taskmaster expects these single Job
executions to be blocking. So you start a build step, and when it has
finished executing, it has either completed (target got built) or
failed. This info and behaviour is crucial to the currently implemented
algorithm...
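One way to get that blocking behaviour on top of an asynchronous
scheduler is to submit and then poll until the job reaches a final
state. The sketch below only shows the control flow: "run_blocking" and
the "submit" and "poll" callables are hypothetical, and so are the status
strings. How they map onto your Torque commands is up to you.

```python
import time


def run_blocking(submit, poll, interval=1.0, sleep=time.sleep):
    """Submit one build step and block until it has finished.

    submit -- callable that queues the job and returns a job id
    poll   -- callable taking the job id, returning "running",
              "done" or "failed" (assumed status values)
    Returns True if the target got built, False if the step failed.
    """
    job_id = submit()
    while True:
        status = poll(job_id)
        if status == "done":
            return True
        if status == "failed":
            return False
        sleep(interval)      # still running, wait before asking again


# Demonstration with a fake scheduler that finishes on the third poll.
states = iter(["running", "running", "done"])
ok = run_blocking(lambda: "job-1", lambda job_id: next(states),
                  sleep=lambda seconds: None)
print(ok)
```

From the caller's point of view the step then looks like any local
command: the call returns only once the result is known.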
So much for my thoughts, I hope it gives you a few new ideas.
Best regards,
Dirk