[Scons-users] modifying/extending SCons to run jobs on Torque cluster
Thomas Lippincott
tom.lippincott at gmail.com
Thu Oct 2 15:24:38 EDT 2014
Hi all,
I use SCons for running NLP experiments, and I'm interested in
implementing some functionality that makes it easier to use SCons with a
Torque HPC system (a batch scheduler derived from the Portable Batch
System, "PBS"). The basic idea is that you submit a large number of shell
commands, possibly with constraints on the resources they need, and Torque
runs them on suitable machines. In other words, I want to use it like the
"-j" switch, but on
a much larger scale. Right now, my implementation is only for
command-line Builders: the corresponding Torque Builder just submits the
job to the scheduler, gets the job ID, polls the scheduler until the job
finishes, and determines whether the job succeeded. This works because
a) the execution environment is guaranteed to be identical on the submit
and execution nodes and b) all nodes share a file system (NFS).
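
The submit-and-poll loop described above could look roughly like the following. This is only a sketch, not the actual implementation: the function names are made up for illustration, and it assumes that "qsub" prints the new job ID on stdout, that "qstat -f" reports "job_state" and "exit_status" lines, and that the Torque server keeps completed jobs visible long enough to be queried (the "keep_completed" server setting).

```python
import re
import subprocess
import time


def run_command(argv):
    """Run a command and return its stdout as text; raises on non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def submit_job(script_path, run=run_command):
    """Submit a job script with qsub and return the job ID it prints."""
    return run(["qsub", script_path]).strip()


def job_status(job_id, run=run_command):
    """Return (state, exit_status) parsed from `qstat -f` output.

    exit_status stays None until Torque reports it for a completed job.
    """
    text = run(["qstat", "-f", job_id])
    state = re.search(r"job_state = (\S+)", text)
    exit_status = re.search(r"exit_status = (-?\d+)", text)
    return (
        state.group(1) if state else None,
        int(exit_status.group(1)) if exit_status else None,
    )


def wait_for_job(job_id, poll_seconds=30, run=run_command, sleep=time.sleep):
    """Poll until the job reaches state C (completed); True if it exited 0."""
    while True:
        state, exit_status = job_status(job_id, run)
        if state == "C":
            return exit_status == 0
        sleep(poll_seconds)
```

The "run" and "sleep" parameters exist only so the loop can be exercised without a real cluster; a Builder action would call submit_job() and then wait_for_job() with the defaults.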
There are a few issues with this:
1: Not all builders are equal. Targets that take an hour should be run
as separate jobs on their own compute nodes, but submitting thousands of
10-second targets as individual jobs will be slow at best and bring down
the system at worst. This is a
pretty common situation for computational experiments, where a single
run can be composed of lightweight text-mangling, intense
number-crunching, and low-to-medium-weight evaluation.
2: Each Torque job must correspond to a thread on the submit node, and
so we still need to use the "-j" switch. These threads won't do much,
they just ask Torque whether their job is done and then go to sleep for
a while, but "-j 1000" isn't a good idea! This can be mitigated
somewhat by "accumulating" jobs to be submitted (a la the "Jar"
builder) and then having one thread monitor multiple jobs, but
this is still hackish and unsatisfying.
3: This only works for command-line actions. Ideally it would work
for *any* SCons target, without modification to Builder definitions
(which it currently relies on).
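
For issue 2, the "one thread monitors multiple jobs" idea can be sketched as a single polling loop over a set of pending job IDs. Everything here is hypothetical: query_states stands in for whatever wraps one "qstat" call and returns a state per job, and on_done is whatever marks the corresponding target as finished.

```python
import time


def monitor_jobs(job_ids, query_states, on_done, poll_seconds=30, sleep=time.sleep):
    """Poll all pending jobs from one loop instead of one thread per job.

    query_states(ids) -> {job_id: state} is a hypothetical callable, for
    example a wrapper around a single qstat call. on_done(job_id) is
    called once for each job that reaches state "C" (completed).
    """
    pending = set(job_ids)
    while pending:
        states = query_states(sorted(pending))
        finished = {j for j in pending if states.get(j) == "C"}
        for job_id in sorted(finished):
            on_done(job_id)
        pending -= finished
        if pending:
            sleep(poll_seconds)
```

With this shape the number of scheduler queries per polling interval stays at one no matter how many jobs are in flight, so the "-j" value no longer has to match the number of submitted jobs.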
I would like to do something like subclassing the TaskManager to be able
to examine the dependency tree and choose subtrees to submit as Torque
jobs. An execute node would build a subtree by running "scons LEAF1
LEAF2..." where LEAFs are the leaves of the subtree. The execute node
invocations of SCons wouldn't interact with the ".sconsign.dblite" file,
but with a node-specific DB file. After an execute node completes, the
submit node would merge the changes from the node-specific DB file into
".sconsign.dblite".
I think this would address some of the issues. For 1), the TaskManager
can choose tree fragments it believes are appropriately sized to benefit
from Torque (perhaps using annotations or statistics about how long each
Builder typically takes). For 3), since we're directly invoking scons on
the execute nodes, any existing Builder should just work as-is. I
haven't thought through how 2) will be affected, but it seems likely
that looking at things from the TaskManager level rather than the
Builder level will make it easier to find improvements.
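
To make the fragment-selection idea concrete, here is a rough sketch that works on a plain dictionary rather than on SCons internals. It assumes the dependencies form a tree (no shared dependencies) and that a per-target time estimate in seconds is available from annotations or past runs; the names deps, seconds and max_seconds are invented for the example.

```python
def subtree_cost(node, deps, seconds):
    """Total estimated build time of node plus everything below it."""
    return seconds[node] + sum(
        subtree_cost(d, deps, seconds) for d in deps.get(node, [])
    )


def choose_fragments(root, deps, seconds, max_seconds):
    """Return fragment roots, in an order that respects dependencies.

    A subtree whose estimated cost fits within max_seconds becomes one
    fragment. Otherwise its children are split recursively, and the node
    itself is listed after them as a fragment of its own.
    """
    if subtree_cost(root, deps, seconds) <= max_seconds:
        return [root]
    fragments = []
    for child in deps.get(root, []):
        fragments.extend(choose_fragments(child, deps, seconds, max_seconds))
    fragments.append(root)
    return fragments


def fragment_commands(fragment_roots):
    """One scons invocation per fragment, to be wrapped in a Torque job."""
    return [["scons", target] for target in fragment_roots]
```

With an hour-long "train" step and a cheap "stats" subtree, for example, a 600-second budget keeps the whole "stats" subtree in one job while "train" and its input are submitted separately. A job for a later fragment would still have to wait for the fragments it depends on.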
Anyways, I'm looking for any feedback on this. At a high level, has it
been done? Is it feasible? Should I be posting on scons-dev? Is this
something I should try to get funding to work on for a few months? At a
lower level, is it a good idea to try subclassing TaskManager? Are
there any examples of extensions that subclass TaskManager? Are there
much better approaches I'm just unaware of? I'd prefer to just make
"advanced user" extensions, rather than changing anything in the actual
SCons code base.
Thanks so much for the help,
-Tom