[Scons-users] modifying/extending SCons to run jobs on Torque cluster
Bill Deegan
bill at baddogconsulting.com
Thu Oct 2 23:21:06 EDT 2014
Tom,
I'm not familiar with Torque; however, I'm pretty familiar with LSF and
SGE (Sun Grid Engine, now Univa or one of several open source
distributions of the same base code).
Why not just change env['SHELL'], or the shell string in the Action for
the builder you want, to something like <torque job submit script> <flag
to wait for completion>? The command to run would then follow that prefix.
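A minimal sketch of that first option, assuming you have some site-specific
wrapper that submits a command to Torque and blocks until the job is done.
The script name "./torque_submit.sh", its "--wait" flag, and the helper name
wrap_for_cluster are all placeholders I made up for illustration, not
anything that ships with SCons or Torque:

```python
import shlex

# Placeholder for a site-specific wrapper that submits a command to Torque
# and blocks until the job finishes. Both the script name and the flag are
# hypothetical.
SUBMIT_PREFIX = ["./torque_submit.sh", "--wait"]


def wrap_for_cluster(command, prefix=SUBMIT_PREFIX):
    """Return a shell command line that runs `command` through the wrapper."""
    # Quote the original command so the wrapper receives it as one argument,
    # even if it contains spaces, pipes, or redirections.
    return " ".join(list(prefix) + [shlex.quote(command)])
```

In an SConstruct you could then pass the wrapped string as the action, e.g.
env.Command(target, source, wrap_for_cluster("...")), but I haven't tried
this against a real cluster.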
Or have the job write out a script, with the contents of env['ENV']
written into it, and submit that.
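Roughly what I mean by writing env['ENV'] into the script, as an untested
sketch. render_job_script is a hypothetical helper name, and it only builds
the script text; writing it to disk and handing it to your submit command is
left out:

```python
import shlex


def render_job_script(command, env_vars):
    """Build the text of a shell script that exports env_vars, then runs command."""
    lines = ["#!/bin/sh"]
    for name in sorted(env_vars):
        # Quote values so spaces and shell metacharacters survive intact.
        value = shlex.quote(str(env_vars[name]))
        lines.append("export {}={}".format(name, value))
    # The command itself goes last, unmodified.
    lines.append(command)
    return "\n".join(lines) + "\n"
```

From an SConstruct you would pass env['ENV'] as env_vars, so the job sees
the same execution environment SCons would have used locally.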
The Taskmaster (which I think is what you mean by TaskManager) is pretty
(perhaps overly) complex.
I'm guessing you'll get a pretty big bang for the buck without going that
deep.
-Bill
On Thu, Oct 2, 2014 at 12:24 PM, Thomas Lippincott <tom.lippincott at gmail.com> wrote:
> Hi all,
>
> I use SCons for running NLP experiments, and I'm interested in
> implementing some functionality for SCons to make it easier to use a Torque
> HPC system (a Portable Batch System, "PBS"). The basic idea is that you
> can submit a large number of shell commands, possibly with restrictions
> on what resources they need, and Torque will run them on suitable
> machines. In other words, I want to use it like the "-j" switch, but on
> a much larger scale. Right now, my implementation is only for
> command-line Builders: the corresponding Torque Builder just submits the
> job to the scheduler, gets the job ID, polls the scheduler until the job
> finishes, and determines whether the job succeeded. This works because
> a) the execution environment is guaranteed to be identical on the submit
> and execution nodes and b) all nodes share a file system (NFS).
>
> There are a few issues with this:
>
> 1: Not all builders are equal. Targets that take an hour should be run
> as separate jobs on their own compute nodes, but submitting thousands
> of 10-second targets that way will be slow at best and bring down the
> system at worst. This is a
> pretty common situation for computational experiments, where a single
> run can be composed of lightweight text-mangling, intense
> number-crunching, and low-to-medium-weight evaluation.
> 2: Each Torque job must correspond to a thread on the submit node, and
> so we still need to use the "-j" switch. These threads won't do much,
> they just ask Torque whether their job is done and then go to sleep for
> a while, but "-j 1000" isn't a good idea! This can be mitigated
> somewhat, by "accumulating" jobs to be submitted (a la the "Jar"
> builder) and then having one thread monitor multiple jobs, but still,
> this is hackish and unsatisfying.
> 3: This only works for command-line actions. Ideally it would work
> for *any* SCons target, without modification to Builder definitions
> (which it currently relies on).
>
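On point 2, a minimal sketch of having one thread monitor many jobs instead
of one thread per job. Everything here is an assumption for illustration:
wait_for_jobs is a made-up name, and query_state stands in for whatever you
already use to ask Torque about a job. It should return None while the job
is still queued or running and a final state string once it is done:

```python
import time


def wait_for_jobs(job_ids, query_state, poll_interval=30.0, sleep=time.sleep):
    """Poll until every job is finished; return {job_id: final_state}.

    query_state is a caller-supplied (hypothetical) function mapping a job
    ID to None while the job is pending, or to a final state string.
    """
    pending = set(job_ids)
    finished = {}
    while pending:
        for job_id in sorted(pending):
            state = query_state(job_id)
            if state is not None:
                finished[job_id] = state
        # Drop everything that completed during this pass.
        pending -= set(finished)
        if pending:
            # Only sleep when there is still something to wait for.
            sleep(poll_interval)
    return finished
```

With something like this, a single "-j" slot could cover a whole batch of
submitted jobs rather than needing "-j 1000".
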
> I would like to do something like subclassing the TaskManager to be able
> to examine the dependency tree and choose subtrees to submit as Torque
> jobs. An execute node would build a subtree by running "scons LEAF1
> LEAF2..." where LEAFs are the leaves of the subtree. The execute node
> invocations of SCons wouldn't interact with the ".sconsign.dblite" file,
> but with a node-specific DB file. After an execute node completes, the
> submit node would merge the changes from the node-specific DB file into
> ".sconsign.dblite".
>
> I think this would address some of the issues. For 1), the TaskManager
> can choose tree fragments it believes are appropriately sized to benefit
> from Torque (perhaps using annotations or statistics about how long each
> Builder typically takes). For 3), since we're directly invoking scons on
> the execute nodes, any existing Builder should just work as-is. I
> haven't thought through how 2) will be affected, but it seems likely
> that looking at things from the TaskManager level rather than the
> Builder level will make it easier to find improvements.
>
> Anyways, I'm looking for any feedback on this. At a high level, has it
> been done? Is it feasible? Should I be posting on scons-dev? Is this
> something I should try to get funding to work on for a few months? At a
> lower level, is it a good idea to try subclassing TaskManager? Are
> there any examples of extensions that subclass TaskManager? Are there
> much better approaches I'm just unaware of? I'd prefer to just make
> "advanced user" extensions, rather than changing anything in the actual
> SCons code base.
>
> Thanks so much for the help,
>
> -Tom