[Scons-users] modifying/extending SCons to run jobs on Torque cluster

Thomas Lippincott tom.lippincott at gmail.com
Fri Oct 3 15:14:41 EDT 2014


It's definitely true that I've gotten a lot out of simply switching
command-line builders to submit the command along with some minimal
configuration.  It sounds like this could be simplified even further
with your suggestion of redefining the environment variable(s).

Just submitting the job isn't enough, though: SCons needs to track the
job so it doesn't proceed to any targets that depend on it.  One nice
aspect of Torque is that jobs can have dependencies on other jobs; this
could be useful if SCons dependencies could be programmatically
translated into Torque dependencies.
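Torque expresses job dependencies as an attribute passed to qsub, so the
translation mostly means remembering the Torque IDs of already-submitted
prerequisite jobs.  A minimal sketch; the helper name and the calling
convention are made up for illustration, only the `-W depend=afterok:`
syntax is Torque's:

```python
# Hypothetical sketch: turn an SCons-style dependency edge into a
# Torque submission held until its prerequisites succeed.

def build_qsub_command(script, dep_job_ids):
    """Return the qsub argv that submits `script`, held until every job
    in `dep_job_ids` finishes successfully (Torque's afterok dependency)."""
    cmd = ["qsub"]
    if dep_job_ids:
        cmd += ["-W", "depend=afterok:" + ":".join(dep_job_ids)]
    cmd.append(script)
    return cmd

# A target whose two prerequisites were already submitted as Torque
# jobs 101 and 102:
print(build_qsub_command("link_main.sh", ["101", "102"]))
# → ['qsub', '-W', 'depend=afterok:101:102', 'link_main.sh']
```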

Torque does have extensive resource-request functionality, so that can
help mitigate the problem of jobs being too small.

To me, the great appeal of deeper changes is the ability to use
non-command-line builders without modification.  I still like the idea
of running separate SCons invocations on each node, each building a
fragment of the dependency tree, but I'd need to stop them from
contending for access to the .sconsign.db file and then reconcile the
results on the main node.  Is there functionality for directing the
scons command to use a different .sconsign.db file, and an API for
reading/merging/writing entries across different files?
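For the first half of that, the SConsignFile() call in the SConstruct
looks like it should work; a minimal sketch (config fragment only --
the per-node naming scheme is just one assumption about how the
databases could be kept apart):

```python
# SConstruct fragment (sketch): point this invocation at its own
# signature database, named here after the cluster node it runs on.
import platform

SConsignFile('.sconsign-%s' % platform.node())

env = Environment()
env.Program('main', Glob('*.cpp'))
```

The merging half would still need an API of its own.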

-Tom





On 10/03/2014 12:13 PM, Bill Deegan wrote:
> Does Torque allow you to request resources? (this is one way we select
> which type of node to run jobs on in SGE).
> If so, you could just define a "BIGJOB" resource, mark only certain
> nodes as having it, and request it when running the "big" jobs.
> 
> -Bill
> 
> On Fri, Oct 3, 2014 at 2:57 AM, Dirk Bächle <tshortik at gmx.de> wrote:
> 
>> Hi Thomas,
>>
>> I'd like to basically second what Bill said. On a technical level, you
>> can certainly subclass/rewrite the Node/Taskmaster classes...and there
>> have been requests for more info about this in the past. But it's an
>> awful lot of work, and everyone who wanted to try anyway seems to have
>> given up at some point.
>>
>> (* switching to meta-level mode *)
>> My understanding of your problem/project is that you're trying to use
>> SCons as a "driver" for your scheduling system. In a way, you want to
>> "traffic-shape" the individual build processes so they run on a
>> multiprocessor machine/cluster (I tinkered with openPBS on a 48-core
>> Linux cluster some years ago).
>>
>> If your build process is based on files and their dependencies, the
>> current Node class and the Taskmaster should provide all the
>> information you need to decide whether a single part of the project
>> has to be rebuilt or not. The Taskmaster already prepares info packets
>> for you, in the form of Job class instances, which then only have to
>> be executed somewhere.
>>
>> And this is probably the best place where your extension could come
>> into play. You could try to derive from the "Job" class and extend it
>> so that it is also able to run a single build step via your scheduler
>> system (you seem to have that part ready in your custom Builder).
>> Quoting a part of your original email
>>
>> On 02.10.2014 21:24, Thomas Lippincott wrote:
>>
>>> [...]
>>>
>>> I would like to do something like subclassing the Taskmaster to be
>>> able to examine the dependency tree and choose subtrees to submit as
>>> Torque jobs.
>>>
>> This is where the real problem is: deciding which nodes to build via
>> Torque (or another scheduler), and which not, is super hard. You're
>> trying to implement a second scheduler...which requires you to "add
>> more information" to the system. I don't think you'll be able to
>> compute an efficient schedule (taking into account the actual
>> cluster/machine where things get executed) just by traversing the
>> dependency tree.
>>
>> I'll go one step further and state that I wouldn't touch this problem with
>> a ten foot pole.
>> (* meta-level mode off *)
>>
>> Instead, I would stick to manually marking the nodes that are eligible
>> for scheduling, within the SConscripts. You could write a
>> wrapper/decorator method like:
>>
>>   prog = TorqueJob(env.Program('main', Glob('*.cpp')),
>>                    required_mem="2GB", ..., other keys)
>>
>> for "tagging" the target node "main" in this case. The Node class already
>> has the member "attributes" which you can use to store meta-information
>> about it (this is what the Java builder does, for example).
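A minimal sketch of such a wrapper; the StubNode class only stands in
for an SCons Node so the snippet runs on its own (real Nodes carry a
similar free-form `attributes` object), and the Torque resource keys
are illustrative:

```python
class StubNode:
    """Stand-in for an SCons Node, which exposes a free-form
    `attributes` namespace we can hang meta-information on."""
    class _Attrs:
        pass

    def __init__(self, name):
        self.name = name
        self.attributes = self._Attrs()

def TorqueJob(targets, **resources):
    """Tag each target node with the user's Torque resource requests."""
    for t in targets:
        t.attributes.torque_resources = resources
    return targets

prog = TorqueJob([StubNode('main')],
                 required_mem='2GB', walltime='01:00:00')
print(prog[0].attributes.torque_resources['required_mem'])  # → 2GB
```

A custom Job class could later read `attributes.torque_resources` off
each target to decide whether and how to submit it.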
>>
>> In your custom Job class you can then take the final steps: check
>> whether the current target should be built via Torque (or locally, if
>> the overall load on the cluster is already too high), based on the
>> meta-information given by the user, then set up the correct
>> environment before scheduling the actual command-line action.
>> Just be aware that the Taskmaster expects these single Job executions
>> to be blocking: you start a build step, and when it finishes executing
>> it is either complete (the target got built) or it failed. This
>> behaviour is crucial to the currently implemented algorithm...
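One way to satisfy that blocking contract is to submit the job and then
poll the scheduler until it leaves the queue.  A rough sketch; the
helper names are made up, and the qstat parsing assumes Torque's
default column layout (job id, name, user, time, state, queue), with
state 'C' meaning completed:

```python
import subprocess
import time

def job_finished(qstat_line):
    """Decide from one line of `qstat` output whether the job is done:
    Torque reports state 'C' (completed) in the fifth column."""
    fields = qstat_line.split()
    return len(fields) >= 5 and fields[4] == "C"

def wait_for_job(job_id, poll_seconds=10):
    """Block until `job_id` completes, mimicking the blocking execution
    the Taskmaster expects.  Sketch only: a robust version would also
    distinguish failure states and handle qstat errors."""
    while True:
        out = subprocess.run(["qstat", job_id], capture_output=True,
                             text=True).stdout
        lines = [l for l in out.splitlines() if l.startswith(job_id)]
        if not lines or job_finished(lines[0]):
            return
        time.sleep(poll_seconds)

print(job_finished("123.master  build_main  tom  00:01:02  C  batch"))
# → True
```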
>>
>> So much for my thoughts, I hope it gives you a few new ideas.
>>
>> Best regards,
>>
>> Dirk
>>
>>
>> _______________________________________________
>> Scons-users mailing list
>> Scons-users at scons.org
>> https://pairlist4.pair.net/mailman/listinfo/scons-users
>>
> 

