[Scons-users] CacheDir race during parallel Windows builds?

Jason Kenny dragon512 at live.com
Sun Aug 7 12:03:15 EDT 2016


Hi William,

 

I was just looking over the e-mail. I can quickly reply that I removed this code in Parts. It was based on making Windows act like Unix when spawning a process via a fork-like setup; I was using subprocess. In fact this prevented me from opening CON: and had some other negative effects on our -j builds.

As you can see in my overrides in Parts here: https://bitbucket.org/sconsparts/parts/src/3a389f774f234694994071d784af88c3babaad03/parts/overrides/os_file.py?at=master

 

At lines 22-23 there, I do an import to help replace that code.

 

In fact, some of the first code added to this file was this; adding the open() replacement to get the SHARE_xxx flags came second.

 

I am unclear on whether this would affect CacheDir.

 

 

Jason

 

From: Scons-users [mailto:scons-users-bounces at scons.org] On Behalf Of William Blevins
Sent: Sunday, August 7, 2016 10:40 AM
To: SCons users mailing list <scons-users at scons.org>
Subject: Re: [Scons-users] CacheDir race during parallel Windows builds?

 

Jason,

I was looking at SCons/Platform/win32.py, and it appears that SCons replaces the default open on Windows with win32api. Snippet from 2.5.0:

try:
    import msvcrt
    import win32api
    import win32con

    msvcrt.get_osfhandle
    win32api.SetHandleInformation
    win32con.HANDLE_FLAG_INHERIT
except ImportError:
    parallel_msg = \
        "you do not seem to have the pywin32 extensions installed;\n" + \
        "\tparallel (-j) builds may not work reliably with open Python files."
except AttributeError:
    parallel_msg = \
        "your pywin32 extensions do not support file handle operations;\n" + \
        "\tparallel (-j) builds may not work reliably with open Python files."
else:
    parallel_msg = None

    _builtin_file = file
    _builtin_open = open

    class _scons_file(_builtin_file):
        def __init__(self, *args, **kw):
            _builtin_file.__init__(self, *args, **kw)
            win32api.SetHandleInformation(msvcrt.get_osfhandle(self.fileno()),
                win32con.HANDLE_FLAG_INHERIT, 0)

    def _scons_open(*args, **kw):
        fp = _builtin_open(*args, **kw)
        win32api.SetHandleInformation(msvcrt.get_osfhandle(fp.fileno()),
                                      win32con.HANDLE_FLAG_INHERIT,
                                      0)
        return fp

    file = _scons_file
    open = _scons_open

 

V/R,

William 

 

On Sun, Aug 7, 2016 at 3:48 PM, Andrew C. Morrow <andrew.c.morrow at gmail.com> wrote:

 

Hi Jason -

 

Thanks for getting back to me. Comments below:

 

 

On Sun, Aug 7, 2016 at 1:35 AM, Jason Kenny <dragon512 at live.com> wrote:

 

Sorry for the delay; I did not see this email, as it was moved to “junk” for some reason.

 

I do recall having issues with not being able to open a file on Windows. There are a couple of reasons this happened:

 

1)      Antivirus software

There is none installed on the system beyond anything intrinsic to Windows, and if that were interfering, I'd expect a very low rate of occurrence, since you would need to be unlucky enough to touch a file in some way at the same time as the AV scanner. The rate of occurrence here is high (see below).

 

 

2)      The search client in the background indexing the files

The failures are far too common for this to always be the cause; basically the same argument as above.

 

 

3)      The way Python or other classic “C” programs open files on Windows

More on that below.

 

 

 

 

The last is the most important. In Parts I added support for long path/file names (up to around 32K in length) and support for symlinks and hardlinks. Honestly, a lot of the “ugly” win32 code in Parts is based on this support. Some of it is also for supporting Win2000/XP systems; that code should be pulled out at this point. The core problem that I found, and reported to some people at Microsoft, was the way fopen() opens files on Windows (it was being looked at; I don't know if it was fixed). It does not have FILE_SHARE_WRITE and FILE_SHARE_DELETE set correctly. This creates a case in which a process can open a file simply to read it and thereby prevent other processes from writing to it. This prevents hardlinks from working as they should on win32 systems.
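To illustrate the sharing-flags point, here is a rough sketch (not Parts' actual code) of opening a file for reading via CreateFileW with all three FILE_SHARE_* flags set, so the reader does not block writers or deleters the way a CRT fopen() reader can. The constant values are the standard ones from winnt.h; on non-Windows platforms the helper just falls back to plain open(), which imposes no mandatory sharing locks anyway.

```python
import ctypes
import os
import sys

# Win32 constants from winnt.h / winbase.h, hard-coded so the sketch
# has no pywin32 dependency.
FILE_SHARE_READ = 0x1
FILE_SHARE_WRITE = 0x2
FILE_SHARE_DELETE = 0x4
GENERIC_READ = 0x80000000
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x80

def open_shared_for_read(path):
    # Open `path` for reading without blocking writers or deleters.
    # On Windows this goes through CreateFileW with all three
    # FILE_SHARE_* flags, which the CRT's fopen() does not pass.
    if sys.platform != 'win32':
        return open(path, 'rb')
    import msvcrt
    create_file = ctypes.windll.kernel32.CreateFileW
    create_file.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_uint32,
        ctypes.c_void_p, ctypes.c_uint32, ctypes.c_uint32,
        ctypes.c_void_p]
    create_file.restype = ctypes.c_void_p  # handles are pointer-sized
    handle = create_file(
        path,
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        None,                   # default security attributes
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        None)                   # no template file
    if handle in (None, ctypes.c_void_p(-1).value):  # INVALID_HANDLE_VALUE
        raise OSError('CreateFileW failed for %r' % path)
    fd = msvcrt.open_osfhandle(handle, os.O_RDONLY)
    return os.fdopen(fd, 'rb')
```

The helper name is hypothetical; the real Parts override lives in parts/overrides/os_file.py and covers more operations than read-open.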

 

I don't see how that could be a problem here. Once an object file has been copied out of the cache, nothing is trying to write to the copied file, only the linker is trying to read it. With the fix for the file handle leak applied, the only way this could happen would be if shutil.copy2 failed to close the file, and it also uses the with ... as pattern.

 

I suppose the other way it could happen would be if SCons decided to try to start the link step while the copy of a needed object file out of the cache was still in progress, so that the file in the variant dir being copied to was still open for write. That would seem like the sort of error that could occur in a parallel build if somehow the signal that something had been extracted from the cache was emitted prematurely (like, when the copy started, rather than when it finished), allowing another thread to begin to execute the dependent link task too early.
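One generic mitigation for exactly that window (a sketch of the technique, not what SCons 2.5.0 actually does) is to copy the cached file to a temporary name in the destination directory and then atomically rename it into place, so a dependent task can never observe the target half-written or still open for write:

```python
import os
import shutil
import tempfile

def fetch_from_cache(cache_path, target_path):
    # Hypothetical helper: copy the cached artifact to a temp file in
    # the destination directory, then atomically rename into place.
    # Readers either see the target absent or complete, never partial.
    target_dir = os.path.dirname(target_path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    os.close(fd)
    try:
        shutil.copy2(cache_path, tmp_path)  # contents + metadata
        os.replace(tmp_path, target_path)   # atomic on POSIX and NTFS
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The temp file must live on the same filesystem as the target for the rename to be atomic, which is why it is created in the target's own directory.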

 

But then, if that was the case, why haven't we seen this on Linux builds?

 

 

 

Anyway, to get all this to work I had to move everything to use the Win32 API for file handling. Doing this did seem to resolve a number of issues. This should all (or mostly) be done as a big monkey patch in parts/overrides/os_file.py. In theory one should be able to just import Parts in the main SConstruct and it should all work. (I admit I might be wrong on some details for a raw SCons build.) But given I did fix the last known items, it should be simple to test whether this helps by just adding “from parts import *” at the start of the SConstruct (and installing Parts).

 

If the Python Windows file handling is buggy and renders SCons unreliable, shouldn't this part of parts be upstreamed into SCons?

 

 

 

We also had random failures due to a case in which a tool (in our case the Intel tools) would open a header for reading, and because it was opened with fopen() without FILE_SHARE_DELETE, a hard link to that file in some SDK area could not be deleted, failing the build (as SCons wants to delete everything before it runs the actions). We hacked some fixes to prevent this from happening. I found that in general the tools from Microsoft seem to open files correctly (via a win32 call), which leaves Python as a likely cause. If I were to guess at what is happening, there might be something going on with SCons doing some check on the .obj file with CacheDir on.

If so, the Parts fixes might resolve it if Python/SCons is holding a file handle, given that we change the default sharing mode used to open a file at the Python level.

 

I didn't really understand this part.

 

 

 

Overall, the information about parts is useful background, but adding yet another layer of complexity does not seem like a step in the right direction while trying to diagnose this. Adding Parts might indeed make the problem go away, but I still wouldn't know why it did so, or what the root cause was.

 

I think it is important to find that out.

 

 

 

I would also find it interesting to know what happens if you rebuild right after this failure. It was common for us to do an scons -j32 … build, then, if that failed, an scons -k -j1 … to get a complete list of failures to fix. We noticed on both Linux and Windows that this helped us see failures that just went away on the second build, which gave us clues about what was going on.

 

I do this frequently in my development workflow if I have, for instance, a very localized compile error that I know how to fix but want to see how the rest of the build goes, but it doesn't help for our CI builds. Our -j10 builds on Windows already take upwards of 40 minutes, which is why we are so interested in CacheDir. Backing down to serial after a failure isn't an option. It would also be very confusing: given a failed build step, how do I distinguish between a build that failed due to flaky build-system behavior (and therefore fall back to serial) and a build that failed because of a legitimate compile error (in which case I should stop)? In the latter case, continuing with a -k build is just going to emit an astronomical logfile full of compiler errors and burn a ton of time and CPU.

 

 

 

In this case, does the file now link after the SCons process has died, or is it still locked in the new SCons process?

 

If I repeatedly invoke SCons, each build gets some more targets linked. Eventually, I get to the end. The setup here is that the cache dir is populated, but all linked executables need to be rebuilt. Starting with an empty variant dir, the first run links many targets but eventually fails. Restarting the build while leaving the artifacts from run 1 in the variant dir makes further progress, but also eventually fails. After 5 or 6 cycles of this, enough targets have been built that the last run has only a little work to do at all, and the build passes. 

 

 

Does this happen all the time?

 

Yes, every time. Starting from the populated-cache-dir, empty-variant-dir state, where all executables require re-linking, I have never observed a successful one-shot parallel build when CacheDir is enabled.

 

 

On the same file?

 

No, different files are seen as non-existing/unreadable by the linker each time. It is truly unfortunate that Microsoft in their wisdom failed to make the error reporting for LNK1104 (https://msdn.microsoft.com/en-us/library/ts7eyw4s.aspx) more detailed. As it is, there is no way to see whether the linker simply couldn't find the file because it didn't exist, or found it but was unable to read it due to either permissions or locking. I will try the above suggestion of processmonitor to try to find out which is really happening.
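Pending the Process Monitor run, a rough way to tell those cases apart from the build side (a hypothetical diagnostic helper, not anything SCons provides) is to stat and then reopen the file immediately after the link fails:

```python
import errno
import os

def diagnose_unopenable(path):
    # Hypothetical post-mortem check: classify why the linker may have
    # failed to open `path`, since LNK1104 itself does not say.
    if not os.path.exists(path):
        return 'missing'
    try:
        with open(path, 'rb'):
            pass
    except PermissionError:
        # On Windows, sharing violations also surface as EACCES here,
        # so this bucket covers both permissions and file locks.
        return 'permissions-or-lock'
    except OSError as exc:
        return 'other: %s' % errno.errorcode.get(exc.errno, exc.errno)
    # It exists and opened fine just now: the failure was transient.
    return 'readable-now'
```

This only narrows the possibilities, since a lock held during the link may be gone by the time the check runs; Process Monitor remains the authoritative tool.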

 

 

And does it happen only on -j build with j>1?

 

I'm only interested in -j>1 builds, really. As mentioned above, serial builds would take hours. However, in the interest of determining whether parallelism is at fault, I ran one. Again, this is starting with a populated cache dir and empty variant dir. Basically, all the .obj files are in the cache, but all of the executables need to be re-linked.

 

With CacheDir enabled but without -jN, it worked in one shot.

 

So we know that parallel builds without CacheDir work because we have been doing them forever. And we know that serial builds with CacheDir work, because I just did one. And we know that parallel builds with cache dir work on Linux, because our developers have been doing it forever, and we turned it on for Linux builds in our CI system without incident.

 

The problem, whatever it may be, appears to be limited to (or, perhaps, is only evident during) parallel CacheDir enabled builds on Windows.

 

 

 

I also recall us having issues with our CI systems, as they ran in VMs. At times, odd behavior would happen on a VM versus a bare-metal machine; however, these issues seemed to show up randomly, not constantly.

 

Our entire CI system (and, increasingly, the rest of the world's) is oriented around AWS and other VM-ish providers, so I don't really think that advocating bare metal as a workaround is actionable. Note, however, that there is still randomness here: the problem occurs with different files/targets every time.

 

 

 

Hope this helps

 

I do appreciate your sharing your thoughts on this. We had been building up to large scale deployment of CacheDir in our CI system for some time, and with high hopes for dramatically improved build times, so there is definite frustration that CacheDir seems not to work on the platform where we needed it most.

 

Thanks,

Andrew

 

 

 

Jason

 

From: Bill Deegan [mailto:bill at baddogconsulting.com]
Sent: Thursday, August 4, 2016 7:40 PM
To: SCons users mailing list <scons-users at scons.org>; Jason Kenny <dragon512 at live.com>
Subject: Re: [Scons-users] CacheDir race during parallel Windows builds?

 

Jason,

I know you did a lot of work on Windows builds.

Did you run into anything similar to this?

I notice Parts pulls in win32's CopyFile API

-Bill

 

On Thu, Aug 4, 2016 at 10:20 AM, Andrew C. Morrow <andrew.c.morrow at gmail.com> wrote:

 

Hi -

 

At MongoDB, we recently started using CacheDir in our CI system. This has been a big success for reducing rebuild times for our Linux builds, however, we were surprised to find that our Windows builds started failing in a very alarming way:

 

Please see the following log file: https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_windows_64_2k8_debug_compile_81185a50aeed5b2beed2c0a81b381a482489fdb7_16_08_02_20_24_46/0?type=T

 

The log lines of interest are:

 

[2016/08/02 17:31:09.642] Retrieved `build\cached\mongo\base\data_type_terminated_test.obj' from cache

 

Here, we see that we retrieved this .obj file from the cache. Nine seconds later, we try to use that object in a link step:

 

[2016/08/02 17:31:18.921] link /nologo /DEBUG /INCREMENTAL:NO /LARGEADDRESSAWARE /OPT:REF /OUT:build\cached\mongo\base\base_test.exe build\cached\mongo\base\data_range.obj ...   build\cached\mongo\base\data_type_terminated_test.obj ...

 

The link fails, claiming that the data_type_terminated_test.obj file cannot be opened:

 

[2016/08/02 17:31:20.363] LINK : fatal error LNK1104: cannot open file 'build\cached\mongo\base\data_type_terminated_test.obj'

 [2016/08/02 17:31:20.506] scons: *** [build\cached\mongo\base\base_test.exe] Error 1104

 

We are using a vendored copy of SCons 2.5.0. The only modification is this:

 

https://github.com/mongodb/mongo/commit/bc7e4e6821639ee766ada83483975668af98f367#diff-cc7aec1739634ca2a857a4d4227663aa

 

This change was made so that the atime of files in the cache is fine-grained accurate, even if the underlying filesystem is mounted noatime or relatime, so that we can prune the cache based on access time. We would like to propose this change to be upstreamed, but that is a separate email.
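The idea behind the patch can be reduced to a short sketch (our hypothetical distillation, not the actual diff): on every cache retrieval, explicitly bump the entry's access time while preserving its modification time, so atime-based pruning works even on filesystems mounted noatime or relatime:

```python
import os
import time

def mark_cache_entry_used(path):
    # Explicitly set the access time to "now" while preserving the
    # modification time, so LRU-style cache pruning keyed on atime
    # works even when the filesystem never updates atime on reads.
    st = os.stat(path)
    os.utime(path, (time.time(), st.st_mtime))
```

The function name is illustrative; the real change hooks into the cache-hit path of the vendored SCons.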

 

SCons was invoked as follows from within an SSH session into cygwin (you can see it at the top of the build log as well):

 

python ./buildscripts/scons.py --dbg=on --opt=on --win-version-min=ws08r2 -j$(( $(grep -c ^processor /proc/cpuinfo) / 2 )) MONGO_DISTMOD=2008plus --cache --cache-dir='z:\data\scons-cache\9d73adcd-19eb-46f2-9988-b8594ba5a3d1' --use-new-tools all dist dist-debugsymbols distsrc-zip  MONGO_VERSION=3.3.10-250-g81185a5

 

The 'python' here is Windows python, not cygwin, and PyWin32 is installed.

 

The system on which this build ran is running Windows 2012 on a dedicated spot AWS c3.4xlarge instance, and the toolchain is Visual Studio 2015. The Z drive, where the cache directory is located, is locally attached NTFS via AWS ephemeral/instance storage.

 

We have since backed out using the Cache on our Windows builds, which is disappointing - Windows builds take forever compared to others, and we were really hoping that CacheDir would be a big help here.

 

Has anyone seen anything like this, or does anyone have ideas about what may be going wrong here? I know there have been some other recent threads about problems with Windows and build ordering, but this seems different: the retrieval of the file from the cache was correctly ordered, but it doesn't appear to have been effective.

 

I'm happy to provide any additional information if it will help us get Windows CacheDir enabled builds working.

 

Thanks,

Andrew

 

 


_______________________________________________
Scons-users mailing list
Scons-users at scons.org
https://pairlist4.pair.net/mailman/listinfo/scons-users

 



 



 


