[Scons-users] CacheDir race during parallel Windows builds?

William Blevins wblevins001 at gmail.com
Sun Aug 7 11:40:28 EDT 2016


Jason,

I was looking at SCons.Platform.win32.py, and it appears that SCons
replaces the default open (and file) on Windows with win32api-backed versions. Snippet from 2.5.0:

> try:
>     import msvcrt
>     import win32api
>     import win32con
>
>     msvcrt.get_osfhandle
>     win32api.SetHandleInformation
>     win32con.HANDLE_FLAG_INHERIT
> except ImportError:
>     parallel_msg = \
>         "you do not seem to have the pywin32 extensions installed;\n" + \
>         "\tparallel (-j) builds may not work reliably with open Python files."
> except AttributeError:
>     parallel_msg = \
>         "your pywin32 extensions do not support file handle operations;\n" + \
>         "\tparallel (-j) builds may not work reliably with open Python files."
> else:
>     parallel_msg = None
>
>     _builtin_file = file
>     _builtin_open = open
>
>     class _scons_file(_builtin_file):
>         def __init__(self, *args, **kw):
>             _builtin_file.__init__(self, *args, **kw)
>             win32api.SetHandleInformation(msvcrt.get_osfhandle(self.fileno()),
>                                           win32con.HANDLE_FLAG_INHERIT, 0)
>
>     def _scons_open(*args, **kw):
>         fp = _builtin_open(*args, **kw)
>         win32api.SetHandleInformation(msvcrt.get_osfhandle(fp.fileno()),
>                                       win32con.HANDLE_FLAG_INHERIT,
>                                       0)
>         return fp
>
>     file = _scons_file
>     open = _scons_open
>
>
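
For what it's worth, a quick way to check that the patched open() is actually
in effect is to inspect the handle flags after opening a file. A minimal,
untested sketch; it assumes pywin32 is installed and that win32api also
exposes GetHandleInformation:

    # Untested sketch: verify that the patched open() cleared
    # HANDLE_FLAG_INHERIT on the OS handle behind a Python file object.
    import msvcrt
    import win32api
    import win32con

    fp = open("SConstruct", "rb")  # any file; routes through _scons_open when patched
    osf_handle = msvcrt.get_osfhandle(fp.fileno())
    flags = win32api.GetHandleInformation(osf_handle)
    print("inheritable:", bool(flags & win32con.HANDLE_FLAG_INHERIT))
    fp.close()
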
V/R,
William

On Sun, Aug 7, 2016 at 3:48 PM, Andrew C. Morrow <andrew.c.morrow at gmail.com>
wrote:

>
> Hi Jason -
>
> Thanks for getting back to me. Comments below:
>
>
> On Sun, Aug 7, 2016 at 1:35 AM, Jason Kenny <dragon512 at live.com> wrote:
>
>>
>>
>> Sorry for the delay; I did not see this email, as it was moved to “junk” for
>> some reason.
>>
>>
>>
>> I do recall having issues with not being able to open a file on Windows.
>> There are a couple of reasons this happened:
>>
>>
>>
>> 1)      Antivirus software
>>
> There is none installed on the system beyond anything intrinsic to
> Windows, and if that were interfering, I'd expect a very low rate of
> occurrence, since you would have to be unlucky enough to touch a file at
> the same time as the AV scan. The rate of occurrence here is high (see
> below).
>
>
>
>> 2)      The search client in the background indexing the files
>>
> The failures are far too common for this to always be the cause;
> basically the same argument as above.
>
>
>
>> 3)      The way Python or other classic “C” programs open files on Windows
>>
> More on that below.
>
>
>
>
>>
>>
>> The last is the most important. In Parts I added support for long
>> path/file names (up to around 32K characters) and support for symlinks and
>> hardlinks. Honestly, a lot of the “ugly” win32 code in Parts exists to
>> support this. Some of it is also there to support Win2000/XP systems; that
>> code should be pulled out at this point. The core problem that I found, and
>> reported to some people at Microsoft, was the way fopen() opens a file on
>> Windows (it was being looked at; I don't know if it was fixed). It does not
>> have FILE_SHARE_WRITE and FILE_SHARE_DELETE set correctly. This creates a
>> case in which a process can open a file simply to read it and thereby
>> prevent other processes from writing to it. This prevents hardlinks from
>> working as they should on win32 systems.
>>
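>
> If I follow, the distinction is between CRT-style opens (fopen without the
> sharing flags) and an explicit CreateFile with all of the share flags set;
> something like this untested pywin32 sketch of the latter (the path is just
> an example):
>
>     # Untested sketch: open a file via the Win32 API with full sharing, so
>     # other processes can still read, write, or delete it while it is open.
>     import win32con
>     import win32file
>
>     share_all = (win32con.FILE_SHARE_READ |
>                  win32con.FILE_SHARE_WRITE |
>                  win32con.FILE_SHARE_DELETE)
>     handle = win32file.CreateFile(
>         r"build\cached\mongo\base\data_type_terminated_test.obj",
>         win32con.GENERIC_READ,
>         share_all,
>         None,                           # default security attributes
>         win32con.OPEN_EXISTING,
>         win32con.FILE_ATTRIBUTE_NORMAL,
>         None)                           # no template file
>     try:
>         data = win32file.ReadFile(handle, 4096)[1]
>     finally:
>         handle.Close()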
>
> I don't see how that could be a problem here. Once an object file has been
> copied out of the cache, nothing is trying to write to the copied file;
> only the linker is trying to read it. With the fix for the file handle leak
> applied, the only way this could happen would be if shutil.copy2 failed to
> close the file, but it too uses the with ... as pattern.
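>
> (To be concrete, the retrieval boils down to something like this simplified
> sketch, not the actual CacheDir code: copy the cached file into the variant
> dir and rely on shutil to close both handles before it returns.)
>
>     # Simplified sketch of a cache retrieval step (not the real SCons code):
>     # shutil.copy2 opens source and destination with "with ... as", so both
>     # handles are closed by the time it returns.
>     import shutil
>
>     def retrieve_from_cache(cachefile, targetfile):
>         shutil.copy2(cachefile, targetfile)  # copies data + metadata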
>
> I suppose the other way it could happen would be if SCons decided to try
> to start the link step while the copy of a needed object file out of the
> cache was still in progress, so that the file in the variant dir being
> copied to was still open for write. That would seem like the sort of error
> that could occur in a parallel build if somehow the signal that something
> had been extracted from the cache was emitted prematurely (like, when the
> copy started, rather than when it finished), allowing another thread to
> begin to execute the dependent link task too early.
>
> But then, if that was the case, why haven't we seen this on Linux builds?
>
>
>
>>
>> Anyway, to get all this to work I had to move everything to use the Win32
>> API for file handling. Doing this did seem to resolve a number of issues.
>> This is all (or mostly) done as a big monkey patch in
>> parts/overrides/os_file.py. In theory one *should* be able to just import
>> Parts in the main SConstruct and it should all work. (I admit I might be
>> wrong about some details for a raw SCons build.) But given that I fixed the
>> last known items, it should be simple to test whether this helps by just
>> putting “from parts import *” at the start of the SConstruct (and
>> installing Parts).
>>
>
> If the Python Windows file handling is buggy and renders SCons unreliable,
> shouldn't this part of Parts be upstreamed into SCons?
>
>
>
>>
>>
>> We also had random failures due to a case in which a tool (in our case
>> the Intel tools) would open a header for reading, and because it was
>> opened with fopen() and thus missed FILE_SHARE_DELETE, a hard link to that
>> file in some SDK area could not be deleted, which failed the build (as
>> SCons wants to delete targets before it runs the actions). We hacked some
>> fixes to prevent this from happening. I found that in general the tools
>> from Microsoft seem to open files correctly (via a win32 call). This
>> leaves Python as a likely cause. If I were to guess at what is happening,
>> there might be something going on with SCons doing some check on the obj
>> file with CacheDir on.
>>
>> If so, the Parts fixes might resolve it, if Python/SCons is holding a file
>> handle, given that we change the default file sharing used when opening a
>> file at the Python level.
>>
>
> I didn't really understand this part.
>
>
>
> Overall, the information about Parts is useful background, but adding yet
> another layer of complexity does not seem like a step in the right
> direction while trying to diagnose this. Adding Parts might indeed make the
> problem go away, but I still wouldn't know why it did so, or what the root
> cause was.
>
> I think it is important to find that out.
>
>
>
>>
>>
>> I would also find it interesting to know what happens if you rebuild
>> right after this failure. It was common for us to do an SCons -j32 … build
>> and then, if that failed, an SCons -k -j1 … to get a complete list of
>> failures to fix. We noticed on both Linux and Windows that this helped us
>> see failures that just went away on the second build, which gave us clues
>> about what was going on.
>>
>
> I do this frequently in my development workflow if I have, for instance, a
> very localized compile error that I know how to fix but want to see how the
> rest of the build goes, but it doesn't help for our CI builds. Our -j10
> builds on Windows already take upwards of 40 minutes, which is why we are
> so interested in CacheDir. Backing down to serial after a failure isn't an
> option. It would also be very confusing. Given a failed build step, how do
> I distinguish between a build that failed due to flaky build-system
> behavior (in which case I should fall back to serial) and a build that
> failed because of a legitimate compile error (in which case I should
> stop)? In the latter case, continuing with a -k build is just going to
> emit an astronomical logfile full of compiler errors and burn a ton of
> time and CPU.
>
>
>
>>
>> In this case, does the file link now that the failed SCons process has
>> exited, or is it still locked in the new SCons process?
>>
>
> If I repeatedly invoke SCons, each build gets some more targets linked.
> Eventually, I get to the end. The setup here is that the cache dir is
> populated, but all linked executables need to be rebuilt. Starting with an
> empty variant dir, the first run links many targets but eventually fails.
> Restarting the build while leaving the artifacts from run 1 in the variant
> dir makes further progress, but also eventually fails. After 5 or 6 cycles
> of this, enough targets have been built that the last run has only a
> little work left to do, and the build passes.
>
>
>> Does this happen all the time?
>>
>
> Yes, every time. Starting from the populated cache dir and empty
> variant_dir state, where all executables require a re-link, I have never
> observed a successful one-shot parallel build when CacheDir is enabled.
>
>
>
>> On the same file?
>>
>
> No, different files are seen as non-existing/unreadable by the linker each
> time. It is truly unfortunate that Microsoft in their wisdom failed to make
> the error reporting for LNK1104 (https://msdn.microsoft.com/en
> -us/library/ts7eyw4s.aspx) more detailed. As it is, there is no way to
> see whether the linker simply couldn't find the file because it didn't
> exist, or found it but was unable to read it due to either permissions or
> locking. I will try the above suggestion of Process Monitor to find out
> which is really happening.
>
>
>
>> And does it happen only on -j builds with j>1?
>>
>
> I'm only interested in -j>1 builds, really. As mentioned above, serial
> builds would take hours. However, in the interest of determining whether
> parallelism is at fault, I ran one. Again, this is starting with a
> populated cache dir and empty variant dir. Basically, all the .obj files
> are in the cache, but all of the executables need to be re-linked.
>
> With CacheDir enabled but without -jN, it worked in one shot.
>
> So we know that parallel builds without CacheDir work because we have been
> doing them forever. And we know that serial builds with CacheDir work,
> because I just did one. And we know that parallel builds with CacheDir
> work on Linux, because our developers have been doing it forever, and we
> turned it on for Linux builds in our CI system without incident.
>
> The problem, whatever it may be, appears to be limited to (or, perhaps, is
> only evident during) parallel CacheDir enabled builds on Windows.
>
>
>
>>
>>
>> I also recall us having issues with our CI systems, as they ran in VMs.
>> At times odd behavior would happen on a VM versus a bare-metal machine;
>> however, those issues seemed to show up randomly, not constantly.
>>
>
> Our entire CI system (and, increasingly, the rest of the world's) is
> oriented around AWS and other VM-ish providers, so I don't really think
> that advocating bare metal as a workaround is actionable. Note, however,
> that there is still randomness here: the problem occurs with different
> files/targets every time.
>
>
>
>>
>>
>> Hope this helps
>>
>
> I do appreciate your sharing your thoughts on this. We had been building
> up to large scale deployment of CacheDir in our CI system for some time,
> and with high hopes for dramatically improved build times, so there is
> definite frustration that CacheDir seems not to work on the platform where
> we needed it most.
>
> Thanks,
> Andrew
>
>
>
>>
>>
>> Jason
>>
>>
>>
>> *From:* Bill Deegan [mailto:bill at baddogconsulting.com]
>> *Sent:* Thursday, August 4, 2016 7:40 PM
>> *To:* SCons users mailing list <scons-users at scons.org>; Jason Kenny <
>> dragon512 at live.com>
>> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows
>> builds?
>>
>>
>>
>> Jason,
>>
>> I know you did a lot of work on Windows builds.
>>
>> Did you run into anything similar to this?
>>
>> I notice Parts pulls in win32's CopyFile API
>>
>> -Bill
>>
>>
>>
>> On Thu, Aug 4, 2016 at 10:20 AM, Andrew C. Morrow <
>> andrew.c.morrow at gmail.com> wrote:
>>
>>
>>
>> Hi -
>>
>>
>>
>> At MongoDB, we recently started using CacheDir in our CI system. This has
>> been a big success for reducing rebuild times for our Linux builds;
>> however, we were surprised to find that our Windows builds started failing
>> in a very alarming way:
>>
>>
>>
>> Please see the following log file:
>> https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_windows_64_2k8_debug_compile_81185a50aeed5b2beed2c0a81b381a482489fdb7_16_08_02_20_24_46/0?type=T
>>
>>
>>
>> The log lines of interest are:
>>
>>
>>
>> [2016/08/02 17:31:09.642] Retrieved `build\cached\mongo\base\data_type_terminated_test.obj'
>> from cache
>>
>>
>>
>> Here, we see that we retrieved this .obj file from the cache. Nine
>> seconds later, we try to use that object in a link step:
>>
>>
>>
>> [2016/08/02 17:31:18.921] link /nologo /DEBUG /INCREMENTAL:NO
>> /LARGEADDRESSAWARE /OPT:REF /OUT:build\cached\mongo\base\base_test.exe
>> build\cached\mongo\base\data_range.obj ...   build\cached\mongo\base\data_type_terminated_test.obj
>> ...
>>
>>
>>
>> The link fails, claiming that the data_type_terminated_test.obj file
>> cannot be opened:
>>
>>
>>
>> [2016/08/02 17:31:20.363] LINK : fatal error LNK1104: cannot open file
>> 'build\cached\mongo\base\data_type_terminated_test.obj'
>>
>>  [2016/08/02 17:31:20.506] scons: *** [build\cached\mongo\base\base_test.exe]
>> Error 1104
>>
>>
>>
>> We are using a vendored copy of SCons 2.5.0. The only modification is
>> this:
>>
>>
>>
>> https://github.com/mongodb/mongo/commit/bc7e4e6821639ee766ada83483975668af98f367#diff-cc7aec1739634ca2a857a4d4227663aa
>>
>>
>>
>> This change was made so that the atime of files in the cache is updated
>> with fine-grained accuracy, even if the underlying filesystem is mounted
>> noatime or relatime, so that we can prune the cache based on access time.
>> We would like to propose this change for upstreaming, but that is a
>> separate email.
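>>
>> The gist of the change is just to touch the access time explicitly
>> whenever a file is pulled from the cache, roughly like this simplified
>> sketch (not the exact diff; see the commit above for the real change):
>>
>>     # Simplified sketch: bump the atime of a cached file on retrieval so
>>     # that pruning by access time works even on noatime/relatime mounts.
>>     import os
>>     import time
>>
>>     def touch_atime(cachefile):
>>         st = os.stat(cachefile)
>>         os.utime(cachefile, (time.time(), st.st_mtime))  # new atime, keep mtime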
>>
>>
>>
>> SCons was invoked as follows from within an SSH session into Cygwin (you
>> can see it at the top of the build log as well):
>>
>>
>>
>> python ./buildscripts/scons.py --dbg=on --opt=on --win-version-min=ws08r2
>> -j$(( $(grep -c ^processor /proc/cpuinfo) / 2 )) MONGO_DISTMOD=2008plus
>> --cache --cache-dir='z:\data\scons-cache\9d73adcd-19eb-46f2-9988-b8594ba5a3d1'
>> --use-new-tools all dist dist-debugsymbols distsrc-zip
>>  MONGO_VERSION=3.3.10-250-g81185a5
>>
>>
>>
>> The 'python' here is the Windows Python, not Cygwin's, and PyWin32 is installed.
>>
>>
>>
>> The system on which this build ran is running Windows 2012 on a dedicated
>> spot AWS c3.4xlarge instance, and the toolchain is Visual Studio 2015. The
>> Z drive, where the cache directory is located, is locally attached NTFS
>> via AWS ephemeral/instance storage.
>>
>>
>>
>> We have since backed out the use of CacheDir on our Windows builds, which
>> is disappointing - Windows builds take forever compared to others, and we
>> were really hoping that CacheDir would be a big help here.
>>
>>
>>
>> Has anyone seen anything like this, or does anyone have ideas about what
>> may be going wrong here? I know there have been some other recent threads
>> about problems with Windows and build ordering, but this seems different -
>> the retrieval of the file from the cache was correctly ordered, but it
>> doesn't appear to have been effective.
>>
>>
>>
>> I'm happy to provide any additional information if it will help us get
>> Windows CacheDir-enabled builds working.
>>
>>
>>
>> Thanks,
>>
>> Andrew
>>
>>
>>
>>