[Scons-users] CacheDir race during parallel Windows builds?

William Blevins wblevins001 at gmail.com
Wed Aug 10 14:20:21 EDT 2016


Andrew,

Great news! Impressive that you were able to track that down (with or
without help)!

V/R.
William

On Wed, Aug 10, 2016 at 7:07 PM, Andrew C. Morrow <andrew.c.morrow at gmail.com
> wrote:

>
> Hi All -
>
> Our resident Windows expert reached the following conclusions:
>
> Summary: It is a python bug in shutil.copy2
>
> The issue:
> When link.exe opens an .obj file, it gets the following error:
> LINK : fatal error LNK1104: cannot open file
> 'build\cached\mongo\tools\mongobridge_options_init.obj'
>
> To diagnose these errors, I enabled ETW tracing with "FileIO stackwalk for
> FileCreate+FileCleanup+FileClose" and cranked through the data in WPA on
> Windows 2008 R2 and Windows 2012 R2. I used 2012 R2 because it offers
> stack traces.
> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964768(v=vs.85).aspx>
> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964773(v=vs.85).aspx>
>
> By tracing the build, I can see that link.exe has a call to CreateFile
> fail with
> "A file cannot be opened because the share access flags are
> incompatible. (0xc0000043)"
>
> This occurs because link.exe asked for the file with the flags "file_open
> synchronous_io_nonalert non_directory_file shareRead", while another
> process had an existing handle to the file opened with "file_overwrite_if
> synchronous_io_nonalert non_directory_file normal shareRead shareWrite".
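The share-flag conflict described in the trace can be modeled in a few lines. This is only a simplified sketch of the Windows sharing check, not code from SCons or Python: an open succeeds only if the access it requests is allowed by the share mode of the existing handle, and the access held by the existing handle is allowed by the share mode of the new open. The function name and the "read"/"write" labels are illustrative, delete sharing is ignored, and it is assumed that the python.exe handle carries write access because it was used to create the file.

```python
# Simplified model of the Windows sharing check applied when a second handle
# is opened on a file. All names are illustrative; delete access is ignored.

def opens_are_compatible(existing_access, existing_share, new_access, new_share):
    """Return True if a new open can coexist with an existing handle.

    Both directions must hold:
      * the access the new open requests is allowed by the existing share mode
      * the access the existing handle holds is allowed by the new share mode
    """
    return (set(new_access) <= set(existing_share)
            and set(existing_access) <= set(new_share))


# Assumption: python.exe holds write access and shares read + write.
python_access, python_share = {"write"}, {"read", "write"}
# link.exe asks to read and is willing to share read only.
link_access, link_share = {"read"}, {"read"}

# The leftover write handle is not covered by link.exe's shareRead,
# so the check fails and the open is refused.
print(opens_are_compatible(python_access, python_share, link_access, link_share))
```

Under this model a leftover read-only handle would not block link.exe, while any leftover handle with write access would.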
>
> The existing process that had a handle to the file was none other than
> "python.exe", which created the file originally in copy2 but did not close
> it. I compared normal cases, and there it does successfully close the file. I
> do not know why it fails roughly 1 in 100 or 1 in 200 times.
>
> The workaround is win32file.CopyFile
> <http://timgolden.me.uk/python/win32_how_do_i/copy-a-file.html>.
>
> I patched FS.py with the following and it worked fine:
> win32file.CopyFile(src, dst, 1)
>         return True
>
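For readers who want to try the workaround outside of SCons, here is a minimal sketch of a copy helper built on the same idea. It is not the actual FS.py patch: the function name `copy_for_cache` is hypothetical, and falling back to shutil.copy2 when pywin32 is missing or the platform is not Windows is an assumption made so the snippet runs everywhere.

```python
import shutil
import sys

try:
    import win32file  # provided by the pywin32 package, Windows only
except ImportError:
    win32file = None


def copy_for_cache(src, dst):
    """Copy src to dst, preferring the Win32 CopyFile call when available.

    Hypothetical helper for illustration; not the SCons implementation.
    """
    if sys.platform == "win32" and win32file is not None:
        # Third argument is bFailIfExists: 1 means the call raises if dst
        # already exists, matching the call quoted in the patch above.
        win32file.CopyFile(src, dst, 1)
    else:
        # Assumed fallback for other platforms or when pywin32 is missing.
        shutil.copy2(src, dst)
    return True
```

With the Win32 call the copy is done by the operating system in a single API call, so Python never holds its own file object on the destination.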
>
> I can confirm that the issue no longer reproduces for me with the
> following change to FS.py:
>
> https://github.com/tychoish/mongo/commit/c8450fb4d304b2de06ba968b71f6efacd3b5214e
>
> While I'd love to follow this deeper, debugging python's file system
> internals on Windows is not something I can really invest time in right
> now. We are most likely just going to make the above patch to our vendored
> copy of SCons and continue.
>
> Perhaps someone with more Python expertise would be interested in pursuing
> this further? I can give very detailed reproduction instructions.
>
> Thanks,
> Andrew
>
>
> On Mon, Aug 8, 2016 at 9:55 AM, Jason Kenny <dragon512 at live.com> wrote:
>
>> I am curious about what you find; please let us know what you discover.
>>
>>
>>
>> I am thinking more and more that the linker issue is the Windows linker
>> trying to lock the file in a way that prevents any handle with write
>> permission from being open on it. That is just my gut feeling.
>>
>>
>>
>> Jason
>>
>>
>>
>> *From:* Scons-users [mailto:scons-users-bounces at scons.org] *On Behalf Of
>> *Andrew C. Morrow
>> *Sent:* Monday, August 8, 2016 7:31 AM
>> *To:* SCons users mailing list <scons-users at scons.org>
>> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows
>> builds?
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Aug 7, 2016 at 6:35 PM, Jason Kenny <dragon512 at live.com> wrote:
>>
>>
>>
>> Hi,
>>
>>
>>
>> So let me go over what we know:
>>
>> 1) no cache and serial build -> worked
>>
>> 2) no cache and -j build -> Worked
>>
>> 3) cache and serial build -> Worked
>>
>> 4) cache and -j build -> Fails consistently
>>
>>
>>
>> Correct, with two caveats:
>>
>>
>>
>> 1) I've never actually attempted case 1, on any platform. I can, if you
>> think it would provide any value, but I'm nearly certain that it works
>> every time.
>>
>> 2) These are the results on Windows; we have so far never observed the
>> case 4 errors on Linux, OS X, or Solaris.
>>
>>
>>
>>
>>
>>
>>
>> From this it would seem that the trigger is having the cache on together
>> with a parallel build. My guess is that a thread was doing something with
>> the file while the main thread was doing something else with it.
>>
>> Then I did a simple test.
>>
>>
>>
>> Basically, I opened an object file I had just built manually in a
>> different Python interactive shell. I opened it only as “r” and left it open.
>>
>> I could link the program in a different shell.
>>
>> If I opened the file with “a” or “r+” (anything with implied write),
>> the program would not link, failing with “LINK : fatal error LNK1104: cannot
>> open file 'hello.obj'”.
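The manual experiment above can be scripted. The sketch below keeps a file open in a chosen mode while another command runs, then reports that command's exit code. The helper name `run_while_held` is hypothetical, and the link command shown in the trailing comment is only an assumed stand-in for whatever link line the build really uses.

```python
import subprocess


def run_while_held(path, mode, command):
    """Run command while this process keeps path open in the given mode.

    Returns the exit code of the command. The file is closed again as soon
    as the command has finished.
    """
    with open(path, mode):
        return subprocess.call(command)


# Illustrative use on Windows (the command line is an assumption):
#   run_while_held("hello.obj", "r", ["link", "hello.obj"])    # read-only handle
#   run_while_held("hello.obj", "r+", ["link", "hello.obj"])   # handle with write access
```

Comparing the exit codes for "r" against "a" or "r+" should show whether the open handle's write access is what makes the link step fail.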
>>
>>
>>
>> I am guessing that the linker has some “exclusive” read mode set that
>> fails if the object file is opened with a write mode. If I try this on
>> Linux, it looks like it works fine even if Python has a write handle
>> open. Also, if I do this with different processes on Windows it seems to
>> be fine as well. I think the linker is locking the file while it does
>> some work, to prevent it from changing while it is busy making the PE
>> format of the final output.
>>
>>
>>
>> Based on this I would suggest we have a race in SCons with CacheDir set,
>> in which Python has a write-mode handle open on the object file that has
>> not been closed yet. I did this test on Windows 10 with VS 2015 (I tested
>> Linux on the bash shell feature on Windows 10 and double-checked on Ubuntu
>> in a VM). I would assume the race involves the actions running a link
>> command while the main thread is doing something with that file. Or
>> there is something else touching that file.
>>
>>
>>
>> I don’t know enough about the pathways with CacheDir at the moment to say
>> what would be going on.
>>
>>
>>
>> Nor do I. I'm going to enlist the help of one of our local Windows
>> experts to see if he can help with tooling that will show us exactly what
>> the conflict is. I'll report back any findings.
>>
>>
>>
>>
>>
>>
>>
>> I don’t think Parts File tweaks would help much with solving this problem
>> at the moment. Given that 4) is the only case where this happens, this
>> *seems* to be a SCons issue.
>>
>>
>>
>> I agree that it appears to be, but until we have a root cause it is of
>> course not possible to be sure.
>>
>>
>>
>> Thanks,
>>
>> Andrew
>>
>>
>>
>> _______________________________________________
>> Scons-users mailing list
>> Scons-users at scons.org
>> https://pairlist4.pair.net/mailman/listinfo/scons-users
>>
>>
>