[Scons-users] CacheDir race during parallel Windows builds?

Andrew C. Morrow andrew.c.morrow at gmail.com
Thu Aug 11 11:54:28 EDT 2016


Yes, definitely fine to merge it upstream and we would like that. However,
two caveats:

- This does eliminate the distinction between copy and copy2. Is that an
issue?
- I think we should wait to merge this until after we have rolled this out
to our CI system and let it burn in for a while, starting next week.

Thanks,
Andrew


On Wed, Aug 10, 2016 at 3:55 PM, Bill Deegan <bill at baddogconsulting.com>
wrote:

> Andrew,
>
> O.k. with pulling this change into the SCons repo?
>
> -Bill
>
> On Wed, Aug 10, 2016 at 2:07 PM, Andrew C. Morrow <
> andrew.c.morrow at gmail.com> wrote:
>
>>
>> Hi All -
>>
>> Our resident Windows expert reached the following conclusions:
>>
>> Summary: It is a python bug in shutil.copy2
>>
>> The issue:
>> When link.exe opens an .obj file, it gets the following error:
>> LINK : fatal error LNK1104: cannot open file
>> 'build\cached\mongo\tools\mongobridge_options_init.obj'
>>
>> To diagnose these errors, I enabled ETW tracing with "FileIO stackwalk
>> for FileCreate
>> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964768(v=vs.85).aspx>
>> +FileCleanup+FileClose"
>> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964773(v=vs.85).aspx> and
>> cranked through the data with WPA on Windows 2008 R2 and Windows 2012 R2.
>> The 2012 R2 OS offers stack traces, which is why I used it.
>>
>> By tracing the build, I can see that link.exe has a call to CreateFile
>> fail with
>> "A file cannot be opened because the share access flags are
>> incompatible. (0xc0000043)"
>>
>> This occurs because it asked for a file with the following flags: "file_open
>> synchronous_io_nonalert non_directory_file shareRead", and another
>> process had an existing handle to the file with "file_overwrite_if
>> synchronous_io_nonalert non_directory_file normal shareRead shareWrite".
>>
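The conflict described above can be modelled in a few lines. This is only a simplified sketch of the Windows sharing rules, not the kernel's actual logic: the `Handle` class and `can_open` function are hypothetical names, delete sharing is ignored, and it assumes the copy's handle had write access, which the trace's file_overwrite_if disposition suggests but does not state outright.

```python
from dataclasses import dataclass

READ, WRITE = "read", "write"


@dataclass(frozen=True)
class Handle:
    access: frozenset  # what this handle does to the file
    share: frozenset   # what this handle lets other handles do


def can_open(existing, request):
    """Return True if `request` is compatible with every existing handle."""
    for held in existing:
        # The new handle's access must be allowed by the holder's share mode...
        if not request.access <= held.share:
            return False
        # ...and the holder's access must be allowed by the new share mode.
        if not held.access <= request.share:
            return False
    return True


# python.exe: assumed to still hold the handle it used to write the .obj.
python_copy = Handle(access=frozenset({WRITE}), share=frozenset({READ, WRITE}))
# link.exe: wants to read and tolerates only other readers (shareRead).
linker = Handle(access=frozenset({READ}), share=frozenset({READ}))

print(can_open([python_copy], linker))  # False
print(can_open([], linker))             # True
```

Under this model the linker's request fails only while a handle with write access is still open, because the linker does not offer shareWrite.
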
>> The existing process that had a handle to the file was none other than
>> "python.exe", which created the file originally in copy2 but did not close
>> it. I compared normal cases, and there it does successfully close the file.
>> I do not know why it fails roughly 1 in 100 or 1 in 200 times.
>>
>> The workaround is win32file.CopyFile
>> <http://timgolden.me.uk/python/win32_how_do_i/copy-a-file.html>.
>>
>> I patched FS.py with the following and it worked fine:
>>         win32file.CopyFile(src, dst, 1)
>>         return True
>>
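For readers who want to try the workaround outside of FS.py, a minimal sketch might look like the following. The function name `copy_into_place` is hypothetical and this is not the actual SCons patch; it assumes the pywin32 package provides `win32file` on Windows and falls back to shutil.copy2 everywhere else.

```python
import shutil
import sys

try:
    import win32file  # from the pywin32 package; only available on Windows
except ImportError:
    win32file = None


def copy_into_place(src, dst):
    """Copy src to dst, preferring the Win32 CopyFile API when available."""
    if sys.platform == "win32" and win32file is not None:
        # Third argument is bFailIfExists; 1 matches the patch quoted above,
        # so the call raises if dst already exists.
        win32file.CopyFile(src, dst, 1)
    else:
        # Assumption: other platforms keep the existing copy2 behaviour.
        shutil.copy2(src, dst)
    return True
```

Because the Windows branch passes 1 for bFailIfExists, a caller that may overwrite an existing destination would need to remove it first or pass 0 instead.
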
>>
>> I can confirm that the issue no longer reproduces for me with the
>> following change to FS.py:
>>
>> https://github.com/tychoish/mongo/commit/c8450fb4d304b2de06ba968b71f6efacd3b5214e
>>
>> While I'd love to follow this deeper, debugging python's file system
>> internals on Windows is not something I can really invest time in right
>> now. We are most likely just going to make the above patch to our vendored
>> copy of SCons and continue.
>>
>> Perhaps someone with more Python expertise would be interested in
>> pursuing this further? I can give very detailed reproduction instructions.
>>
>> Thanks,
>> Andrew
>>
>>
>> On Mon, Aug 8, 2016 at 9:55 AM, Jason Kenny <dragon512 at live.com> wrote:
>>
>>> I am curious about what you find; please let us know what you discover.
>>>
>>>
>>>
>>> I am thinking more and more that the linker issue is the Windows linker
>>> trying to lock the file in a way that prevents any file handles with write
>>> permission from being open on it. That is just my gut feeling.
>>>
>>>
>>>
>>> Jason
>>>
>>>
>>>
>>> *From:* Scons-users [mailto:scons-users-bounces at scons.org] *On Behalf
>>> Of *Andrew C. Morrow
>>> *Sent:* Monday, August 8, 2016 7:31 AM
>>> *To:* SCons users mailing list <scons-users at scons.org>
>>> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows
>>> builds?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Aug 7, 2016 at 6:35 PM, Jason Kenny <dragon512 at live.com> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> So let me go over what we know:
>>>
>>> 1) no cache and serial build -> Worked
>>>
>>> 2) no cache and -j build -> Worked
>>>
>>> 3) cache and serial build -> Worked
>>>
>>> 4) cache and -j build -> Fails constantly
>>>
>>>
>>>
>>> Correct, with two caveats:
>>>
>>>
>>>
>>> 1) I've never actually attempted case 1, on any platform. I can, if you
>>> think it would provide any value, but I'm nearly certain that it works
>>> every time.
>>>
>>> 2) These are the results on Windows; we have so far never observed the
>>> case 4 errors on Linux, OS X, or Solaris.
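
The four cases above can be enumerated mechanically. The sketch below only builds the command lines and does not run them; the helper name `scons_command` and the job count of 8 are hypothetical, and it assumes the SConstruct already calls CacheDir(), so that the standard --cache-disable option is what turns caching off.

```python
import itertools


def scons_command(use_cache, jobs):
    """Build the scons argument list for one cell of the test matrix."""
    cmd = ["scons", "-j", str(jobs)]
    if not use_cache:
        # Ignore any CacheDir() configured in the SConstruct.
        cmd.append("--cache-disable")
    return cmd


PARALLEL_JOBS = 8  # hypothetical job count for the -j cases

# Order matches cases 1) to 4): no cache/serial, no cache/-j,
# cache/serial, cache/-j.
matrix = [
    scons_command(use_cache, jobs)
    for use_cache, jobs in itertools.product((False, True), (1, PARALLEL_JOBS))
]

for cmd in matrix:
    print(" ".join(cmd))
```
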
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From this it would seem to come down to having the cache on together
>>> with a parallel build. My guess is that a thread was doing something with
>>> the file while the main thread was doing something else to it.
>>>
>>> Then I did a simple test.
>>>
>>>
>>>
>>> Basically, I took an object file I had just built manually and opened it
>>> in a separate Python interactive shell. I opened it only as “r” and left
>>> it open.
>>>
>>> I could link the program in a different shell.
>>>
>>> If I opened the file with “a” or “r+” (anything with implied write),
>>> the program would not link, failing with “LINK : fatal error LNK1104:
>>> cannot open file 'hello.obj'”.
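
The experiment above can be scripted so that it is easy to repeat. This is a rough sketch rather than the procedure actually used: the helper name `run_while_held` is hypothetical, and it holds the file open from the same script that launches the command instead of from a separate interactive shell.

```python
import subprocess


def run_while_held(path, mode, command):
    """Run `command` while this process keeps `path` open in `mode`.

    Returns the command's exit code. The file handle is closed as soon as
    the command finishes.
    """
    with open(path, mode):
        return subprocess.call(command)
```

A hypothetical invocation would be `run_while_held("hello.obj", "r", ["link", "hello.obj"])`, then the same call with "r+" or "a" to compare exit codes. The exact link command line is an assumption and will depend on the toolchain setup.
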
>>>
>>>
>>>
>>> I am guessing that the linker has some “exclusive” read mode set that
>>> fails if the object file is open in a write mode. If I try this on
>>> Linux, it looks like it works fine even if Python has a write handle
>>> open. Also, if I do this with different processes on Windows, it seems
>>> to be fine as well. I think the linker is locking the file while it
>>> does some work, to prevent it from changing while it is busy producing
>>> the PE format of the final output.
>>>
>>>
>>>
>>> Based on this, I would suggest we have a race in SCons with CacheDir set
>>> in which Python has a write-mode handle open on the object file that has
>>> not been closed yet. I did this test on Windows 10 with VS 2015 (I tested
>>> Linux on the bash shell feature of Windows 10 and double-checked on Ubuntu
>>> in a VM). I would assume the race involves the actions running a link
>>> command while the main thread is doing something with that file. Or
>>> there is something else touching that file.
>>>
>>>
>>>
>>> I don’t know enough about the pathways with CacheDir at the moment to say
>>> what would be going on.
>>>
>>>
>>>
>>> Nor do I. I'm going to enlist the help of one of our local Windows
>>> experts to see if he can help with tooling that will show us exactly what
>>> the conflict is. I'll report back any findings.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I don’t think Parts File tweaks would help much with solving this
>>> problem at the moment. Given that 4) is the only time this happens, this
>>> *seems* to be a SCons issue.
>>>
>>>
>>>
>>> I agree that it appears to be, but until we have a root cause it is of
>>> course not possible to be sure.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>>
>>>
>>> _______________________________________________
>>> Scons-users mailing list
>>> Scons-users at scons.org
>>> https://pairlist4.pair.net/mailman/listinfo/scons-users
>>>
>>>