[Scons-users] CacheDir race during parallel Windows builds?
Jason Kenny
dragon512 at live.com
Wed Aug 10 14:22:38 EDT 2016
Thanks Andrew,
This is what I sort of expected. There was a permission issue. It looks like something a classic C program cannot access using the classic C API.
Dirk:
I was thinking I could move my override file at https://bitbucket.org/sconsparts/parts/src/3a389f774f234694994071d784af88c3babaad03/parts/overrides/os_file.py?at=master&fileviewer=file-view-default to override basic file I/O in Python, and maybe add some more cases to allow SCons to call file-based I/O through Win32. It looks like this would help solve this issue and other file I/O related issues on Windows.
Any thoughts on the best place to put such a file in SCons?
Jason
From: Scons-users [mailto:scons-users-bounces at scons.org] On Behalf Of Andrew C. Morrow
Sent: Wednesday, August 10, 2016 1:07 PM
To: SCons users mailing list <scons-users at scons.org>
Subject: Re: [Scons-users] CacheDir race during parallel Windows builds?
Hi All -
Our resident Windows expert reached the following conclusions:
Summary: It is a python bug in shutil.copy2
The issue:
When link.exe opens an .obj file, it gets the following error:
LINK : fatal error LNK1104: cannot open file 'build\cached\mongo\tools\mongobridge_options_init.obj'
To diagnose these errors, I enabled ETW tracing with "FileIO stackwalk for FileCreate <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964768(v=vs.85).aspx> + FileCleanup + FileClose <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964773(v=vs.85).aspx>" and cranked through WPA with the data on Windows 2008 R2 and Windows 2012 R2. The 2012 R2 OS offers stack traces, which is why I used it.
By tracing the build, I can see that link.exe has a call to CreateFile fail with
"A file cannot be opened because the share access flags are incompatible. (0xc0000043)"
This occurs because it asked for the file with the flags "file_open synchronous_io_nonalert non_directory_file shareRead", while another process had an existing handle to the file opened with "file_overwrite_if synchronous_io_nonalert non_directory_file normal shareRead shareWrite".
The existing process that had a handle to the file was none other than "python.exe", which created the file originally in copy2 but did not close it. I compared against normal cases, where it does successfully close the file. I do not know why it fails roughly 1 in 100 or 1 in 200 times.
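The share-flag conflict described above can be illustrated with a small, simplified model of the Windows sharing check. This is only a sketch: the names READ, WRITE and can_open are made up for illustration, and the access rights assumed for the python.exe handle (write access) are an assumption, since the trace only shows its share flags.

```python
# Simplified model of Windows file sharing rules (illustrative only).
# READ and WRITE are hypothetical bit flags, not real Win32 constants.
READ, WRITE = 1, 2


def can_open(existing, desired_access, share_mode):
    """Return True if a new open request is compatible with existing handles.

    existing: iterable of (access, share) pairs for handles already open.
    """
    for held_access, held_share in existing:
        # The access the new request wants must be shared by each open handle.
        if desired_access & ~held_share:
            return False
        # The access each open handle holds must be shared by the new request.
        if held_access & ~share_mode:
            return False
    return True


# Assumed: python.exe still holds a write handle sharing read and write.
python_writer = (WRITE, READ | WRITE)
# Assumed: a handle opened read-only, also sharing read and write.
python_reader = (READ, READ | WRITE)

# link.exe asks for read access and shares only read.
print(can_open([python_writer], READ, READ))  # leftover write handle
print(can_open([python_reader], READ, READ))  # leftover read-only handle
```

Under these assumptions, the request from link.exe is rejected only when the leftover handle has write access, because link.exe does not share write.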
The workaround is win32file.CopyFile <http://timgolden.me.uk/python/win32_how_do_i/copy-a-file.html>.
I patched FS.py with the following, and it worked fine:
win32file.CopyFile(src, dst, 1)
return True
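For readers who want to try the same idea outside FS.py, here is a minimal sketch of a copy helper that prefers win32file.CopyFile and falls back to shutil.copy2 when pywin32 is not installed. The function name copy_file and its fail_if_exists parameter are hypothetical; this is not the actual SCons patch.

```python
import errno
import os
import shutil

try:
    import win32file  # part of pywin32; only available on Windows
except ImportError:
    win32file = None


def copy_file(src, dst, fail_if_exists=True):
    """Copy src to dst, using the Win32 CopyFile API when it is available."""
    if win32file is not None:
        # Third argument: non-zero means fail if dst already exists.
        win32file.CopyFile(src, dst, 1 if fail_if_exists else 0)
    else:
        # Fallback for platforms without pywin32; mimic the exists check.
        if fail_if_exists and os.path.exists(dst):
            raise OSError(errno.EEXIST, "destination already exists", dst)
        shutil.copy2(src, dst)
    return True
```

The default of fail_if_exists=True mirrors the 1 passed in the patch above; whether that is the right behavior for every caller is a separate question.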
I can confirm that the issue no longer reproduces for me with the following change to FS.py:
https://github.com/tychoish/mongo/commit/c8450fb4d304b2de06ba968b71f6efacd3b5214e
While I'd love to follow this deeper, debugging python's file system internals on Windows is not something I can really invest time in right now. We are most likely just going to make the above patch to our vendored copy of SCons and continue.
Perhaps someone with more Python expertise would be interested in pursuing this further? I can give very detailed reproduction instructions.
Thanks,
Andrew
On Mon, Aug 8, 2016 at 9:55 AM, Jason Kenny <dragon512 at live.com> wrote:
I am curious about what you find, please let us know what you discover.
I am thinking more and more that the linker issue is the Windows linker trying to lock the file, which prevents any file handles with write permission from being open on it. That is just my gut feeling.
Jason
From: Scons-users [mailto:scons-users-bounces at scons.org] On Behalf Of Andrew C. Morrow
Sent: Monday, August 8, 2016 7:31 AM
To: SCons users mailing list <scons-users at scons.org>
Subject: Re: [Scons-users] CacheDir race during parallel Windows builds?
On Sun, Aug 7, 2016 at 6:35 PM, Jason Kenny <dragon512 at live.com> wrote:
Hi,
So let me go over what we know:
1) no cache and serial build -> worked
2) no cache and -j build -> Worked
3) cache and serial build -> Worked
4) cache and -j build -> Fail constantly
Correct, with two caveats:
1) I've never actually attempted case 1, on any platform. I can, if you think it would provide any value, but I'm nearly certain that it works every time.
2) These are the results on Windows; we have so far never observed the case 4 errors on Linux, OS X, or Solaris.
From this, it would seem to be caused by having the cache on together with a parallel build. My guess is that a thread was doing something with the file while the main thread was doing something else, causing this to happen.
Then I did a simple test.
Basically, I manually opened an object file I had just built in a separate Python interactive shell. I opened it only as “r” and left it open.
I could link the program in a different shell.
If I opened the file with “a” or “r+” (anything with implied write), the program would not link, failing with “LINK : fatal error LNK1104: cannot open file 'hello.obj'”.
I am guessing that the linker has some “exclusive” read mode set that fails if the object file is opened with a write mode. If I try this on Linux, it works fine even if Python has a write handle open. Also, if I do this with different processes on Windows it seems to be fine as well. I think the linker is locking the file while it does some work, to prevent it from changing while it is busy producing the PE format of the final output.
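The manual experiment above could be scripted roughly as follows. This is only a sketch: run_while_open is a hypothetical helper, and the command to run (for example a link.exe invocation on Windows) is left to the caller.

```python
import subprocess


def run_while_open(path, mode, command):
    """Hold path open in the given mode while command runs.

    Returns the exit code of command. On Windows, command would be the
    link step; a non-zero exit code with a write mode such as "r+" would
    match the LNK1104 behavior described above.
    """
    with open(path, mode):
        return subprocess.call(command)
```

Comparing the exit codes for mode "r" and mode "r+" against the same command would show whether an open write handle alone is enough to make the command fail.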
Based on this, I would suggest we have a race in SCons with CacheDir set, in which Python has a write-mode handle open on the object file that has not been closed yet. I did this test on Windows 10 with VS 2015 (I tested Linux on the bash shell feature on Windows 10 and double-checked on Ubuntu in a VM). I would assume the race is something to do with the actions running a link command while the main thread is doing something with that file. Or there is something else touching that file.
I don’t know enough about the pathways with CacheDir at the moment to say what would be going on.
Nor do I. I'm going to enlist the help of one of our local Windows experts to see if he can help with tooling that will show us exactly what the conflict is. I'll report back any findings.
I don’t think Parts file tweaks would help much with solving this problem at the moment. Given that 4) is the only time this happens, this seems to be a SCons issue.
I agree that it appears to be, but until we have a root cause it is of course not possible to be sure.
Thanks,
Andrew
_______________________________________________
Scons-users mailing list
Scons-users at scons.org
https://pairlist4.pair.net/mailman/listinfo/scons-users