[Scons-users] CacheDir race during parallel Windows builds?
Andrew C. Morrow
andrew.c.morrow at gmail.com
Wed Aug 10 14:07:18 EDT 2016
Hi All -
Our resident Windows expert reached the following conclusions:
Summary: It is a Python bug in shutil.copy2
The issue:
When link.exe opens an .obj file, it gets the following error:
LINK : fatal error LNK1104: cannot open file 'build\cached\mongo\tools\mongobridge_options_init.obj'
To diagnose these errors, I enabled ETW tracing with "FileIO stackwalk for
FileCreate+FileCleanup+FileClose"
<https://msdn.microsoft.com/en-us/library/windows/desktop/aa964768(v=vs.85).aspx>
<https://msdn.microsoft.com/en-us/library/windows/desktop/aa964773(v=vs.85).aspx>
and cranked through WPA with the data on Windows 2008 R2 and Windows 2012 R2.
I used 2012 R2 because that OS offers stack traces.
By tracing the build, I can see that link.exe has a call to CreateFile fail
with
"A file cannot be opened because the share access flags are
incompatible. (0xc0000043)"
This occurs because link.exe asked for the file with the following flags
"file_open synchronous_io_nonalert non_directory_file shareRead", and another
process had an existing handle to the file with "file_overwrite_if
synchronous_io_nonalert non_directory_file normal shareRead shareWrite".
The existing process that had a handle to the file was none other than
"python.exe", which created the file originally in copy2 but did not close
it. I compared normal cases, and there it does successfully close the file. I
do not know why it fails roughly 1 in 100 or 1 in 200 times.
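
For readers less familiar with Windows share modes, the conflict above can be sketched with a small Python model. This is a simplified illustration, not Windows code: a new open succeeds only if the access it requests is permitted by the existing handle's share mode, and the existing handle's access is permitted by the new open's share mode. The names Handle and can_open are hypothetical, delete access is ignored, and the access rights are assumptions, since the trace lists only the share flags: write access for python.exe (it was writing the copy) and read access for link.exe.

```python
from dataclasses import dataclass

READ, WRITE = "read", "write"


@dataclass(frozen=True)
class Handle:
    """Hypothetical model of one open handle on a file."""
    access: frozenset  # access rights this opener asked for (assumed)
    share: frozenset   # access rights this opener tolerates in others


def can_open(existing: Handle, request: Handle) -> bool:
    """Simplified sharing check between one existing handle and a new open."""
    # The new open's access must be allowed by the existing share mode,
    # and the existing handle's access must be allowed by the new share mode.
    return request.access <= existing.share and existing.access <= request.share


# python.exe while copying: assumed write access, shareRead + shareWrite.
python_copy = Handle(frozenset({WRITE}), frozenset({READ, WRITE}))
# link.exe: assumed read access, shareRead only.
linker = Handle(frozenset({READ}), frozenset({READ}))
# A read-only handle left open in another process, for comparison.
reader = Handle(frozenset({READ}), frozenset({READ, WRITE}))

print(can_open(python_copy, linker))  # leftover write handle vs. the linker
print(can_open(reader, linker))       # leftover read-only handle vs. the linker
```

Under these assumptions the leftover write handle blocks the linker, while a leftover read-only handle does not, which is consistent with the "r" versus "a"/"r+" experiment quoted further down.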
The workaround is win32file.CopyFile
<http://timgolden.me.uk/python/win32_how_do_i/copy-a-file.html>.
I patched FS.py with the following and it worked fine:
    win32file.CopyFile(src, dst, 1)
    return True
I can confirm that the issue no longer reproduces for me with the following
change to FS.py:
https://github.com/tychoish/mongo/commit/c8450fb4d304b2de06ba968b71f6efacd3b5214e
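
For anyone who wants to try the same idea outside of SCons, here is a minimal sketch of the workaround as a standalone helper. It is not the contents of the commit above: the function name copy_for_cache is hypothetical, and the fallback to shutil.copy2 when pywin32 is missing is an assumption added so the snippet also runs on other platforms. Note that with the third argument set to 1, win32file.CopyFile fails if the destination already exists, whereas shutil.copy2 overwrites it.

```python
import os
import shutil
import tempfile

try:
    import win32file  # part of pywin32; normally only present on Windows
except ImportError:
    win32file = None


def copy_for_cache(src, dst):
    """Hypothetical helper: copy src to dst, preferring the Win32 CopyFile call."""
    if win32file is not None:
        # Third argument is bFailIfExists; 1 matches the patch quoted above.
        win32file.CopyFile(src, dst, 1)
    else:
        # Assumed fallback for platforms without pywin32.
        shutil.copy2(src, dst)
    return True


# Small usage demo on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "hello.obj")
    dst_path = os.path.join(tmp, "hello_cached.obj")
    with open(src_path, "w") as f:
        f.write("object file contents")
    copied_ok = copy_for_cache(src_path, dst_path)
    with open(dst_path) as f:
        copied_text = f.read()
```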
While I'd love to follow this deeper, debugging python's file system
internals on Windows is not something I can really invest time in right
now. We are most likely just going to make the above patch to our vendored
copy of SCons and continue.
Perhaps someone with more Python expertise would be interested in pursuing
this further? I can give very detailed reproduction instructions.
Thanks,
Andrew
On Mon, Aug 8, 2016 at 9:55 AM, Jason Kenny <dragon512 at live.com> wrote:
> I am curious on what you find, please let us know what you discover.
>
>
>
> I am thinking more and more that the linker issue is the Windows linker
> trying to lock the file in a way that prevents any file handles with write
> permission from being open on it. That is just my gut feeling.
>
>
>
> Jason
>
>
>
> *From:* Scons-users [mailto:scons-users-bounces at scons.org] *On Behalf Of *Andrew
> C. Morrow
> *Sent:* Monday, August 8, 2016 7:31 AM
> *To:* SCons users mailing list <scons-users at scons.org>
> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows builds?
>
>
>
>
>
>
>
> On Sun, Aug 7, 2016 at 6:35 PM, Jason Kenny <dragon512 at live.com> wrote:
>
>
>
> Hi,
>
>
>
> So let me go over what we know:
>
> 1) no cache and serial build -> worked
>
> 2) no cache and -j build -> Worked
>
> 3) cache and serial build -> Worked
>
> 4) cache and -j build -> Fail constantly
>
>
>
> Correct, with two caveats:
>
>
>
> 1) I've never actually attempted case 1, on any platform. I can, if you
> think it would provide any value, but I'm nearly certain that it works
> every time.
>
> 2) These are the results on Windows; we have so far never observed the
> case 4 errors on Linux, OS X, or Solaris.
>
>
>
>
>
>
>
> From this, the trigger would seem to be having the cache on together with a
> parallel build. My guess is that a thread was doing something with the file
> and the main thread was doing something else to have this happen.
>
> Then I did a simple test.
>
>
>
> Basically, I opened an object file I had just built manually in a different
> Python interactive shell. I opened it only as “r” and left it open.
>
> I could link the program in a different shell.
>
> If I opened the file with “a” or “r+” (anything with implied write), the
> program would not link, failing with “LINK : fatal error LNK1104: cannot
> open file 'hello.obj'”.
>
>
>
> I am guessing that the linker has some “exclusive” read mode set that
> fails if the object file is opened with a write mode. If I try to do this
> on Linux it looks like it works fine even if Python has a write handle
> open. Also if I do this with different processes on Windows it
> seems to be fine as well. I think the linker is locking the file while it
> does some work to prevent it from changing while it is busy making the PE
> format of the final output.
>
>
>
> Based on this I would suggest we have a race in SCons with CacheDir set, in
> which Python has a write-mode handle open on the object file that was not
> closed yet. I did this test on Windows 10 with VS 2015 (I tested Linux on
> the bash shell feature on Windows 10 and double-checked on Ubuntu in a
> VM). The race I would assume to be something with the actions running a
> link command while the main thread is doing something with that file. Or
> there is something else touching that file.
>
>
>
> I don’t know enough of the pathways with CacheDir at the moment to say
> what would be going on.
>
>
>
> Nor do I. I'm going to enlist the help of one of our local Windows experts
> to see if he can help with tooling that will show us exactly what the
> conflict is. I'll report back any findings.
>
>
>
>
>
>
>
> I don’t think Parts File tweaks would help much with solving this problem
> at the moment. Given 4) is the only time this happens, this *seems* to be
> a SCons issue.
>
>
>
> I agree that it appears to be, but until we have a root cause it is of
> course not possible to be sure.
>
>
>
> Thanks,
>
> Andrew
>
>
>
> _______________________________________________
> Scons-users mailing list
> Scons-users at scons.org
> https://pairlist4.pair.net/mailman/listinfo/scons-users
>
>