[Scons-users] CacheDir race during parallel Windows builds?

William Blevins wblevins001 at gmail.com
Wed Aug 10 15:31:46 EDT 2016


Jason,

I assume you would be replacing or merging into SCons.Platform.win32.py
<https://bitbucket.org/scons/scons/src/4e1b77a684f40c9df0d66bd2da415815c436f6e3/src/engine/SCons/Platform/win32.py?at=default&fileviewer=file-view-default>

V/R,
William

On Wed, Aug 10, 2016 at 7:22 PM, Jason Kenny <dragon512 at live.com> wrote:

> Thanks Andrew,
>
>
>
> This is what I sort of expected. There was a permission issue. It looks
> like something a classic C program cannot access when using the classic C API.
>
>
>
> Dirk:
>
> I was thinking I could move my override file in
> https://bitbucket.org/sconsparts/parts/src/3a389f774f234694994071d784af88c3babaad03/parts/overrides/os_file.py?at=master&fileviewer=file-view-default
> into SCons, perhaps to override basic file IO in Python, and add some
> more cases so that SCons can do its file-based IO through Win32. It
> looks like this would help solve this issue and other file-IO-related issues
> on Windows.
>
>
>
> Any thoughts on the best place to put such a file in SCons?
>
>
>
> Jason
>
>
>
> *From:* Scons-users [mailto:scons-users-bounces at scons.org] *On Behalf Of *Andrew C. Morrow
> *Sent:* Wednesday, August 10, 2016 1:07 PM
> *To:* SCons users mailing list <scons-users at scons.org>
> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows builds?
>
>
>
>
>
> Hi All -
>
>
>
> Our resident Windows expert reached the following conclusions:
>
>
>
> Summary: It is a Python bug in shutil.copy2.
>
>
>
> The issue:
>
> When link.exe opens an .obj file, it gets the following error:
>
> LINK : fatal error LNK1104: cannot open file 'build\cached\mongo\tools\mongobridge_options_init.obj'
>
>
>
> To diagnose these errors, I enabled ETW tracing with "FileIO stackwalk for
> FileCreate+FileCleanup+FileClose" (see
> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964768(v=vs.85).aspx>
> and
> <https://msdn.microsoft.com/en-us/library/windows/desktop/aa964773(v=vs.85).aspx>)
> and analyzed the data in WPA on Windows 2008 R2 and Windows 2012 R2. The
> 2012 R2 OS provides stack traces, which is why I used it.
>
>
>
> By tracing the build, I can see that link.exe has a call to CreateFile
> fail with
>
> "A file cannot be opened because the share access flags are
> incompatible. (0xc0000043)"
>
>
>
> This occurs because link.exe asked for the file with the flags "file_open
> synchronous_io_nonalert non_directory_file shareRead", while another
> process already had a handle to the file with "file_overwrite_if
> synchronous_io_nonalert non_directory_file normal shareRead shareWrite".
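To make the flag conflict concrete, here is a small Python model of the Windows share-mode check. It is a simplification that only tracks read and write (delete access and other details are ignored), and it assumes the python.exe handle held write access, which file_overwrite_if implies but the trace line does not list explicitly.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Simplified model: access rights and share flags are plain sets of strings.
READ, WRITE = "read", "write"


@dataclass(frozen=True)
class Handle:
    access: FrozenSet[str]  # rights this handle was opened with
    share: FrozenSet[str]   # rights this opener allows *other* handles to have


def open_is_compatible(existing: List[Handle], request: Handle) -> bool:
    """Return True if `request` could be opened alongside `existing` handles."""
    for h in existing:
        # The new caller's access must be permitted by every existing share mode...
        if not request.access <= h.share:
            return False
        # ...and every existing handle's access must be permitted by the new share mode.
        if not h.access <= request.share:
            return False
    return True


# python.exe's handle from copy2: opened for writing, sharing read and write.
python_handle = Handle(access=frozenset({WRITE}), share=frozenset({READ, WRITE}))
# link.exe's request: read access, sharing read only ("shareRead").
link_request = Handle(access=frozenset({READ}), share=frozenset({READ}))

print(open_is_compatible([python_handle], link_request))  # False: sharing violation
print(open_is_compatible([], link_request))               # True: nothing holds the file
```

Under the same model, a handle opened read-only with shareRead|shareWrite (assumed here to be what Python's default open uses on Windows) does not conflict with link.exe's request, while one opened for reading and writing does. That matches the "r" versus "r+" experiment reported later in this thread.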
>
>
>
> The process holding that handle was none other than
> "python.exe", which had originally created the file in copy2 but did not close
> it. I compared this with normal cases, where it does successfully close the file. I do
> not know why it fails roughly 1 in 100 to 1 in 200 times.
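For reference, here is a simplified sketch of the steps shutil.copy2 performs. It is not CPython's actual code (newer versions add platform-specific fast paths), and copy2_sketch is a hypothetical name. The point it illustrates is that dst should only be open for writing inside one short block, which is why a handle still open after copy2 returns is surprising.

```python
import os
import shutil
import tempfile


def copy2_sketch(src: str, dst: str) -> str:
    """Simplified stand-in for shutil.copy2; not CPython's actual code."""
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    # dst is open for writing only inside this `with` block; leaving the block
    # closes both files, even if copyfileobj raises.
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)
    # Metadata (permission bits, timestamps) is applied afterwards, by path.
    shutil.copystat(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    obj = os.path.join(tmp, "hello.obj")
    with open(obj, "wb") as f:
        f.write(b"object code")
    cached = copy2_sketch(obj, os.path.join(tmp, "cached.obj"))
    with open(cached, "rb") as f:
        print(f.read())  # b'object code'
```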
>
>
>
> The workaround is win32file.CopyFile
> <http://timgolden.me.uk/python/win32_how_do_i/copy-a-file.html>.
>
>
>
> I patched FS.py with the following and it worked fine:
>
>     win32file.CopyFile(src, dst, 1)
>     return True
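A minimal sketch of that workaround as a standalone helper, assuming pywin32 may or may not be installed. copy_file is a hypothetical name; this is not the code from the mongo commit referenced in this message and not an SCons API. On platforms without win32file it falls back to shutil.copy2.

```python
import os
import shutil
import tempfile

try:
    import win32file  # pywin32; normally only importable on Windows
except ImportError:
    win32file = None


def copy_file(src: str, dst: str) -> bool:
    """Hypothetical helper: prefer Win32 CopyFile, else fall back to copy2."""
    if win32file is not None:
        # The whole copy happens inside one CopyFile call, so no Python-level
        # file object on dst is ever created. The third argument is
        # bFailIfExists: 0 lets an existing dst be overwritten (the patch
        # quoted in this message passes 1, which makes the call fail instead).
        win32file.CopyFile(src, dst, 0)
    else:
        shutil.copy2(src, dst)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.obj")
    with open(src, "wb") as f:
        f.write(b"object code")
    copy_file(src, os.path.join(tmp, "cached.obj"))
```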
>
>
>
> I can confirm that the issue no-longer reproduces for me with the
> following change to FS.py:
>
>
>
> https://github.com/tychoish/mongo/commit/c8450fb4d304b2de06ba968b71f6efacd3b5214e
>
>
>
> While I'd love to follow this deeper, debugging python's file system
> internals on Windows is not something I can really invest time in right
> now. We are most likely just going to make the above patch to our vendored
> copy of SCons and continue.
>
>
>
> Perhaps someone with more Python expertise would be interested in pursuing
> this further? I can give very detailed reproduction instructions.
>
>
>
> Thanks,
>
> Andrew
>
>
>
>
>
> On Mon, Aug 8, 2016 at 9:55 AM, Jason Kenny <dragon512 at live.com> wrote:
>
> I am curious about what you find; please let us know what you discover.
>
>
>
> I am thinking more and more that the linker issue is the Windows linker
> locking the file in a way that prevents any file handle with write permission
> from being open on it. That is just my gut feeling.
>
>
>
> Jason
>
>
>
> *From:* Scons-users [mailto:scons-users-bounces at scons.org] *On Behalf Of *Andrew C. Morrow
> *Sent:* Monday, August 8, 2016 7:31 AM
> *To:* SCons users mailing list <scons-users at scons.org>
> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows builds?
>
>
>
>
>
>
>
> On Sun, Aug 7, 2016 at 6:35 PM, Jason Kenny <dragon512 at live.com> wrote:
>
>
>
> Hi,
>
>
>
> So let me go over what we know:
>
> 1) No cache and serial build -> Worked
>
> 2) No cache and -j build -> Worked
>
> 3) Cache and serial build -> Worked
>
> 4) Cache and -j build -> Fails constantly
>
>
>
> Correct, with two caveats:
>
>
>
> 1) I've never actually attempted case 1, on any platform. I can, if you
> think it would provide any value, but I'm nearly certain that it works
> every time.
>
> 2) These are the results on Windows; we have so far never observed the
> case 4 errors on Linux, OS X, or Solaris.
>
>
>
>
>
>
>
> From this, the failure seems to require both the cache being on and a
> parallel build. My guess is that one thread was doing something with the
> file while the main thread was doing something else with it.
>
> Then I did a simple test.
>
>
>
> Basically, I manually opened an object file I had just built in a separate
> Python interactive shell, opened it only as “r”, and left it open.
>
> I could still link the program in a different shell.
>
> If I opened the file with “a” or “r+” (anything with implied write), the
> program would not link, failing with “LINK : fatal error LNK1104: cannot open file
> 'hello.obj'”.
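Jason's manual experiment can be scripted roughly as follows. link_while_held is a hypothetical helper; because link.exe is Windows-only, the demo substitutes a child Python process that just reads the file. Passing a real link command on Windows (for example a hypothetical ["link", "hello.obj"]) is what would show whether LNK1104 appears for each mode.

```python
import os
import subprocess
import sys
import tempfile


def link_while_held(obj_path: str, mode: str, command: list) -> int:
    """Keep obj_path open in `mode` while `command` runs; return its exit code."""
    with open(obj_path, mode):  # the handle stays open until the command exits
        return subprocess.call(command)


# Portable stand-in for link.exe: a child Python process that reads the file.
with tempfile.TemporaryDirectory() as tmp:
    obj = os.path.join(tmp, "hello.obj")
    with open(obj, "wb") as f:
        f.write(b"object code")
    reader = [sys.executable, "-c", "import sys; open(sys.argv[1], 'rb').read()", obj]
    for mode in ("r", "r+", "a"):
        print(mode, link_while_held(obj, mode, reader))
```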
>
>
>
> I am guessing that the linker sets some “exclusive” read mode that
> fails if the object file is opened with a write mode. If I try this on
> Linux, it works fine even if Python has a write handle open. Also, if I
> do this with different processes on Windows, it seems to be fine as well.
> I think the linker is locking the file while it does some work, to
> prevent it from changing while it is busy producing the PE format of the
> final output.
>
>
>
> Based on this, I would suggest we have a race in SCons with CacheDir set, in
> which Python has a write-mode handle open on the object file that has not
> been closed yet. I did this test on Windows 10 with VS 2015 (I tested Linux
> using the Bash shell feature on Windows 10 and double-checked on Ubuntu in
> a VM). I would assume the race involves an action running a link command
> while the main thread is doing something with that file, or something else
> touching that file.
>
>
>
> I don’t know enough about the CacheDir code paths at the moment to say
> what would be going on.
>
>
>
> Nor do I. I'm going to enlist the help of one of our local Windows experts
> to see if he can help with tooling that will show us exactly what the
> conflict is. I'll report back any findings.
>
>
>
>
>
>
>
> I don’t think Parts file tweaks would help much with solving this problem
> at the moment. Given that 4) is the only case where this happens, this
> *seems* to be a SCons issue.
>
>
>
> I agree that it appears to be, but until we have a root cause it is of
> course not possible to be sure.
>
>
>
> Thanks,
>
> Andrew
>
>
>
>
> _______________________________________________
> Scons-users mailing list
> Scons-users at scons.org
> https://pairlist4.pair.net/mailman/listinfo/scons-users
>