[Scons-users] CacheDir race during parallel Windows builds?

Andrew C. Morrow andrew.c.morrow at gmail.com
Sun Aug 7 10:48:25 EDT 2016


Hi Jason -

Thanks for getting back to me. Comments below:


On Sun, Aug 7, 2016 at 1:35 AM, Jason Kenny <dragon512 at live.com> wrote:

>
>
> Sorry for the delay; I did not see this email, as it was moved to “junk”
> for some reason.
>
>
>
> I do recall having issues with not being able to open a file on Windows.
> There are a couple of reasons this happened:
>
>
>
> 1)      Antivirus software
>
There is no antivirus software installed on the system beyond anything
intrinsic to Windows, and if that were interfering, I'd expect a very low
rate of occurrence, since you would have to be unlucky enough to touch a
file at the same moment the AV scanner did. The actual rate of occurrence
is high (see below).



> 2)      The search client in the background indexing the files
>
The failures are far too common for this to always be the cause;
basically the same argument as above applies.



> 3)      The way Python and other classic “C” programs open files on Windows
>
More on that below.




>
>
> The last is the most important. In Parts I added support for long
> path/file names (up to around 32K in length) and support for symlinks and
> hardlinks. Honestly, a lot of the “ugly” win32 code in Parts exists to
> support this. Some of it is also for supporting Win2000/XP systems; that
> code should be pulled out at this point. The core problem that I found,
> and reported to some people at Microsoft, was the way fopen() opens files
> on Windows (it was being looked at; I don't know if it was fixed). It
> does not have FILE_SHARE_WRITE and FILE_SHARE_DELETE set correctly. This
> creates a case in which a process can open a file simply to read it and
> thereby prevent other processes from writing to or deleting it. This
> prevents hardlinks from working as they should on win32 systems.
>

I don't see how that could be a problem here. Once an object file has been
copied out of the cache, nothing is trying to write to the copied file;
only the linker is trying to read it. With the fix for the file handle leak
applied, the only way this could happen would be if shutil.copy2 failed to
close the file, and it too uses the with ... as pattern.
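
For concreteness, here is my understanding of the share-mode issue Jason
describes, sketched with pywin32 (which we have installed). The path is
just an example, and this is not code from SCons or Parts:

    import win32file

    # Open for reading with *full* sharing. Per Jason, fopen()-style opens
    # omit FILE_SHARE_WRITE/FILE_SHARE_DELETE, so a mere reader can block
    # other processes from writing to, renaming, or deleting the file.
    handle = win32file.CreateFile(
        r"build\cached\mongo\base\data_type_terminated_test.obj",  # example
        win32file.GENERIC_READ,
        win32file.FILE_SHARE_READ
        | win32file.FILE_SHARE_WRITE
        | win32file.FILE_SHARE_DELETE,
        None,                      # default security attributes
        win32file.OPEN_EXISTING,   # fail if the file does not exist
        0,                         # no special flags or attributes
        None,                      # no template file
    )
    try:
        rc, data = win32file.ReadFile(handle, 64 * 1024)  # read up to 64 KiB
    finally:
        handle.Close()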

I suppose the other way it could happen would be if SCons started the link
step while the copy of a needed object file out of the cache was still in
progress, so that the file in the variant dir being copied to was still
open for write. That seems like the sort of error that could occur in a
parallel build if the signal that something had been extracted from the
cache were somehow emitted prematurely (say, when the copy started rather
than when it finished), allowing another thread to begin executing the
dependent link task too early.
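
Purely as an illustration of the shape of bug I'm imagining (this is not
SCons code; the names are made up):

    import shutil
    import threading

    retrieved = threading.Event()

    def fetch_from_cache(cache_path, variant_path):
        retrieved.set()                  # BUG: signals before the copy is done
        shutil.copy2(cache_path, variant_path)

    def link_step(obj_path):
        retrieved.wait()                 # believes the .obj is ready to use
        with open(obj_path, "rb") as f:  # may find a partial, still-open file
            return f.read()

If the event were set only after copy2 returned, that window would close.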

But then, if that was the case, why haven't we seen this on Linux builds?



>
> Anyway, to get all this to work I had to move everything to use the Win32
> API for file handling. Doing this did seem to resolve a number of issues.
> This is all (or mostly) done as a big monkey patch in
> parts/overrides/os_file.py. In theory one *should* be able to just import
> Parts in the main SConstruct and it should all work. (I admit I might be
> wrong on some details for a raw SCons build.) But given that I did fix the
> last known items, it should be simple to test whether this helps by
> putting “from parts import *” at the start of the SConstruct (and
> installing Parts).
>
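
If I follow, the experiment would amount to this at the top of our
SConstruct (with Parts installed; I have not tried it yet):

    # Import Parts first so its monkey patches in parts/overrides/os_file.py
    # replace Python's default Windows file handling before SCons opens files.
    from parts import *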

If the Python Windows file handling is buggy and renders SCons unreliable,
shouldn't this part of Parts be upstreamed into SCons?



>
>
> We also had random failures due to a case in which a tool (in our case
> the Intel tools) would open a header for reading and, because it was
> opened with fopen and thus missing FILE_SHARE_DELETE, a hardlink to that
> file in some SDK area could not be deleted, failing the build (SCons
> wants to delete everything before it runs the actions). We hacked some
> fixes to prevent this from happening. I found that, in general, the tools
> from Microsoft seem to open files correctly (via a win32 call). This
> leaves Python as a likely cause. If I were to guess at what is happening,
> there might be something going on with SCons doing some check on the obj
> file with CacheDir on.
>
> If so, the Parts fixes might resolve it, given that Python/SCons could be
> holding a file handle and we change the default sharing mode used to open
> files at the Python level.
>

I didn't really understand this part.



Overall, the information about Parts is useful background, but adding yet
another layer of complexity does not seem like a step in the right
direction while trying to diagnose this. Adding Parts might indeed make the
problem go away, but I still wouldn't know why it did so, or what the root
cause was.

I think it is important to find that out.



>
>
> I would also find it interesting to know what happens if you rebuild
> right after this failure. It was common for us to do a scons -j32 … build
> and then, if that failed, a scons -k -j1 … to get a complete list of
> failures to fix. We noticed on both Linux and Windows that this helped us
> spot failures that just went away on the second build, which gave us
> clues about what was going on.
>

I do this frequently in my development workflow if I have, for instance, a
very localized compile error that I know how to fix but want to see how the
rest of the build goes, but it doesn't help for our CI builds. Our -j10
builds on Windows already take upwards of 40 minutes, which is why we are
so interested in CacheDir; backing down to serial after a failure isn't an
option. It would also be very confusing: given a failed build step, how do
I distinguish between a build that failed due to flaky build-system
behavior (and therefore fall back to serial) and one that failed because of
a legitimate compile error (in which case I should stop)? In the latter
case, continuing with a -k build is just going to emit an astronomical
logfile full of compiler errors and burn a ton of time and CPU.



>
> In this case, does the file link after the SCons process has died, or is
> it still locked in the new SCons process?
>

If I repeatedly invoke SCons, each build links some more targets, and
eventually I get to the end. The setup here is that the cache dir is
populated, but all linked executables need to be rebuilt. Starting with an
empty variant dir, the first run links many targets but eventually fails.
Restarting the build while leaving the artifacts from run 1 in the variant
dir makes further progress, but also eventually fails. After 5 or 6 cycles
of this, enough targets have been built that the last run has only a little
work left to do, and the build passes.


> Does this happen all the time?
>

Yes, every time. Starting from the populated-cache-dir, empty-variant-dir
state, where all executables require a re-link, I have never observed a
successful one-shot parallel build with CacheDir enabled.



> On the same file?
>

No, different files are seen as non-existent/unreadable by the linker each
time. It is truly unfortunate that Microsoft, in their wisdom, failed to
make the error reporting for LNK1104
(https://msdn.microsoft.com/en-us/library/ts7eyw4s.aspx) more detailed. As
it is, there is no way to tell whether the linker simply couldn't find the
file because it didn't exist, or found it but was unable to read it due to
either permissions or locking. I will try the earlier suggestion of Process
Monitor to find out which is really happening.
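
In the meantime, a quick post-failure check could at least separate
"missing" from "locked". A sketch of a hypothetical helper, not part of
our build:

    import errno
    import os

    def diagnose(path):
        # Classify why the linker might have failed to open `path`.
        if not os.path.exists(path):
            return "missing at link time (copy never happened or was deleted)"
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError as e:
            if e.errno in (errno.EACCES, errno.EBUSY):
                return "exists but unreadable (%s): permissions or a lock" % e
            return "exists, but open failed: %s" % e
        os.close(fd)
        return "exists and is readable now (transient lock at link time?)"

    print(diagnose(r"build\cached\mongo\base\data_type_terminated_test.obj"))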



> And does it happen only on -j build with j>1?
>

I'm only interested in -j>1 builds, really. As mentioned above, serial
builds would take hours. However, in the interest of determining whether
parallelism is at fault, I ran one. Again, this starts with a populated
cache dir and an empty variant dir: all the .obj files are in the cache,
but all of the executables need to be re-linked.

With CacheDir enabled but without -jN, it worked in one shot.

So we know that parallel builds without CacheDir work, because we have been
doing them forever. And we know that serial builds with CacheDir work,
because I just did one. And we know that parallel builds with CacheDir work
on Linux, because our developers have been doing them forever, and we
turned it on for Linux builds in our CI system without incident.

The problem, whatever it may be, appears to be limited to (or, perhaps, is
only evident during) parallel, CacheDir-enabled builds on Windows.
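
To summarize the matrix as observed:

                          CacheDir off    CacheDir on
    serial (-j1)          works           works (just verified, on Windows)
    parallel (Linux)      works           works
    parallel (Windows)    works           fails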



>
>
> I also recall us having issues with our CI system, as it ran in VMs. At
> times, odd behavior would happen on a VM versus a bare-metal machine;
> however, these issues seemed to show up randomly, not constantly.
>

Our entire CI system (and, increasingly, the rest of the world's) is
oriented around AWS and other VM-ish providers, so I don't really think
that advocating bare metal as a workaround is actionable. Note, however,
that there is still randomness here: the problem occurs with different
files/targets every time.



>
>
> Hope this helps
>

I do appreciate your sharing your thoughts on this. We had been building up
to a large-scale deployment of CacheDir in our CI system for some time,
with high hopes for dramatically improved build times, so there is definite
frustration that CacheDir seems not to work on the platform where we needed
it most.

Thanks,
Andrew



>
>
> Jason
>
>
>
> *From:* Bill Deegan [mailto:bill at baddogconsulting.com]
> *Sent:* Thursday, August 4, 2016 7:40 PM
> *To:* SCons users mailing list <scons-users at scons.org>; Jason Kenny <
> dragon512 at live.com>
> *Subject:* Re: [Scons-users] CacheDir race during parallel Windows builds?
>
>
>
> Jason,
>
> I know you did a lot of work on Windows builds.
>
> Did you run into anything similar to this?
>
> I notice Parts pulls in win32's CopyFile API
>
> -Bill
>
>
>
> On Thu, Aug 4, 2016 at 10:20 AM, Andrew C. Morrow <
> andrew.c.morrow at gmail.com> wrote:
>
>
>
> Hi -
>
>
>
> At MongoDB, we recently started using CacheDir in our CI system. This has
> been a big success in reducing rebuild times for our Linux builds;
> however, we were surprised to find that our Windows builds started
> failing in a very alarming way:
>
>
>
> Please see the following log file:
> https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_windows_64_2k8_debug_compile_81185a50aeed5b2beed2c0a81b381a482489fdb7_16_08_02_20_24_46/0?type=T
>
>
>
> The log lines of interest are:
>
>
>
> [2016/08/02 17:31:09.642] Retrieved `build\cached\mongo\base\data_type_terminated_test.obj' from cache
>
>
>
> Here, we see that we retrieved this .obj file from the cache. Nine seconds
> later, we try to use that object in a link step:
>
>
>
> [2016/08/02 17:31:18.921] link /nologo /DEBUG /INCREMENTAL:NO
> /LARGEADDRESSAWARE /OPT:REF /OUT:build\cached\mongo\base\base_test.exe
> build\cached\mongo\base\data_range.obj ...   build\cached\mongo\base\data_type_terminated_test.obj
> ...
>
>
>
> The link fails, claiming that the data_type_terminated_test.obj file
> cannot be opened:
>
>
>
> [2016/08/02 17:31:20.363] LINK : fatal error LNK1104: cannot open file
> 'build\cached\mongo\base\data_type_terminated_test.obj'
>
>  [2016/08/02 17:31:20.506] scons: *** [build\cached\mongo\base\base_test.exe]
> Error 1104
>
>
>
> We are using a vendored copy of SCons 2.5.0. The only modification is this:
>
>
>
> https://github.com/mongodb/mongo/commit/bc7e4e6821639ee766ada83483975668af98f367#diff-cc7aec1739634ca2a857a4d4227663aa
>
>
>
> This change was made so that the atime of files in the cache is
> fine-grained and accurate, even if the underlying filesystem is mounted
> noatime or relatime, so that we can prune the cache based on access time.
> We would like to propose upstreaming this change, but that is a separate
> email.
>
>
>
> SCons was invoked as follows from within an SSH session into Cygwin (you
> can see it at the top of the build log as well):
>
>
>
> python ./buildscripts/scons.py --dbg=on --opt=on --win-version-min=ws08r2
> -j$(( $(grep -c ^processor /proc/cpuinfo) / 2 )) MONGO_DISTMOD=2008plus
> --cache --cache-dir='z:\data\scons-cache\9d73adcd-19eb-46f2-9988-b8594ba5a3d1'
> --use-new-tools all dist dist-debugsymbols distsrc-zip
>  MONGO_VERSION=3.3.10-250-g81185a5
>
>
>
> The 'python' here is Windows Python, not Cygwin's, and PyWin32 is installed.
>
>
>
> The system on which this build ran is running Windows 2012 on a dedicated
> spot AWS c3.4xlarge instance, and the toolchain is Visual Studio 2015. The
> Z drive, where the cache directory is located, is locally-connected NTFS
> on AWS ephemeral/instance storage.
>
>
>
> We have since backed out CacheDir on our Windows builds, which is
> disappointing: Windows builds take forever compared to others, and we
> were really hoping that CacheDir would be a big help here.
>
>
>
> Has anyone seen anything like this, or have any ideas about what may be
> going wrong here? I know there have been some other recent threads about
> problems with Windows and build ordering, but this seems different: the
> retrieval of the file from the cache was correctly ordered, but it
> doesn't appear to have been effective.
>
>
>
> I'm happy to provide any additional information if it will help us get
> Windows CacheDir-enabled builds working.
>
>
>
> Thanks,
>
> Andrew
>
>
>
>
>
>
> _______________________________________________
> Scons-users mailing list
> Scons-users at scons.org
> https://pairlist4.pair.net/mailman/listinfo/scons-users
>
>