[Scons-users] CacheDir race during parallel Windows builds?
Jason Kenny
dragon512 at live.com
Sun Aug 7 01:35:29 EDT 2016
Sorry for the delay; I did not see this email because it was moved to “junk” for some reason.
I do recall having issues with not being able to open a file on Windows. There are a few reasons this happened:
1) Antivirus software
2) The search client in the background indexing the files
3) The way Python and other classic “C” programs open files on Windows
The last is the most important. In Parts I added support for long path/file names (up to around 32K characters in length) and for symlinks and hardlinks. Honestly, a lot of the “ugly” win32 code in Parts exists for this support. Some of it is also there to support Win2000/XP systems; that code should be pulled out at this point. The core problem that I found, and reported to some people at Microsoft, was the way fopen() opens files on Windows (it was being looked at; I don’t know if it was fixed). It does not set FILE_SHARE_WRITE and FILE_SHARE_DELETE correctly. As a result, a process that opens a file simply to read it can prevent other processes from writing to it. This prevents hardlinks from working as they should on win32 systems.
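To make the sharing-flag point concrete, here is a minimal sketch of opening a file for reading through the Win32 CreateFileW call with all three share flags set. It is not the code from Parts; the helper name open_shared_for_read is hypothetical, and the non-Windows branch simply falls back to the built-in open() so the sketch can be run anywhere.

```python
import os
import sys

# Win32 constants (values from the Windows SDK headers).
GENERIC_READ = 0x80000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
FILE_SHARE_DELETE = 0x00000004
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x80

# Allow other processes to read, write, and delete while we hold the handle.
SHARE_ALL = FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE


def open_shared_for_read(path):
    """Hypothetical helper: open `path` for binary reading with full sharing."""
    if sys.platform != "win32":
        # No Win32 share modes here; fall back to a plain open.
        return open(path, "rb")

    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.argtypes = [
        wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
        wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
    ]
    kernel32.CreateFileW.restype = wintypes.HANDLE

    handle = kernel32.CreateFileW(
        path, GENERIC_READ, SHARE_ALL, None,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, None,
    )
    # INVALID_HANDLE_VALUE is (HANDLE)-1.
    if handle is None or handle == ctypes.c_void_p(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())

    # Wrap the Win32 handle in a C runtime descriptor, then a Python file.
    fd = msvcrt.open_osfhandle(handle, os.O_RDONLY | os.O_BINARY)
    return os.fdopen(fd, "rb")
```

With a handle opened this way, another process should still be able to replace or delete the file (or a hard link to it) while it is being read, which is the behavior the share flags exist to permit.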
Anyway, to get all this to work I had to move everything to use the Win32 API for file handling. Doing this did seem to resolve a number of issues. This should all (or mostly) be done as one big monkey patch in parts/overrides/os_file.py. In theory one should be able to just import Parts in the main SConstruct and it should all work. (I admit I might be wrong about some details for a raw SCons build.) But given that I did fix the last known items, it should be simple to test whether this helps: install Parts and add “from parts import *” at the start of the SConstruct.
We also had random failures due to a case in which a tool (in our case the Intel tools) would load a header for reading and, because it was opened with fopen (and so without FILE_SHARE_DELETE), a hard link to that file in some SDK area could not be deleted, which failed the build (as SCons wants to delete everything before it runs the actions). We hacked in some fixes to prevent this from happening. I found that in general the tools from Microsoft seem to open files correctly (via a win32 call). This leaves Python as a likely cause. If I were to guess what is happening, there might be something going on with SCons doing some check on the obj file with CacheDir on.
If so, the Parts fixes might resolve this if Python/SCons is holding a file handle, given that we change the default file permission used to open a file at the Python level.
I would also find it interesting to know what happens if they rebuild right after this failure. It was common for us to do a “scons -j32 …” build and then, if that failed, to do a “scons -k -j1 …” to get a complete list of failures to fix. We noticed on Linux and Windows that this helped us see failures that just went away on the second build. This helped us get clues about what was going on.
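The two-pass workflow described above could be scripted roughly as follows. This is only a sketch: the function name and the `run` parameter are hypothetical, and `run` exists so the flow can be exercised without SCons installed. The -j and -k options are the standard SCons ones.

```python
import subprocess


def build_then_serial_retry(base_cmd, jobs=32, run=subprocess.call):
    """Run a parallel build; if it fails, rerun serially with -k.

    `base_cmd` is the build command as a list, e.g. ["scons", "all"].
    `run` takes an argument list and returns an exit code.
    Returns (parallel_exit_code, serial_exit_code), where the second
    item is None when the parallel build succeeded.
    """
    parallel_rc = run(base_cmd + ["-j%d" % jobs])
    if parallel_rc == 0:
        return parallel_rc, None
    # Second pass: keep going after errors, one job at a time, so that
    # failures which only show up under parallelism tend to disappear.
    serial_rc = run(base_cmd + ["-k", "-j1"])
    return parallel_rc, serial_rc
```

Comparing the failures reported by the two passes shows which ones persist in a serial build and which appear only under -j.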
In this case, does the file now link after the SCons process has died, or is it still locked in the new SCons process?
Does this happen all the time?
On the same file?
And does it happen only on -j builds with j>1?
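One way to gather answers to these questions is to probe the failing file right after the error and record whether it becomes openable. The helper below is a hypothetical diagnostic sketch, not part of SCons or Parts; the attempt count and delay are arbitrary defaults.

```python
import time


def probe_open(path, attempts=5, delay=1.0):
    """Try to open `path` for reading up to `attempts` times.

    Returns a list of (elapsed_seconds, error_or_None) tuples, one per
    attempt, stopping after the first successful open.
    """
    results = []
    start = time.time()
    for i in range(attempts):
        try:
            with open(path, "rb"):
                err = None
        except OSError as exc:
            err = exc
        results.append((time.time() - start, err))
        if err is None:
            break
        if i + 1 < attempts:
            time.sleep(delay)
    return results
```

If the first attempts fail and a later one succeeds, that would point to a transient lock (for example another process briefly holding the file) rather than a missing or corrupt file.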
I also recall us having issues with our CI system, as the builds ran in VMs. At times odd behavior would happen on a VM versus a bare-metal machine; however, these problems seemed to show up randomly, not consistently.
Hope this helps
Jason
From: Bill Deegan [mailto:bill at baddogconsulting.com]
Sent: Thursday, August 4, 2016 7:40 PM
To: SCons users mailing list <scons-users at scons.org>; Jason Kenny <dragon512 at live.com>
Subject: Re: [Scons-users] CacheDir race during parallel Windows builds?
Jason,
I know you did a lot of work on Windows builds.
Did you run into anything similar to this?
I notice Parts pulls in win32's CopyFile API
-Bill
On Thu, Aug 4, 2016 at 10:20 AM, Andrew C. Morrow <andrew.c.morrow at gmail.com> wrote:
Hi -
At MongoDB, we recently started using CacheDir in our CI system. This has been a big success for reducing rebuild times for our Linux builds, however, we were surprised to find that our Windows builds started failing in a very alarming way:
Please see the following log file: https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_windows_64_2k8_debug_compile_81185a50aeed5b2beed2c0a81b381a482489fdb7_16_08_02_20_24_46/0?type=T
The log lines of interest are:
[2016/08/02 17:31:09.642] Retrieved `build\cached\mongo\base\data_type_terminated_test.obj' from cache
Here, we see that we retrieved this .obj file from the cache. Nine seconds later, we try to use that object in a link step:
[2016/08/02 17:31:18.921] link /nologo /DEBUG /INCREMENTAL:NO /LARGEADDRESSAWARE /OPT:REF /OUT:build\cached\mongo\base\base_test.exe build\cached\mongo\base\data_range.obj ... build\cached\mongo\base\data_type_terminated_test.obj ...
The link fails, claiming that the data_type_terminated_test.obj file cannot be opened:
[2016/08/02 17:31:20.363] LINK : fatal error LNK1104: cannot open file 'build\cached\mongo\base\data_type_terminated_test.obj'
[2016/08/02 17:31:20.506] scons: *** [build\cached\mongo\base\base_test.exe] Error 1104
We are using a vendored copy of SCons 2.5.0. The only modification is this:
https://github.com/mongodb/mongo/commit/bc7e4e6821639ee766ada83483975668af98f367#diff-cc7aec1739634ca2a857a4d4227663aa
This change was made so that the atime of files in the cache is fine-grained accurate, even if the underlying filesystem is mounted noatime or relatime, so that we can prune the cache based on access time. We would like to propose this change to be upstreamed, but that is a separate email.
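As an illustration of the idea only (this is not the linked MongoDB patch, and both function names are hypothetical), access-time tracking that does not depend on mount options can be sketched by setting the atime explicitly with os.utime on each cache hit and then selecting prune candidates by atime:

```python
import os
import time


def touch_atime(path, now=None):
    """Set the access time of `path` to `now`, leaving its mtime unchanged."""
    if now is None:
        now = time.time()
    st = os.stat(path)
    # An explicit utime call updates atime even when the filesystem is
    # mounted noatime or relatime.
    os.utime(path, (now, st.st_mtime))


def stale_entries(cache_dir, max_age_seconds, now=None):
    """Return cache file paths whose atime is older than `max_age_seconds`."""
    if now is None:
        now = time.time()
    stale = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.stat(path).st_atime > max_age_seconds:
                stale.append(path)
    return stale
```

A pruning job would call stale_entries on the cache directory and remove what it returns, assuming every cache retrieval calls touch_atime on the entry it used.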
SCons was invoked as follows from within an SSH session into cygwin (you can see it at the top of the build log as well):
python ./buildscripts/scons.py --dbg=on --opt=on --win-version-min=ws08r2 -j$(( $(grep -c ^processor /proc/cpuinfo) / 2 )) MONGO_DISTMOD=2008plus --cache --cache-dir='z:\data\scons-cache\9d73adcd-19eb-46f2-9988-b8594ba5a3d1' --use-new-tools all dist dist-debugsymbols distsrc-zip MONGO_VERSION=3.3.10-250-g81185a5
The 'python' here is Windows python, not cygwin, and PyWin32 is installed.
The system on which this build ran is running Windows 2012 on a dedicated spot AWS c3.4xlarge instance, and the toolchain is Visual Studio 2015. The Z drive, where the cache directory is located, is locally connected NTFS via AWS ephemeral/instance storage.
We have since backed out using the Cache on our Windows builds, which is disappointing - Windows builds take forever compared to others, and we were really hoping that CacheDir would be a big help here.
Has anyone seen anything like this, or has some ideas what may be going wrong here? I know there have been some other recent threads about problems with Windows and build ordering, but this seems different - the retrieval of the file from the Cache was correctly ordered, but it doesn't appear to have been effective.
I'm happy to provide any additional information if it will help us get Windows CacheDir enabled builds working.
Thanks,
Andrew
_______________________________________________
Scons-users mailing list
Scons-users at scons.org
https://pairlist4.pair.net/mailman/listinfo/scons-users