[Scons-users] AddPostAction memoization problem

Fri May 31 10:33:41 EDT 2024

On 5/31/24 04:20, Mike Haboustak wrote:
> On Wed, May 29, 2024 at 5:41 PM Mats Wichmann <mats at wichmann.us> wrote:
>>
>> On 5/29/24 15:27, Mike Haboustak wrote:
>>> On Wed, May 29, 2024 at 9:49 AM Mats Wichmann <mats at wichmann.us> wrote:
>>>>
>>>> I'd personally agree as to correct behavior: if a step that makes
>>>> changes has completed, information related to the node should be up to
>>>> date and not old.  I guess it's worth mentioning that the getsize()
>>>> method isn't *officially* exposed as a public interface, it's possible
>>>> there was an actual reason for that back in the mists of time.
>>>>
>>>
>>> Is it true that a function isn't officially exposed if it doesn't have
>>> a docstring in the API docs? The documentation includes
>>> SCons.Node.FS.File.get_size() right next to get_relpath().
>>> https://scons.org/doc/latest/HTML/scons-api/SCons.Node/#SCons.Node.FS.File.get_size
>>
>> No, the "public API" is the manpage.  Unfortunately, the "API Docs"
>> don't distinguish between public and internal reliably (we eventually
>> added that notation to the main page).  Every so often there's a
>> docstring annotation to help out, but it's not frequent.
>>
> 
> I spent some time yesterday reading the man page, which I haven't done
> in years, and also the notes at the beginning of the SCons API
> documentation to better understand your perspective on the public
> SCons interface. What's clear to me is that my build systems extend
> SCons with many custom Builders and Tools to encapsulate reusable
> build components. My role as a tool and builder author involves
> understanding and calling SCons API functions.
> 
> When I wrote my initial post, I thought that the caching/memoization
> feature of SCons was part of an infrastructure layer that lived below
> the SCons API and that in my role as a tool author I should not need
> to be aware that SCons was performing this optimization internally.
> Unfortunately, I'm no longer convinced that this is correct or
> possible. 

You should not need to.  We explained how the current implementation 
causes the behavior you've observed, but that's all... from my 
perspective, memoizing the stat structure must have been tempting 
because it's comparatively expensive - the filesystem layer has to 
initiate a call to top operating system to fetch file information from 
wherever it comes from, and if it's the result of a just-changed file 
it's probably not in the OS cache layer either. When i quickly 
instrumented the node layer and ran an example very similar to yours, 
still just one source file and one target file, the stat method gets 
called five times for the source file and eight for the target; the 
other memozing method that's affected (get_size) is called three times 
for each.

Unfortunately, IMO, stat() is a poor candidate for memoization because 
the return of the Node's stat() call is not invariant (when they teach 
the memoization concept it's nearly always Fibonacci, which is prefect: 
fibonacci(4) is always the same value no matter what you do). Here, any 
time the disk file changes, whether from non-existing->existing, or 
old-contents->new-contents, the call to that function with its node 
object as the argument would return something different if not memoized. 
And as you've mentioned, any action in a chain could cause a change. The 
executor, which fires off those actions, is in a poor position to know 
whether any particular action did cause a change to an underlying disk 
file - it just knows how to invoke the action, not what its results will 
be.  It's also impractical to have each action figure that out - if 
they're function actions it's possible but would have to be coded, but 
in the more common case of a command action, where a command line is 
handed off to the operating system via a subprocess call it's outside 
the SCons process so opaque.  So the executor could invalidate on each 
step whether or not it's needed, or just stop memoizing the stat 
structure.  So much for internal details. I have no idea whether either 
approach is palatable to SCons.

But you as the build maintainer are not supposed to have to know about 
those details, because behind-the-scenes caching for performance reasons 
really should work as you initially expected.

> The only way this ActionFunction can get correct information from
> Node.FS.get_size() is to have a Node.invalidate() method that allows
> the author to clear a target's cached values. Every API function that
> memoizes its return value needs to be marked in the documentation, so
> that we know to invalidate the Node any time we modify the file on
> disk. The Action executor would use the same API on targets and side
> effects either before or after running an Action.

I suppose that's an approach that would reduce the amount of recoding 
needed in SCons itself - but it seems poor to have to paper around a bad 
assumption in the SCons implementation (namely that a whole action chain 
can be treated as a single unit, and you only invalidate when it's done) 
by exposing through documentation a whole bunch more information that 
should be internal. But I'm only one opinion...  If it's not, in the 
end, agreed to make changes internally, we'll need to document for 
AddPostAction that actions may not see changes to node details made by 
previous actions on the same target.