[Scons-users] AddPostAction memoization problem
Mats Wichmann
mats at wichmann.us
Fri May 31 10:33:41 EDT 2024
On 5/31/24 04:20, Mike Haboustak wrote:
> On Wed, May 29, 2024 at 5:41 PM Mats Wichmann <mats at wichmann.us> wrote:
>>
>> On 5/29/24 15:27, Mike Haboustak wrote:
>>> On Wed, May 29, 2024 at 9:49 AM Mats Wichmann <mats at wichmann.us> wrote:
>>>>
>>>> I'd personally agree as to correct behavior: if a step that makes
>>>> changes has completed, information related to the node should be up to
>>>> date and not old. I guess it's worth mentioning that the getsize()
>>>> method isn't *officially* exposed as a public interface, it's possible
>>>> there was an actual reason for that back in the mists of time.
>>>>
>>>
>>> Is it true that a function isn't officially exposed if it doesn't have
>>> a docstring in the API docs? The documentation includes
>>> SCons.Node.FS.File.get_size() right next to get_relpath().
>>> https://scons.org/doc/latest/HTML/scons-api/SCons.Node/#SCons.Node.FS.File.get_size
>>
>> No, the "public API" is the manpage. Unfortunately, the "API Docs"
>> don't distinguish between public and internal reliably (we eventually
>> added that notation to the main page). Every so often there's a
>> docstring annotation to help out, but it's not frequent.
>>
>
> I spent some time yesterday reading the man page, which I haven't done
> in years, and also the notes at the beginning of the SCons API
> documentation to better understand your perspective on the public
> SCons interface. What's clear to me is that my build systems extend
> SCons with many custom Builders and Tools to encapsulate reusable
> build components. My role as a tool and builder author involves
> understanding and calling SCons API functions.
>
> When I wrote my initial post, I thought that the caching/memoization
> feature of SCons was part of an infrastructure layer that lived below
> the SCons API and that in my role as a tool author I should not need
> to be aware that SCons was performing this optimization internally.
> Unfortunately, I'm no longer convinced that this is correct or
> possible.
You should not need to. We explained how the current implementation
causes the behavior you've observed, but that's all... from my
perspective, memoizing the stat structure must have been tempting
because it's comparatively expensive - the filesystem layer has to
initiate a call to top operating system to fetch file information from
wherever it comes from, and if it's the result of a just-changed file
it's probably not in the OS cache layer either. When i quickly
instrumented the node layer and ran an example very similar to yours,
still just one source file and one target file, the stat method gets
called five times for the source file and eight for the target; the
other memozing method that's affected (get_size) is called three times
for each.
Unfortunately, IMO, stat() is a poor candidate for memoization because
the return of the Node's stat() call is not invariant (when they teach
the memoization concept it's nearly always Fibonacci, which is prefect:
fibonacci(4) is always the same value no matter what you do). Here, any
time the disk file changes, whether from non-existing->existing, or
old-contents->new-contents, the call to that function with its node
object as the argument would return something different if not memoized.
And as you've mentioned, any action in a chain could cause a change. The
executor, which fires off those actions, is in a poor position to know
whether any particular action did cause a change to an underlying disk
file - it just knows how to invoke the action, not what its results will
be. It's also impractical to have each action figure that out - if
they're function actions it's possible but would have to be coded, but
in the more common case of a command action, where a command line is
handed off to the operating system via a subprocess call it's outside
the SCons process so opaque. So the executor could invalidate on each
step whether or not it's needed, or just stop memoizing the stat
structure. So much for internal details. I have no idea whether either
approach is palatable to SCons.
But you as the build maintainer are not supposed to have to know about
those details, because behind-the-scenes caching for performance reasons
really should work as you initially expected.
> The only way this ActionFunction can get correct information from
> Node.FS.get_size() is to have a Node.invalidate() method that allows
> the author to clear a target's cached values. Every API function that
> memoizes its return value needs to be marked in the documentation, so
> that we know to invalidate the Node any time we modify the file on
> disk. The Action executor would use the same API on targets and side
> effects either before or after running an Action.
I suppose that's an approach that would reduce the amount of recoding
needed in SCons itself - but it seems poor to have to paper around a bad
assumption in the SCons implementation (namely that a whole action chain
can be treated as a single unit, and you only invalidate when it's done)
by exposing through documentation a whole bunch more information that
should be internal. But I'm only one opinion... If it's not, in the
end, agreed to make changes internally, we'll need to document for
AddPostAction that actions may not see changes to node details made by
previous actions on the same target.
More information about the Scons-users
mailing list