[Scons-users] ClassicCPP scanner fails with UTF-8 BOM on non-UTF8 file

Mon Jun 29 00:29:41 EDT 2020

Hi

I am trying to port Apache OpenOffice to build with SCons, and while it's
still early, so far it's been a fantastic replacement for its current
"gbuild" system based on GNU make with unmaintainable custom eval()-based
logic. It has all the advanced build features we need usable in such a
clear and simple way.

When it comes to dependency scanning, however, I've found a potential
problem. Our custom l10n translation system generates .res files from .src
files in 4 steps:

source/newerverwarn.src
 |  transex3
 v
SrsPartMergeTarget/newerverwarn.src
 | rsc
 v
SrsPartTarget/newerverwarn.src
 | cat files together
 v
SrsTarget/uui.srs
 | rsc
 v
ResTarget/uuien-US.res

All that already works (and is half the code and 1000 times more readable
than GNU make). The problem is that .src files can #include other files.
When I add a dependency scanner for .src files:

env.Append(SCANNERS=ClassicCPP(
    "AOOSRCScanner", '.src', "CPPPATH",
    r'^[ \t]*#[ \t]*(?:include|import)[ \t]*(<|")([^>"]+)(>|")'))

intending it to scan only the first file (source/newerverwarn.src), it also
scans the 2nd and 3rd .src files. With translation off, it works perfectly.
With translation on, however, the 2nd and 3rd files are unusual: they start
with a UTF-8 BOM, but the remainder of the file isn't UTF-8 but rather
latin-1 (a sample file is attached). This causes scons to exit with an
exception:

scons: ***
[solver/450/unxfbsdx/workdir/scons/uui/Res/SrsTarget/uui/res.srs]
UnicodeDecodeError : 'utf-8' codec can't decode byte 0xc3 in position 3552:
invalid continuation byte
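
The same kind of failure can be reproduced outside SCons. The snippet below
is only a sketch: the byte string is a made-up stand-in for a generated .src
file, and try_utf8_after_bom() is a hypothetical helper, not the SCons code.

```python
import codecs

# Made-up stand-in for a generated .src file: a UTF-8 BOM followed by
# latin-1 text. In latin-1, 0xC3 is a single character; in UTF-8 it opens
# a two-byte sequence, and the plain "O" after it cannot continue it.
sample = codecs.BOM_UTF8 + b'Text = "S\xc3O";\n'

def try_utf8_after_bom(data):
    """Strip a leading UTF-8 BOM, then decode the rest strictly as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

try:
    try_utf8_after_bom(sample)
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.reason carries the short explanation shown in the traceback.
    outcome = exc.reason

print(outcome)
```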

I could hack File.get_text_contents() in engine/SCons/Node/FS.py to catch
exceptions when parsing in utf-8 with the BOM removed and retry in latin-1.
But an even better solution would be to limit the ClassicCPP scanner to
only scan the .src files under the source/ directory and ignore the
generated files in other directories. Is there some way to do that?
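
One way the directory restriction might look is to wrap the scanner in a
function that returns no dependencies for nodes outside source/. This is an
untested sketch, not a documented SCons recipe: under_directory() and
restrict_scan() are hypothetical helpers, and it assumes that str(node) gives
the node's path and that generated paths never contain a "source" component.

```python
def under_directory(path, directory="source"):
    """True if `directory` is one of the directory components of `path`."""
    parts = str(path).replace("\\", "/").split("/")
    return directory in parts[:-1]   # the last part is the file name itself

def restrict_scan(scan, accept):
    """Wrap a scanner callable so it only runs on nodes `accept` approves."""
    def filtered(node, env, path=()):
        if not accept(str(node)):
            return []                # report no implicit dependencies
        return scan(node, env, path)
    return filtered

# Possible SConstruct wiring (assumption, untested), where src_scanner is
# the ClassicCPP scanner object from the snippet above:
#
#   env.Append(SCANNERS=Scanner(
#       function=restrict_scan(src_scanner, under_directory),
#       skeys=['.src'],
#       path_function=FindPathDirs('CPPPATH'),
#       recursive=True))
```

With recursion on, the wrapper may also be applied to the included files
themselves, so includes that live outside source/ would not be scanned any
further; the predicate would need widening if that matters.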

Thank you
Damjan

P.S. Standalone test with the sample file:

import sys
import codecs

# Read the raw bytes so the encoding can be probed by hand.
with open(sys.argv[1], 'rb') as f:
    contents = f.read()

if contents.startswith(codecs.BOM_UTF8):
    print(sys.argv[1] + " starts with UTF-8 BOM")
else:
    print(sys.argv[1] + " does not start with UTF-8 BOM")

try:
    contents.decode('utf-8')
    print('Decoded in utf-8')
except UnicodeDecodeError:
    try:
        # latin-1 maps every byte to a code point, so this always succeeds.
        contents.decode('latin-1')
        print('Decoded in latin-1')
    except UnicodeDecodeError:
        # Last resort: keep undecodable bytes as backslash escapes.
        contents.decode('utf-8', errors='backslashreplace')
        print('Decoded in utf-8 with backslashreplace')
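
The fallback mentioned above for File.get_text_contents() could be sketched
as a standalone function. decode_with_fallback() is a hypothetical name used
for illustration only; this is not how SCons itself is implemented.

```python
import codecs

def decode_with_fallback(data):
    """Return (text, encoding_used) for the raw bytes of a file."""
    # Drop a leading UTF-8 BOM if there is one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return data.decode("latin-1"), "latin-1"

# A BOM followed by latin-1 bytes, like the generated .src files.
text, used = decode_with_fallback(codecs.BOM_UTF8 + b"S\xc3O")
print(used)
```

Because latin-1 never raises, the retry always yields some text, which is
enough for an #include scanner even if non-ASCII characters come out wrong.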
-------------- next part --------------
A non-text attachment was scrubbed...
Name: newerverwarn.zip
Type: application/zip
Size: 17158 bytes
Desc: not available
URL: <https://pairlist4.pair.net/pipermail/scons-users/attachments/20200629/264ef3c8/attachment-0001.zip>