[Scons-users] ClassicCPP scanner fails with UTF-8 BOM on non-UTF8 file
Damjan Jovanovic
damjan.jov at gmail.com
Mon Jun 29 00:29:41 EDT 2020
Hi
I am trying to port Apache OpenOffice to build with SCons, and while it's
still early, so far it's been a fantastic replacement for its current
"gbuild" system based on GNU make with unmaintainable custom eval()-based
logic. It has all the advanced build features we need usable in such a
clear and simple way.
When it comes to dependency scanning, however, I've found a potential
problem. Our custom l10n translation system generates .res files from .src
files in 4 steps:
    source/newerverwarn.src
        | transex3
        v
    SrsPartMergeTarget/newerverwarn.src
        | rsc
        v
    SrsPartTarget/newerverwarn.src
        | cat files together
        v
    SrsTarget/uui.srs
        | rsc
        v
    ResTarget/uuien-US.res
All that already works (and is half the code and 1000 times more readable
than GNU make). The problem is that .src files can #include other files.
When I add a dependency scanner for .src files:
env.Append(SCANNERS=ClassicCPP("AOOSRCScanner", '.src', "CPPPATH",
    '^[ \t]*#[ \t]*(?:include|import)[ \t]*(<|")([^>"]+)(>|")'))
I only want it to scan the first file (source/newerverwarn.src), but it
also scans the 2nd and 3rd .src files. With translation off it works
perfectly.
However with translation on, the 2nd/3rd files are unusual, in that they
start with a UTF-8 BOM but the remainder of the file isn't UTF-8 but rather
latin-1 (a sample file is attached). This causes scons to exit with an
exception:
scons: ***
[solver/450/unxfbsdx/workdir/scons/uui/Res/SrsTarget/uui/res.srs]
UnicodeDecodeError : 'utf-8' codec can't decode byte 0xc3 in position 3552:
invalid continuation byte
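The failure can be reproduced outside SCons with a few bytes. The sketch below is only an illustration: the sample bytes and the helper name decode_after_bom are made up, and it assumes the decoding step amounts to "strip the BOM, then decode strictly as UTF-8", as the description in this message implies.

```python
import codecs

# Hypothetical stand-in for a generated .src file: a UTF-8 BOM followed by
# Latin-1 text. 0xC3 is a single character in Latin-1, but in UTF-8 it is
# the lead byte of a two-byte sequence and must be followed by a
# continuation byte.
sample = codecs.BOM_UTF8 + b'Text = "S\xc3O PAULO";\n'


def decode_after_bom(data):
    """Strip a leading UTF-8 BOM, then decode the rest strictly as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


try:
    decode_after_bom(sample)
    failure = None
except UnicodeDecodeError as exc:
    # Same class of error as in the SCons output above.
    failure = exc
    print(exc)
```

The BOM only promises UTF-8; the 0xC3 byte followed by a plain ASCII letter is what breaks the strict decode.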
I could hack File.get_text_contents() in engine/SCons/Node/FS.py to catch
exceptions when parsing in utf-8 with the BOM removed and retry in latin-1.
But an even better solution would be to limit the ClassicCPP scanner to
only scan the .src files under the source/ directory and ignore the
generated files in other directories. Is there some way to do that?
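For reference, the first option (retrying in latin-1 when UTF-8 decoding fails) could look roughly like the sketch below. This is not SCons code: decode_with_fallback is a hypothetical helper, and how it would be wired into File.get_text_contents() is left open.

```python
import codecs


def decode_with_fallback(data):
    """Decode bytes as UTF-8 (minus any BOM), falling back to Latin-1."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return data.decode("latin-1")


# Made-up sample: BOM followed by a Latin-1 encoded "e acute" (0xE9).
mixed = codecs.BOM_UTF8 + b"caf\xe9\n"
print(repr(decode_with_fallback(mixed)))
```

The fallback will happily mis-decode files in other 8-bit encodings, which is probably harmless when the text is only scanned for ASCII #include lines, but it is one reason restricting the scanner to source/ looks like the cleaner fix.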
Thank you
Damjan
P.S. Standalone test with the sample file:
import sys
import codecs

# Read raw bytes so that the decoding is done explicitly below.
with open(sys.argv[1], 'rb') as f:
    contents = f.read()

if contents.startswith(codecs.BOM_UTF8):
    print(sys.argv[1] + " starts with UTF-8 BOM")
else:
    print(sys.argv[1] + " does not start with UTF-8 BOM")

try:
    contents.decode('utf-8')
    print('Decoded in utf-8')
except UnicodeDecodeError:
    try:
        # latin-1 maps every byte to a character, so this always succeeds.
        contents.decode('latin-1')
        print('Decoded in latin-1')
    except UnicodeDecodeError:
        # Last resort: keep undecodable bytes as backslash escapes.
        contents.decode('utf-8', errors='backslashreplace')
        print('Decoded in utf-8 with backslashreplace')
-------------- next part --------------
A non-text attachment was scrubbed...
Name: newerverwarn.zip
Type: application/zip
Size: 17158 bytes
Desc: not available
URL: <https://pairlist4.pair.net/pipermail/scons-users/attachments/20200629/264ef3c8/attachment-0001.zip>