[Scons-users] SCons 3.0.0.alpha.20170614 available on testpypi
Tim Jenness
tjenness at lsst.org
Mon Jul 31 19:00:20 EDT 2017
It might help if I attach the relevant file.
After some more investigation I can report that this file uses latin-1 (or equivalent) rather than utf-8.
>>> u'\xdf'
'ß'
>>> u'\xdf'.encode('utf-8')
b'\xc3\x9f'
>>> u'\xdf'.encode('latin-1')
b'\xdf'
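The failure shows up in the other direction, when those latin-1 bytes are decoded as UTF-8. A minimal sketch, assuming Python 3; the sample word and variable names are made up for illustration, and only standard bytes.decode() behaviour is relied on:

```python
# Made-up sample: 'ß' stored as the single latin-1 byte 0xdf inside ASCII text.
raw = b"Gr\xdfe"

# latin-1 maps every byte to a code point, so this cannot fail.
as_latin1 = raw.decode("latin-1")

# 0xdf is a UTF-8 lead byte that must be followed by a continuation byte;
# here it is followed by 'e', so strict decoding raises.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The error handlers trade correctness of that one character for a usable str.
ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD
```

With errors="ignore" the undecodable byte simply disappears from the result, which is harmless for a scanner looking for #include lines but does lose the character.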
The chardet package does manage to detect an encoding that works for this content:
>>> chardet.detect(xx)
{'encoding': 'windows-1252', 'confidence': 0.7017158671586715}
The real question is whether SCons will encounter decoding errors in the parts of the file that matter to the scanner. Ignoring decoding errors is at least more correct than returning bytes, since returning bytes breaks the API contract and causes failures downstream. It looks like chardet is LGPL so that’s not going to help SCons.
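A rough sketch of the "always return a string" idea, not the actual patch. The function name decode_text_contents and the sample bytes are hypothetical, and UTF-8 is assumed as the default encoding when no BOM is present (as on Python 3):

```python
import codecs

def decode_text_contents(contents, errors="ignore"):
    """Hypothetical stand-in for get_text_contents(): always returns str."""
    # Strip a known BOM explicitly, then decode with the matching codec.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if contents.startswith(bom):
            return contents[len(bom):].decode(encoding, errors)
    try:
        # Assumption: no BOM means UTF-8; try a strict decode first.
        return contents.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to a lossy decode rather than handing bytes to callers.
        return contents.decode("utf-8", errors)

# Made-up header: a latin-1 byte in a comment, followed by an include line.
sample = b'// Author: Stra\xdfe\n#include "detail/config.hpp"\n'
text = decode_text_contents(sample)
```

Here text is a str with the undecodable byte dropped, and the include line is intact, so a scanner that only cares about #include directives still sees what it needs. Passing errors="replace" instead would keep a visible U+FFFD marker where the bad byte was.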
> On Jul 31, 2017, at 15:17 , Tim Jenness <tjenness at lsst.org> wrote:
>
> I just found Bill’s reply to me in the mail archive, but I don’t remember receiving it. Sorry about that.
>
> I’m not sure how to write a test case as such. An example file from boost::python that fails is http://www.boost.org/doc/libs/1_64_0/boost/python/detail/dealloc.hpp (although I imagine you can’t copy and paste from that link and would need to get the file from the source distribution). This file has a character in the author’s name that can’t be decoded, which triggers errors when bytes come back from get_text_contents(). The test should include a file with that problem and check that the scanner can read it.
>
> —
> Tim Jenness
>
>
>> On Jul 31, 2017, at 14:57 , Tim Jenness <tjenness at lsst.org> wrote:
>>
>>
>>> On Jul 21, 2017, at 15:19 , Tim Jenness <tjenness at lsst.org> wrote:
>>>
>>> Now that I’ve thought about it a bit more I think the underlying problem is in engine/SCons/Node/FS.py around line 2630:
>>>
>>> 2608     def get_text_contents(self):
>>> 2609         """
>>> 2610         This attempts to figure out what the encoding of the text is
>>> 2611         based upon the BOM bytes, and then decodes the contents so that
>>> 2612         it's a valid python string.
>>> 2613         """
>>> 2614         contents = self.get_contents()
>>> 2615         # The behavior of various decode() methods and functions
>>> 2616         # w.r.t. the initial BOM bytes is different for different
>>> 2617         # encodings and/or Python versions. ('utf-8' does not strip
>>> 2618         # them, but has a 'utf-8-sig' which does; 'utf-16' seems to
>>> 2619         # strip them; etc.) Just sidestep all the complication by
>>> 2620         # explicitly stripping the BOM before we decode().
>>> 2621         if contents[:len(codecs.BOM_UTF8)] == codecs.BOM_UTF8:
>>> 2622             return contents[len(codecs.BOM_UTF8):].decode('utf-8')
>>> 2623         if contents[:len(codecs.BOM_UTF16_LE)] == codecs.BOM_UTF16_LE:
>>> 2624             return contents[len(codecs.BOM_UTF16_LE):].decode('utf-16-le')
>>> 2625         if contents[:len(codecs.BOM_UTF16_BE)] == codecs.BOM_UTF16_BE:
>>> 2626             return contents[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
>>> 2627         try:
>>> 2628             return contents.decode()
>>> 2629         except (UnicodeDecodeError, AttributeError) as e:
>>> 2630             return contents
>>>
>>> The problem is that if we fail to convert the bytes to Unicode the method returns the “text” contents in bytes. This breaks the contract of get_text_contents() promising to return a string.
>>>
>>> Removing the try block at line 2627 and instead using “return contents.decode(errors="ignore")” fixes my boost.python build problem.
>>>
>>
>> This problem still exists at https://bitbucket.org/scons/scons/src/cfbc036995c8669e296cc0427655345241a0097e/src/engine/SCons/Node/FS.py?at=default&fileviewer=file-view-default#FS.py-2630
>>
>> Should I be discussing this on the dev list? Sorry, but I don’t have time to learn how to do a bitbucket PR using hg. The good news is our 500,000 lines of python and C++ code built fine with my suggested fix in place.
>>
>> Patch is at https://github.com/lsst/scons/blob/tickets/DM-8560/patches/0001-always-decode-contents.patch
>>
>> —
>> Tim Jenness
>>
>> _______________________________________________
>> Scons-users mailing list
>> Scons-users at scons.org
>> https://pairlist4.pair.net/mailman/listinfo/scons-users
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dealloc.hpp
Type: application/octet-stream
Size: 542 bytes
Desc: not available
URL: <https://pairlist4.pair.net/pipermail/scons-users/attachments/20170731/222be3e9/attachment.obj>