[Scons-users] SCons 3.0.0.alpha.20170614 available on testpypi
Bill Deegan
bill at baddogconsulting.com
Mon Jul 31 19:48:17 EDT 2017
Maybe drop the BOM checks and use a series of try/excepts instead:
utf-8, utf-16, windows-1252, then errors=ignore?
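
A minimal sketch of that fallback chain, to make the suggestion concrete. The function name decode_with_fallbacks and its encodings parameter are hypothetical, not anything in SCons, and the final step assumes utf-8 as the encoding whose errors get ignored:

```python
def decode_with_fallbacks(contents, encodings=("utf-8", "utf-16", "windows-1252")):
    """Return `contents` (bytes) as str, trying each encoding in order.

    Hypothetical helper, not SCons code. If every strict decode fails,
    fall back to utf-8 with undecodable bytes dropped, so the caller
    always gets a str back.
    """
    for encoding in encodings:
        try:
            return contents.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: lossy, but never returns bytes.
    return contents.decode("utf-8", errors="ignore")


print(decode_with_fallbacks(b"Stra\xdfen"))
```

One caveat with this ordering: without a BOM, 'utf-16' accepts many even-length byte strings, so a latin-1 file could be decoded as nonsense before windows-1252 is ever tried. Trying utf-16 only when a BOM is present, or putting it after windows-1252, may be safer.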
On Mon, Jul 31, 2017 at 4:00 PM, Tim Jenness <tjenness at lsst.org> wrote:
>
> It might help if I attach the relevant file.
>
> After some more investigation I can report that this file uses latin-1 (or
> equivalent) rather than utf-8.
>
> >>> u'\xdf'
> 'ß'
> >>> u'\xdf'.encode('utf-8')
> b'\xc3\x9f'
> >>> u'\xdf'.encode('latin-1')
> b'\xdf'
>
> The chardet package does manage to detect an encoding that works for
> this content:
>
> >>> chardet.detect(xx)
> {'encoding': 'windows-1252', 'confidence': 0.7017158671586715}
>
> The real question is whether SCons is going to encounter decoding errors
> in parts of the file that are relevant to the scanner. Ignoring decoding
> errors is at least more correct than returning bytes, which breaks the
> API contract and causes breakage downstream. It looks like chardet is
> LGPL so that’s not going to help SCons.
>
>
>
> On Jul 31, 2017, at 15:17 , Tim Jenness <tjenness at lsst.org> wrote:
>
> I just found Bill’s replies to me on the mail archive but I don’t remember
> receiving them. Sorry about that.
>
> I’m not sure how to write a test case as such. An example file from
> boost::python that fails is
> http://www.boost.org/doc/libs/1_64_0/boost/python/detail/dealloc.hpp
> (although I imagine you can’t copy and paste from that link and would need
> to get the file from the source distribution). This file has a bad
> character in the author’s name that can’t be decoded and triggers errors
> when bytes turn up from get_text_contents(). The test should have a file
> with that problem and see if the scanner can read it.
>
> —
> Tim Jenness
>
>
> On Jul 31, 2017, at 14:57 , Tim Jenness <tjenness at lsst.org> wrote:
>
>
> On Jul 21, 2017, at 15:19 , Tim Jenness <tjenness at lsst.org> wrote:
>
> Now that I’ve thought about it a bit more I think the underlying problem
> is in engine/SCons/Node/FS.py around line 2630:
>
> 2608     def get_text_contents(self):
> 2609         """
> 2610         This attempts to figure out what the encoding of the text is
> 2611         based upon the BOM bytes, and then decodes the contents so that
> 2612         it's a valid python string.
> 2613         """
> 2614         contents = self.get_contents()
> 2615         # The behavior of various decode() methods and functions
> 2616         # w.r.t. the initial BOM bytes is different for different
> 2617         # encodings and/or Python versions. ('utf-8' does not strip
> 2618         # them, but has a 'utf-8-sig' which does; 'utf-16' seems to
> 2619         # strip them; etc.) Just sidestep all the complication by
> 2620         # explicitly stripping the BOM before we decode().
> 2621         if contents[:len(codecs.BOM_UTF8)] == codecs.BOM_UTF8:
> 2622             return contents[len(codecs.BOM_UTF8):].decode('utf-8')
> 2623         if contents[:len(codecs.BOM_UTF16_LE)] == codecs.BOM_UTF16_LE:
> 2624             return contents[len(codecs.BOM_UTF16_LE):].decode('utf-16-le')
> 2625         if contents[:len(codecs.BOM_UTF16_BE)] == codecs.BOM_UTF16_BE:
> 2626             return contents[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
> 2627         try:
> 2628             return contents.decode()
> 2629         except (UnicodeDecodeError, AttributeError) as e:
> 2630             return contents
>
> The problem is that if we fail to convert the bytes to Unicode the method
> returns the “text” contents in bytes. This breaks the contract of
> get_text_contents() promising to return a string.
>
> Removing the try block at line 2627 and instead using “return
> contents.decode(errors="ignore")” fixes my boost.python build problem.
>
>
> This problem still exists at
> https://bitbucket.org/scons/scons/src/cfbc036995c8669e296cc0427655345241a0097e/src/engine/SCons/Node/FS.py?at=default&fileviewer=file-view-default#FS.py-2630
>
> Should I be discussing this on the dev list? Sorry, but I don’t have time
> to learn how to do a bitbucket PR using hg. The good news is that our
> 500,000 lines of Python and C++ code built fine with my suggested fix in
> place.
>
> Patch is at
> https://github.com/lsst/scons/blob/tickets/DM-8560/patches/0001-always-decode-contents.patch
>
> —
> Tim Jenness
>
> _______________________________________________
> Scons-users mailing list
> Scons-users at scons.org
> https://pairlist4.pair.net/mailman/listinfo/scons-users
>
>
>
>
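
For comparison, the fix Tim describes in the quoted message (keep the BOM handling, but make the final decode ignore errors instead of returning bytes) might look roughly like this as a standalone function. This is only a sketch: decode_text_contents is a hypothetical name, it takes the raw bytes as an argument rather than calling self.get_contents(), and it passes "utf-8" explicitly, which is an assumption about the intended default:

```python
import codecs


def decode_text_contents(contents):
    """Decode bytes to str, stripping a leading BOM if one is present.

    Hypothetical standalone version of the approach in the quoted message,
    not the actual SCons method.
    """
    # Strip the BOM explicitly, as the quoted code does, then decode strictly.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if contents.startswith(bom):
            return contents[len(bom):].decode(encoding)
    # No BOM: decode as utf-8 and drop undecodable bytes, so the result
    # is always a str and never the raw bytes.
    return contents.decode("utf-8", errors="ignore")


print(repr(decode_text_contents(b"// Copyright \xdf 2002")))
```

The trade-off is that undecodable bytes are silently dropped, which should only matter if they fall in the parts of the file the scanner actually looks at.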