Created on 2018-09-25 18:05 by nascheme, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| codecs_bug.py | nascheme, 2018-09-25 18:05 | |
| Messages (5) | |||
|---|---|---|---|
| msg326382 | Author: Neil Schemenauer (nascheme) * (Python committer) | Date: 2018-09-25 18:05 | |
This seems to be a bug in codecs.getreader(). io.TextIOWrapper(fp, encoding) works correctly. |
|||
| msg327071 | Author: Karthikeyan Singaravelan (xtreak) * (Python committer) | Date: 2018-10-04 18:08 | |
When iterating over codecs.getreader('utf-8')(open('test.txt', 'rb')), the reader calls str.splitlines on the decoded data. As specified in [0], str.splitlines splits on a superset of universal newlines that includes '\x0b'. The reader therefore treats '\x0b' as a valid newline for strings, which is the documented behavior.
./python.exe
Python 3.8.0a0 (heads/master:6f85b826b5, Oct 4 2018, 22:44:36)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'first line\x0b\x0bblah blah\nsecond line\n' # returned by codecs.getreader()
>>> a.splitlines(keepends=True)
['first line\x0b', '\x0b', 'blah blah\n', 'second line\n']
# bytes.splitlines splits only on universal newlines, so it doesn't split on '\x0b' [1]
>>> b = b'first line\x0b\x0bblah blah\nsecond line\n'
>>> b.splitlines(keepends=True)
[b'first line\x0b\x0bblah blah\n', b'second line\n']
io.TextIOWrapper, however, accepts only None, '', '\n', '\r\n' and '\r' as the newline argument in text mode. Binary files are different: as noted in the readline documentation [2], they accept only b'\n' as the line terminator.
> The line terminator is always b'\n' for binary files; for text
> files, the newlines argument to open can be used to select the line
> terminator(s) recognized.
Thus, with io.TextIOWrapper, 'first line\x0b\x0bblah blah\nsecond line\n' gives ['first line\x0b\x0bblah blah\n', 'second line\n']. Trying to use '\x0b' as the newline argument results in an illegal newline error in TextIOWrapper.
Hope I am correct on the above analysis.
[0] https://docs.python.org/3.8/library/stdtypes.html#str.splitlines
[1] https://docs.python.org/3.8/library/stdtypes.html#bytes.splitlines
[2] https://docs.python.org/3/library/io.html#io.TextIOBase.readline
|
|||
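The difference described in the message above can be reproduced with a minimal sketch. The in-memory io.BytesIO stream is an assumption that stands in for the test.txt file mentioned there; the sample text is the one used in the interpreter session.

```python
import codecs
import io

# Sample text from the session above, encoded as it would be on disk.
data = 'first line\x0b\x0bblah blah\nsecond line\n'.encode('utf-8')

# codecs.getreader() returns a StreamReader class; iterating over an
# instance splits the decoded text with str.splitlines, so '\x0b' ends a line.
reader_lines = list(codecs.getreader('utf-8')(io.BytesIO(data)))

# io.TextIOWrapper with the default newline=None recognizes only
# '\n', '\r' and '\r\n' as line endings.
wrapper_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding='utf-8'))

print(reader_lines)
print(wrapper_lines)
```

The two lists join back to the same text; only the positions of the line breaks differ.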
| msg327082 | Author: Neil Schemenauer (nascheme) * (Python committer) | Date: 2018-10-04 20:17 | |
Thank you for the research. The problem is indeed that \v is being treated as a line separator. That is an intentional design choice; see https://bugs.python.org/issue12855
It would seem to have some surprising implications for CSV parsing. For example, if someone embeds a \v character in a quoted field, parsing the file using codecs.getreader() will cause the field to be split across two rows. Someone else has run into the same issue: https://www.enigma.com/blog/the-secret-world-of-newline-characters
I'm not sure anything should be done. Perhaps we should do something to reduce the chances that people trip over this issue. For example, if I want to parse a file containing Unicode text with the csv module, how do I do it while allowing \v characters (or other newline-like characters other than \n) within fields? |
|||
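One possible answer to the question above, offered as a sketch rather than a recommendation made in this thread: decode the bytes with io.TextIOWrapper and newline='', which is how the csv documentation suggests opening files. The sample data and the io.BytesIO stream are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical CSV payload with a vertical tab inside a quoted field.
raw = 'id,comment\r\n1,"before\x0bafter"\r\n'.encode('utf-8')

# newline='' disables newline translation, so line splitting inside
# fields is left to the csv parser, and '\x0b' is an ordinary character.
with io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8', newline='') as text:
    rows = list(csv.reader(text))

print(rows)
```

For a file on disk, open(path, encoding='utf-8', newline='') returns the same kind of text wrapper.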
| msg327084 | Author: Neil Schemenauer (nascheme) * (Python committer) | Date: 2018-10-04 20:38 | |
Perhaps the 'csv' module should do some sanity checking on the file passed to the reader. The docs recommend that newline='' be used to open the file. Maybe 'csv' could check that and warn if it's not the case. I poked around, but io file objects don't seem to have a handy property to check for that. Further, maybe 'csv' could check whether the file is a codecs.StreamReader object. In that case, there is no way to turn off the extra newline characters, so that's probably a bug. |
|||
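The second check suggested above could look roughly like the following sketch. The helper name check_csv_source, the warning category and the message text are all hypothetical; nothing like this exists in the csv module. It covers only the codecs case, because, as the message notes, there is no obvious property for detecting newline=''.

```python
import codecs
import io
import warnings


def check_csv_source(fileobj):
    """Hypothetical helper: warn if fileobj is a codecs stream reader.

    Returns True when no problem is detected, False otherwise.
    """
    # codecs.getreader() yields StreamReader subclasses; codecs.open()
    # returns a StreamReaderWriter, which wraps a StreamReader.
    if isinstance(fileobj, (codecs.StreamReader, codecs.StreamReaderWriter)):
        warnings.warn(
            "codecs stream readers also split lines on characters such as "
            "'\\x0b'; consider io.TextIOWrapper with newline=''",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


# A codecs reader triggers the warning; a plain text stream does not.
check_csv_source(codecs.getreader('utf-8')(io.BytesIO(b'a,b\n')))
check_csv_source(io.StringIO('a,b\n'))
```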
| msg327085 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018-10-04 21:01 | |
This is a duplicate of issue18291. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:06 | admin | set | github: 78982 |
| 2018-10-04 21:01:39 | serhiy.storchaka | set | status: open -> closed superseder: codecs.open interprets FS, RS, GS as line ends nosy: + serhiy.storchaka messages: + msg327085 resolution: duplicate stage: resolved |
| 2018-10-04 20:38:13 | nascheme | set | messages: + msg327084 |
| 2018-10-04 20:17:40 | nascheme | set | messages: + msg327082 |
| 2018-10-04 18:08:31 | xtreak | set | nosy: + xtreak messages: + msg327071 |
| 2018-09-25 18:05:24 | nascheme | create | |