Created on 2009-01-07 00:21 by amaury.forgeotdarc, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| io_utf16.patch | amaury.forgeotdarc, 2009-01-07 00:21 | |
| Messages (11) | |||
|---|---|---|---|
| msg79299 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009-01-07 00:21 | |
First write a utf-16 file with its signature:
>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>> f1.write('0123456789')
>>> f1.close()
Then read it twice:
>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>> print('read1', ascii(f2.read()))
read1 '0123456789'
>>> f2.seek(0)
0
>>> print('read2', ascii(f2.read()))
read2 '\ufeff0123456789'
The second read returns the BOM!
This is because the zero in seek(0) is a "cookie" which contains both the position
and the decoder state. Unfortunately, state=0 means 'endianness has been determined:
native order'.
Maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
The patch implements this idea.
|
|||
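The decoder-state explanation in msg79299 can be reproduced without any file by driving the stdlib incremental decoder directly. This is only a sketch: it mimics what seek(0) is described as doing by calling setstate((b"", 0)) by hand, which is an assumption taken from the message, not a copy of the TextIOWrapper code. The variable names are illustrative.

```python
import codecs

# BOM followed by the payload, in the platform's native byte order.
data = "0123456789".encode("utf-16")

decoder = codecs.getincrementaldecoder("utf-16")()

# State of a fresh decoder, before any byte order has been detected.
initial_state = decoder.getstate()

# First pass: the decoder consumes the BOM and returns only the text.
first = decoder.decode(data, final=True)

# Emulate a seek whose cookie carries "no pending bytes, flags == 0".
# Per the report, flags == 0 means "native order already determined",
# so the BOM is decoded as an ordinary U+FEFF character.
decoder.setstate((b"", 0))
second = decoder.decode(data, final=True)

# reset() puts the decoder back into its real start state instead.
decoder.reset()
third = decoder.decode(data, final=True)

print(ascii(first), ascii(second), ascii(third))
```

The point of the comparison is that a fresh decoder does not start in state (b"", 0), so a cookie of 0 cannot stand for "beginning of stream" unless seek() treats it specially.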
| msg79325 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009-01-07 11:46 | |
> This is because the zero in seek(0) is a "cookie"
> which contains both the position and the decoder state.
> Unfortunately, state=0 means 'endianness has been determined:
> native order'.
The problem may be that TextIOWrapper._pack_cookie() can create a
cookie=0. For example, to always produce a non-zero value, replace:
    def _pack_cookie(self, position, ...):
        return (position | (dec_flags<<64) | ...
    def _unpack_cookie(self, bigint):
        rest, position = divmod(bigint, 1<<64)
        ...
by
    def _pack_cookie(self, position, ...):
        return (1 | (position<<1) | (dec_flags<<65) | ...
    def _unpack_cookie(self, bigint):
        if not (bigint & 1):
            raise ValueError("invalid cookie")
        bigint >>= 1
        rest, position = divmod(bigint, 1<<64)
        ...
Why is the cookie an integer and not an object with attributes?
|
|||
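The marker-bit idea in msg79325 can be sketched as a standalone pair of functions. This is a simplified two-field version: the fields elided with `...` in the message are left out rather than guessed, and `pack_cookie`, `unpack_cookie`, and `POSITION_BITS` are names made up for this illustration, not TextIOWrapper's API.

```python
POSITION_BITS = 64  # width reserved for the byte position (assumed, as in the message)


def pack_cookie(position, dec_flags=0):
    """Pack position and decoder flags; the low bit marks a valid cookie."""
    if not 0 <= position < (1 << POSITION_BITS):
        raise ValueError("position out of range")
    if dec_flags < 0:
        raise ValueError("dec_flags must be non-negative")
    return 1 | (position << 1) | (dec_flags << (POSITION_BITS + 1))


def unpack_cookie(cookie):
    """Reverse pack_cookie(); reject integers without the marker bit."""
    if not (cookie & 1):
        raise ValueError("invalid cookie")
    cookie >>= 1
    dec_flags, position = divmod(cookie, 1 << POSITION_BITS)
    return position, dec_flags


start = pack_cookie(0, 0)
print(start, unpack_cookie(pack_cookie(10, 2)))
```

With this layout even the cookie for position 0 is non-zero, so a plain seek(0) from user code could no longer be mistaken for a packed cookie. Whether it should be rejected or treated as "rewind" is the design question discussed in the thread.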
| msg79326 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009-01-07 12:03 | |
> The problem is maybe that TextIOWrapper._pack_cookie()
> can create a cookie=0
But only when position==0. And in this case, at the beginning of the stream, it makes sense to reset everything to its initial value: zero for the various counts, and call decoder.reset().
|
|||
| msg79330 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-01-07 12:59 | |
Well, there are other problems with utf-16, e.g. when opening an
existing file for appending, the BOM is written again:
>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()
>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'
Who said TextIOWrapper was sane? :-o
|
|||
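One possible way to sidestep the duplicate BOM from application code is to read the existing signature and append with the matching endian-specific codec, which writes no BOM. This is a workaround sketch, not what the io module does. `append_utf16` is a hypothetical helper, and it assumes the file already starts with a UTF-16 BOM.

```python
import codecs
import os
import tempfile


def append_utf16(path, text):
    """Append text to a BOM-prefixed UTF-16 file without adding a second BOM."""
    with open(path, "rb") as raw:
        bom = raw.read(2)
    if bom == codecs.BOM_UTF16_LE:
        encoding = "utf-16-le"
    elif bom == codecs.BOM_UTF16_BE:
        encoding = "utf-16-be"
    else:
        raise ValueError("file does not start with a UTF-16 BOM")
    # The endian-specific codecs never emit a BOM when encoding.
    with open(path, "a", encoding=encoding) as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf16.txt")
    with open(path, "w", encoding="utf-16") as f:
        f.write("abc")
    append_utf16(path, "def")
    with open(path, "r", encoding="utf-16") as f:
        result = f.read()

print(ascii(result))
```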
| msg79410 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2009-01-08 13:50 | |
On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
>
> >>> f1 = open('utf16.txt', 'w', encoding='utf-16')
> >>> f1.write('0123456789')
> >>> f1.close()
>
> Then read it twice:
>
> >>> f2 = open('utf16.txt', 'r', encoding='utf-16')
> >>> print('read1', ascii(f2.read()))
> read1 '0123456789'
> >>> f2.seek(0)
> 0
> >>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
>
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined:
> native order'.
>
> Maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implements this idea.
This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.
Using .reset() will not help. The code for the StreamReader
and StreamWriter in utf_16.py will have to be modified to undo
the adjustment of the .encode() and .decode() method after using
.seek(0).
Note that there's also the case .seek(1) - I guess this must
be considered as resulting in undefined behavior.
|
|||
| msg80211 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-01-19 21:27 | |
I support Amaury's suggestion (actually I implemented it in the io-c branch). Resetting the decoder when seeking to the beginning of the stream is a reasonable way to deal with those incremental decoders for which the start state is something other than (b"", 0). (And, you're right, opening in append mode is a different problem...)
|
|||
| msg80222 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009-01-19 23:25 | |
I opened a different issue (#5006) for the duplicate BOM in append mode. |
|||
| msg83141 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009-03-04 21:36 | |
This has been fixed by the io-c branch merge. |
|||
| msg83167 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009-03-05 00:24 | |
> This has been fixed by the io-c branch merge.
Can you at least include the patch to test_io.py from Amaury's patch? And why not fix the Python version of the io module (I'm not sure of the new name: _pyio?) since we have a working patch?
|
|||
| msg83173 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009-03-05 00:42 | |
Ah, I forgot this wasn't applied to the Python implementation. Fixed in r70184. |
|||
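A regression check along the lines requested in msg83167 might look like the sketch below. It is not the test from the attached patch (whose contents are not shown in this thread). `read_twice` is a hypothetical helper, and trying utf-32 and utf-8-sig alongside utf-16 is an extra assumption that other BOM-writing codecs deserve the same check.

```python
import os
import tempfile


def read_twice(encoding, text="0123456789"):
    """Write text with a BOM-producing codec, then read, seek(0), and read again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bom.txt")
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            first = f.read()
            f.seek(0)
            second = f.read()
    return first, second


results = {enc: read_twice(enc) for enc in ("utf-16", "utf-32", "utf-8-sig")}
for enc, (first, second) in results.items():
    print(enc, ascii(first), ascii(second))
```

On an interpreter that contains the fix, both reads should be identical for every encoding. On an affected one, the second read would start with '\ufeff' as in the original report.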
| msg83176 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009-03-05 01:00 | |
@benjamin: ok, great. |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:43 | admin | set | github: 49112 |
| 2009-03-05 01:00:49 | vstinner | set | messages: + msg83176 |
| 2009-03-05 00:42:45 | benjamin.peterson | set | messages: + msg83173 |
| 2009-03-05 00:24:46 | vstinner | set | messages: + msg83167 |
| 2009-03-04 21:36:57 | benjamin.peterson | set | status: open -> closed; nosy: + benjamin.peterson; resolution: fixed; messages: + msg83141 |
| 2009-02-28 17:25:56 | benjamin.peterson | link | issue4565 dependencies |
| 2009-01-19 23:25:00 | vstinner | set | messages: + msg80222 |
| 2009-01-19 21:27:41 | pitrou | set | messages: + msg80211 |
| 2009-01-08 13:50:50 | lemburg | set | nosy: + lemburg; messages: + msg79410 |
| 2009-01-07 12:59:00 | pitrou | set | nosy: + pitrou; messages: + msg79330 |
| 2009-01-07 12:03:28 | amaury.forgeotdarc | set | messages: + msg79326 |
| 2009-01-07 11:46:50 | vstinner | set | messages: + msg79325 |
| 2009-01-07 09:23:22 | vstinner | set | nosy: + vstinner |
| 2009-01-07 00:21:16 | amaury.forgeotdarc | create | |