Created on 2010-01-07 03:03 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| open_bom.patch | vstinner, 2010-01-07 23:18 | |
| open_bom-2.patch | vstinner, 2010-01-07 23:41 | |
| open_bom-3.patch | vstinner, 2010-01-08 10:23 | |
| Messages (12) | |||
|---|---|---|---|
| msg97341 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-01-07 03:03 | |
If the file starts with a BOM, open(filename) should be able to guess the charset. It would be helpful for many high-level modules:
- #7519: ConfigParser
- #7185: csv
- and any module using open() to read a text file

Currently, the user has to choose between UTF-8 and UTF-8-SIG to skip the UTF-8 BOM. For UTF-16, the user has to specify UTF-16-LE or UTF-16-BE, even if the file starts with a BOM (which should be the case most of the time).

The idea is to delay the creation of the decoder and the encoder. Just after reading the first chunk, try to guess the charset by searching for a BOM (if the charset is unknown). If no BOM is found, fall back to the current guessing code (os.device_charset() or locale.getpreferredencoding()).

Affected charsets: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE. Binary files are not affected. If the encoding is specified to open(), the behaviour is unchanged.

I wrote a proof of concept, but there are still open issues:
- append mode: should we seek to zero to read the BOM? old=tell(); seek(0); bytes=read(4); seek(old); search_bom(bytes)
- read+write: should we guess the charset using the BOM if the first action is a write? Or only search for a BOM if the first action is a read? |
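The BOM search and the append-mode "seek to zero, read 4 bytes, seek back" idea from this message can be sketched as follows. This is only an illustration, not the code from the attached patches: the `BOMS` table and the `peek_bom` helper are hypothetical names, and `search_bom` is borrowed from the pseudocode in the message. The one subtle point is ordering: the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so UTF-32 has to be tested first.

```python
import codecs
import io

# Longest BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM,
# so UTF-32 must be tested before UTF-16.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def search_bom(head):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0


def peek_bom(raw):
    """Inspect the start of a seekable binary stream, then restore its position."""
    old = raw.tell()
    raw.seek(0)
    head = raw.read(4)
    raw.seek(old)
    return search_bom(head)


stream = io.BytesIO(codecs.BOM_UTF16_LE + "abc".encode("utf-16-le"))
stream.seek(0, io.SEEK_END)  # simulate a stream opened in append mode
print(peek_bom(stream))      # the stream position is left at the end
```

Returning the BOM length alongside the encoding lets the caller skip the mark itself when an endianness-specific codec such as `utf-16-le` is used, since those codecs do not strip it.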
|||
| msg97366 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-01-07 19:49 | |
You should ask on the mailing-list (python-dev) because this is an important behaviour change which I'm not sure will get accepted. |
|||
| msg97386 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-01-07 23:18 | |
open_bom.patch is the proof of concept. It only works in read mode. The idea is to delay the creation of the encoder and the decoder until just after the first read_chunk().
The patch changes the default behaviour of open(): if the file starts with a BOM, the BOM is used to select the encoding and is then skipped. Example:
-------------
from _pyio import open

# Write a file that starts with a UTF-8 BOM.
with open('test.txt', 'w', encoding='utf-8-sig') as fp:
    print("abc", file=fp)
    print("d\xe9f", file=fp)

# Read it back without specifying an encoding.
with open('test.txt', 'r') as fp:
    print("open().read(): {!r}".format(fp.read()))
-------------
Unpatched Python displays '\ufeffabc\ndéf\n', whereas patched Python displays 'abc\ndéf\n'.
|
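For comparison, the same effect is already available in unpatched Python when the caller names the codec explicitly. The sketch below uses only the standard library; the temporary file and variable names are illustrative. It shows that `utf-8` keeps the BOM as a leading U+FEFF character while `utf-8-sig` strips it.

```python
import codecs
import os
import tempfile

data = codecs.BOM_UTF8 + "abc\nd\xe9f\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as fp:
        fp.write(data)

    # Plain UTF-8 keeps the BOM as a leading U+FEFF character.
    with open(path, encoding="utf-8") as fp:
        plain = fp.read()

    # UTF-8-SIG strips the BOM if one is present.
    with open(path, encoding="utf-8-sig") as fp:
        stripped = fp.read()

print(repr(plain))
print(repr(stripped))
```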
|||
| msg97389 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-01-07 23:41 | |
Oops, this fixes the read() method of my previous patch. |
|||
| msg97406 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-01-08 10:23 | |
New version of the patch, which is shorter and cleaner, fixes the last bug (seek) and no longer changes the default behaviour (checking for a BOM is now explicit):
* BOM checking is now optional (explicit): use open(filename, encoding="BOM"). open(filename, "w", encoding="BOM") raises a ValueError.
* Create a BOMS dictionary directly in the codecs module
* Fix TextIOWrapper for seek(0) (add _has_bom attribute)
* Add a unit test for read() and readlines()
* Reading the encoding property before the first read gives None

I also removed the _get_encoding() method (hack). |
|||
| msg97455 | Author: Walter Dörwald (doerwalter) * (Python committer) | Date: 2010-01-09 10:45 | |
IMHO this is the wrong approach. As Martin v. Löwis suggested here http://mail.python.org/pipermail/python-dev/2010-January/094841.html the best solution would be a new codec (which he named sniff), that autodetects the encoding on reading. This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop. |
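A rough idea of what such a codec could look like, using only the public `codecs` registration API, is sketched below. This is not the codec proposed on the mailing list: the function names are hypothetical, the sketch is decode-only and stateless (no incremental decoder, so it is not suitable for open()), and falling back to UTF-8 when no BOM is present is an assumption made for the example.

```python
import codecs


def sniff_decode(data, errors="strict"):
    """Decode bytes, choosing the codec from a leading BOM if there is one."""
    data = bytes(data)  # bytes.decode() may hand the codec a memoryview
    # The BOM-aware codecs strip the mark themselves. UTF-32 is tested first
    # because its little-endian BOM starts with the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif data.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = "utf-8"  # assumption: fallback when no BOM is found
    return codecs.decode(data, encoding, errors), len(data)


def _sniff_encode(text, errors="strict"):
    # Sniffing only makes sense when reading.
    raise UnicodeError("the sniff codec is decode-only")


def _search(name):
    """Codec search function: answer only for the name 'sniff'."""
    if name == "sniff":
        return codecs.CodecInfo(
            name="sniff", encode=_sniff_encode, decode=sniff_decode
        )
    return None


codecs.register(_search)

print((codecs.BOM_UTF8 + b"abc").decode("sniff"))
print((codecs.BOM_UTF16_BE + "d\xe9f".encode("utf-16-be")).decode("sniff"))
```

A version usable with open() would additionally need an incremental decoder that buffers the first bytes until the BOM can be identified, which is where the questions about seek, append and read+write modes raised earlier in the thread come back.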
|||
| msg103082 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-04-13 20:12 | |
The link has gone. Is this the message you’re referring to? http://mail.python.org/pipermail/python-dev/2010-January/097115.html Regards |
|||
| msg103084 | Author: Walter Dörwald (doerwalter) * (Python committer) | Date: 2010-04-13 20:25 | |
Yes, that's the posting I was referring to. I wonder why the link is gone. |
|||
| msg111416 | Author: Łukasz Langa (lukasz.langa) * (Python committer) | Date: 2010-07-24 02:15 | |
I agree with MvL that this is a broader issue that shouldn't be patched in user code (e.g. #7519) but on the codec level. The sniff codec idea seems neat. |
|||
| msg164853 | Author: Łukasz Langa (lukasz.langa) * (Python committer) | Date: 2012-07-07 14:08 | |
After reading the mailing list thread at http://mail.python.org/pipermail/python-dev/2010-January/097102.html and weighing other concerns (e.g. how to behave in write-only and read-write modes), it looks like a PEP might be necessary to solve this once and for all. |
|||
| msg164854 | Author: Florent Xicluna (flox) * (Python committer) | Date: 2012-07-07 14:13 | |
For the implementation part, there's something which already plays with the BOM in the tokenize module. See tokenize.open(), which uses tokenize.detect_encoding() to read the BOM in some cases. |
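As a small illustration of the tokenize behaviour mentioned here, the sketch below feeds two in-memory byte streams to `tokenize.detect_encoding()`. Note that this function is meant for Python source files: it recognises only the UTF-8 BOM and PEP 263 coding cookies, not UTF-16 or UTF-32 BOMs. The variable names are illustrative.

```python
import codecs
import io
import tokenize

with_bom = io.BytesIO(codecs.BOM_UTF8 + b"x = 1\n")
without_bom = io.BytesIO(b"x = 1\n")

# detect_encoding() takes a readline callable and returns
# (encoding, lines_already_read).
encoding_bom, _ = tokenize.detect_encoding(with_bom.readline)
encoding_plain, _ = tokenize.detect_encoding(without_bom.readline)

print(encoding_bom)    # utf-8-sig
print(encoding_plain)  # utf-8
```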
|||
| msg178876 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013-01-03 01:24 | |
The idea was more or less rejected on the python-dev mailing list. I'm not really motivated to work on this issue since I never see any file starting with a BOM on Linux, and I'm only working on Linux. So I'm just closing this issue. If someone is motivated to work on this topic, I suppose that it would be better to reopen the discussion on the python-dev mailing list first. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:56 | admin | set | github: 51900 |
| 2015-01-08 00:51:26 | jdufresne | set | nosy: + jdufresne |
| 2015-01-06 19:52:05 | r.david.murray | link | issue23178 superseder |
| 2013-01-03 01:24:56 | vstinner | set | status: open -> closed; resolution: rejected; messages: + msg178876 |
| 2012-07-07 14:13:21 | flox | set | nosy: + flox; messages: + msg164854 |
| 2012-07-07 14:08:50 | lukasz.langa | set | type: enhancement |
| 2012-07-07 14:08:34 | lukasz.langa | set | messages: + msg164853; versions: + Python 3.4, - Python 2.7, Python 3.2 |
| 2012-03-20 12:33:40 | lukasz.langa | set | assignee: lukasz.langa |
| 2012-03-20 12:32:44 | lukasz.langa | link | issue14311 superseder |
| 2010-07-25 09:09:02 | BreamoreBoy | link | issue7519 superseder |
| 2010-07-24 02:15:01 | lukasz.langa | set | nosy: + lukasz.langa; messages: + msg111416 |
| 2010-04-13 20:25:47 | doerwalter | set | messages: + msg103084 |
| 2010-04-13 20:12:43 | eric.araujo | set | nosy: + eric.araujo; messages: + msg103082 |
| 2010-01-09 10:45:56 | doerwalter | set | nosy: + doerwalter; messages: + msg97455 |
| 2010-01-08 10:23:46 | vstinner | set | files: + open_bom-3.patch; messages: + msg97406 |
| 2010-01-07 23:41:19 | vstinner | set | files: + open_bom-2.patch; messages: + msg97389 |
| 2010-01-07 23:18:49 | vstinner | set | files: + open_bom.patch; keywords: + patch; messages: + msg97386 |
| 2010-01-07 19:49:18 | pitrou | set | nosy: + pitrou; messages: + msg97366 |
| 2010-01-07 03:03:55 | vstinner | create | |