Message97341
| Author |
vstinner |
| Recipients |
vstinner |
| Date |
2010-01-07.03:03:54 |
| SpamBayes Score |
2.6084103e-07 |
| Marked as misclassified |
No |
| Message-id |
<1262833437.24.0.308267564541.issue7651@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
If the file starts with a BOM, open(filename) should be able to guess the charset. This would be helpful for many high-level modules:
- #7519: ConfigParser
- #7185: csv
- and any module using open() to read a text file
Currently, the user has to choose between UTF-8 and UTF-8-SIG to skip the UTF-8 BOM. For UTF-16, the user has to specify UTF-16-LE or UTF-16-BE, even if the file starts with a BOM (which should be the case most of the time).
The idea is to delay the creation of the decoder and the encoder. Just after reading the first chunk, try to guess the charset by searching for a BOM (if the charset is unknown). If no BOM is found, fall back to the current guessing code (os.device_charset() or locale.getpreferredencoding()).
Affected charsets: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE. Binary files are not affected. If the encoding is passed to open(), the behaviour is unchanged.
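The detection step described above could look roughly like the following. This is only a minimal sketch, not the proof of concept itself: read_text() is a hypothetical helper that reads the whole file at once instead of delaying the decoder creation inside the text layer, and search_bom() is written here as an assumption about what such a function might return.

```python
import codecs
import locale

# Check the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00) starts with
# the UTF-16-LE BOM (FF FE), so the order matters.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def search_bom(first_bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in _BOMS:
        if first_bytes.startswith(bom):
            return encoding, len(bom)
    return None, 0


def read_text(path):
    """Hypothetical helper: decode a file, guessing the charset from its BOM."""
    with open(path, "rb") as f:
        data = f.read()
    encoding, bom_length = search_bom(data[:4])
    if encoding is None:
        # No BOM: fall back to the locale-based guess.
        encoding = locale.getpreferredencoding(False)
    # Skip the BOM so that it does not show up in the decoded text.
    return data[bom_length:].decode(encoding)
```

Note that a UTF-16-LE file whose first character after the BOM is U+0000 would be taken for UTF-32-LE by this ordering; the sketch accepts that ambiguity.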
I wrote a proof of concept, but there are still open issues:
- append mode: should we seek to zero to read the BOM?
old = tell(); seek(0); head = read(4); seek(old); search_bom(head)
- read+write: should we guess the charset using the BOM if the first action is a write? Or should we only search for a BOM if the first action is a read? |
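For the append-mode question, the seek/read/seek sequence can be sketched as follows. peek_head() is a hypothetical name, and the sketch assumes a seekable binary stream that is also readable; a file opened with plain "ab" is not readable, so a real implementation would have to open it differently or skip the detection.

```python
import io


def peek_head(raw, size=4):
    """Read up to `size` bytes from the start of a seekable binary stream,
    then restore the previous position."""
    old = raw.tell()
    try:
        raw.seek(0)
        return raw.read(size)
    finally:
        # Restore the position even if the read fails.
        raw.seek(old)


# Usage: the stream is positioned at the end, as it would be in append mode.
buf = io.BytesIO(b"\xef\xbb\xbfhello")
buf.seek(0, io.SEEK_END)
head = peek_head(buf)
```

The returned bytes can then be checked for a BOM before the encoder is created, without disturbing where the next write lands.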
|
History
|
|---|
| Date |
User |
Action |
Args |
| 2010-01-07 03:03:57 | vstinner | set | recipients:
+ vstinner |
| 2010-01-07 03:03:57 | vstinner | set | messageid: <1262833437.24.0.308267564541.issue7651@psf.upfronthosting.co.za> |
| 2010-01-07 03:03:55 | vstinner | link | issue7651 messages |
| 2010-01-07 03:03:54 | vstinner | create |
|