This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.
Created on 2009-10-22 10:46 by W00D00, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| distances.csv | W00D00, 2009-10-22 10:46 | csv file with BOM |
| Messages (11) | |||
|---|---|---|---|
| msg94340 | Author: Istvan Szirtes (W00D00) | Date: 2009-10-22 10:46 | |
The csv module tries to read a .csv file that is encoded in UTF-8 with a UTF-8 BOM. The first row in the csv file is
["value","vocal","vocal","vocal","vocal"]
in hex:
"value","vocal","vocal","vocal","vocal"
The reader cannot read the first row correctly, and if I seek back to 0 somewhere in the file I get a wrong result like this:
['\ufeff"value"', 'vocal', 'vocal', 'vocal', 'vocal']
I think the csv reader does not handle seeking correctly. I attached a test file for the bug, and here is my code:
import codecs
import csv

InDistancesFile = codecs.open('..\\distances.csv', 'r', encoding='utf-8')
InDistancesObj = csv.reader(InDistancesFile)
for Row in InDistancesObj:
    if Row[0] == '20':
        print(Row)
        break
InDistancesFile.seek(0)
for Row in InDistancesObj:
    print(Row)
|
|||
| msg94341 | Author: Walter Dörwald (doerwalter) (Python committer) | Date: 2009-10-22 14:03 | |
http://docs.python.org/library/csv.html#module-csv states: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples. These restrictions will be removed in the future. |
|||
| msg94345 | Author: R. David Murray (r.david.murray) (Python committer) | Date: 2009-10-22 14:51 | |
The restrictions were theoretically removed in 3.1, and the 3.1 documentation has been updated to reflect that. If 3.1 CSV doesn't handle unicode, then that is a bug. |
|||
| msg94346 | Author: Walter Dörwald (doerwalter) (Python committer) | Date: 2009-10-22 14:53 | |
Then the solution should simply be to use "utf-8-sig" as the encoding, instead of "utf-8". |
|||
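The difference between the two codecs can be illustrated with a minimal sketch. The sample bytes and the helper `read_rows` are hypothetical and not part of the original report; the sketch assumes Python 3 and uses an in-memory stream instead of the attached distances.csv.

```python
import csv
import io

# Hypothetical sample data: a UTF-8 BOM followed by two CSV rows.
data = b'\xef\xbb\xbf"value","vocal"\r\n"20","a"\r\n'

def read_rows(raw, encoding):
    """Decode raw bytes with the given encoding and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline='')
    return list(csv.reader(text))

rows_plain = read_rows(data, 'utf-8')    # the BOM stays in the first field
rows_sig = read_rows(data, 'utf-8-sig')  # the codec strips the BOM

print(rows_plain[0])
print(rows_sig[0])
```

With plain "utf-8" the BOM character precedes the opening quote, so the csv parser treats the first field as unquoted and keeps the quote characters, which matches the row shown in the original report.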
| msg94365 | Author: R. David Murray (r.david.murray) (Python committer) | Date: 2009-10-22 16:32 | |
In that case we should update the docs. Istvan, can you confirm that this solves your problem? |
|||
| msg94403 | Author: Istvan Szirtes (W00D00) | Date: 2009-10-24 08:13 | |
Hi Everyone,
I have tried "utf-8-sig" and it does not work in this case, or rather I think it is not the csv module that is wrong. seek() does not work correctly on the csv file or object. With "utf-8-sig" the file is opened correctly and the first row no longer has the BOM problem. That is great. I am sorry, I did not know about this until now. (I am not a Python expert yet :))
However, I got some errors like 'AFTE\ufeffVALUE".WAV' while running my script. "AFTER" is a valid string in the given csv file, but the BOM follows it. This happens after I seek back to 0 several times in the csv file. The string "aftevalue" LEAVE_HIGHWAY-E" is also produced, which is wrong.
My solution is to convert the csv object into a list after opening the file:
InDistancesFile = codecs.open(Root, 'r', encoding='utf-8')
txt = InDistancesFile.read()[1:]  # skip the BOM
lines = txt.splitlines()[1:]  # skip the first row, which is a header
InDistancesObj = list(csv.reader(lines))  # convert the csv reader object into a simple list
Many thanks for your help,
Istvan
|
|||
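A variant of the workaround above might let the "utf-8-sig" codec strip the BOM instead of slicing off the first character by hand. This is only a sketch: the helper `load_rows` and the sample data are hypothetical, and it assumes Python 3 with an in-memory stream standing in for the real file.

```python
import csv
import io

def load_rows(stream, skip_header=True):
    """Read every CSV row from a text stream into a list."""
    rows = list(csv.reader(stream))
    return rows[1:] if skip_header else rows

# Hypothetical in-memory stand-in for the BOM-prefixed distances.csv.
raw = '\ufeff"value","vocal"\r\n"20","AFTER"\r\n"30","BEFORE"\r\n'.encode('utf-8')

with io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig', newline='') as stream:
    rows = load_rows(stream)

# The list can be scanned repeatedly without seeking in the file.
matches = [row for row in rows if row[0] == '20']
print(matches)
```

Reading the rows once and scanning the list avoids seek() entirely, at the cost of holding the whole file in memory.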
| msg159483 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2012-04-27 19:00 | |
I checked. Files opened with "utf-8-sig" are seekable.
>>> open('test', 'w', encoding='utf-8-sig').write('qwerty\nйцукен\n')
>>> open('test', 'r', encoding="utf-8").read()
'\ufeffqwerty\nйцукен\n'
>>> open('test', 'r', encoding="utf-8-sig").read()
'qwerty\nйцукен\n'
>>> with open('test', 'r', encoding="utf-8-sig") as f:
... print(ascii(f.readline()))
... f.seek(0)
... print(ascii(f.readline()))
...
'qwerty\n'
0
'qwerty\n'
Should this issue be closed?
|
|||
| msg159490 | Author: R. David Murray (r.david.murray) (Python committer) | Date: 2012-04-27 20:12 | |
Serhiy, the bug is about csv in particular. Can you confirm that using utf-8-sig allows one to process a file with a bom using the csv module? |
|||
| msg159494 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2012-04-27 21:26 | |
I ran the script above (only replacing 'utf-8' with 'utf-8-sig') and did not see anything strange. I looked at the source (csv.py and _csv.c) and also did not see anything that could lead to this effect. If the bug exists, it is in the utf-8-sig codec and should show up in other cases as well. There is nothing special about csv here. |
|||
| msg159506 | Author: R. David Murray (r.david.murray) (Python committer) | Date: 2012-04-28 00:04 | |
I wasn't sure which script you were referring to, so I checked it myself and got the same results as you: after the seek(0) on the file object opened with utf-8-sig, csv read all the lines in the file, including reading the header line correctly. So, let's close this. |
|||
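The check described above could be reproduced roughly as follows. This is a sketch under assumptions: it uses the built-in open() rather than codecs.open(), and a small temporary file with made-up rows in place of the attached distances.csv.

```python
import csv
import os
import tempfile

# Write a small BOM-prefixed CSV file (hypothetical stand-in for distances.csv).
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', encoding='utf-8-sig', newline='') as out:
    csv.writer(out).writerows(
        [['value', 'vocal'], ['10', 'a'], ['20', 'b'], ['30', 'c']]
    )

try:
    with open(path, 'r', encoding='utf-8-sig', newline='') as f:
        reader = csv.reader(f)
        first_pass = []
        for row in reader:
            first_pass.append(row)
            if row[0] == '20':
                break  # stop partway through, as in the original script
        f.seek(0)  # rewind the underlying file object
        second_pass = list(reader)  # the same reader continues from the start
finally:
    os.remove(path)

print(second_pass)
```

After the seek, the header row comes back without a BOM because the text layer resets the codec when rewinding to position 0.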
| msg159509 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2012-04-28 05:43 | |
I was referring to the script inlined in the message http://bugs.python.org/issue7185#msg94340 . |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:54 | admin | set | github: 51434 |
| 2012-04-28 05:43:56 | serhiy.storchaka | set | messages: + msg159509 |
| 2012-04-28 00:04:08 | r.david.murray | set | status: open -> closed; resolution: not a bug; messages: + msg159506; stage: needs patch -> resolved |
| 2012-04-27 21:26:06 | serhiy.storchaka | set | messages: + msg159494 |
| 2012-04-27 20:12:25 | r.david.murray | set | messages: + msg159490 |
| 2012-04-27 19:00:45 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg159483 |
| 2010-10-29 10:07:21 | admin | set | assignee: georg.brandl -> docs@python |
| 2010-05-20 20:32:22 | skip.montanaro | set | nosy: - skip.montanaro |
| 2009-10-24 08:13:54 | W00D00 | set | messages: + msg94403 |
| 2009-10-22 18:05:16 | skip.montanaro | set | nosy: + skip.montanaro |
| 2009-10-22 16:32:39 | r.david.murray | set | assignee: georg.brandl; components: + Documentation; versions: + Python 3.2; nosy: + georg.brandl; messages: + msg94365; stage: test needed -> needs patch |
| 2009-10-22 14:53:53 | doerwalter | set | messages: + msg94346 |
| 2009-10-22 14:51:15 | r.david.murray | set | priority: normal; nosy: + r.david.murray; messages: + msg94345; type: compile error -> behavior; stage: test needed |
| 2009-10-22 14:03:53 | doerwalter | set | nosy: + doerwalter; messages: + msg94341 |
| 2009-10-22 10:46:05 | W00D00 | create | |