This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2010-08-13 23:06 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (16) | |||
|---|---|---|---|
| msg113849 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-08-13 23:06 | |
For example: $ ./python.exe Tools/scripts/untabify.py Modules/_heapqmodule.c Traceback (most recent call last): ... (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 173: invalid continuation byte I am not sure what the relevant C standard has to say about using non-ASCII characters in comments, but the checking tool should not fail with a traceback in such a situation. |
|||
| msg114198 - (view) | Author: PCManticore (Claudiu.Popa) * (Python triager) | Date: 2010-08-18 06:01 | |
Hello. As it seems, untabify.py opens the file using the builtin function open, which makes the call error-prone when it encounters non-ASCII characters. The proper handling would be to use open from the codecs library, specifying the encoding as an argument, e.g. codecs.open(filename, mode, 'utf-8') instead of simply open(filename, mode). |
|||
| msg114948 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-08-25 23:26 | |
The builtin open in 3.2 is similar to codecs.open. If you read the error message closely, you’ll see that the decoding that failed did try to use UTF-8. The cause of the problem here is that the bytes used for the ç in François’ name are not valid UTF-8; I can fix that. This does not change the original purpose of this report: untabify should not die. |
|||
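
The failure mode discussed above can be reproduced in isolation. This is a minimal sketch, assuming the offending comment was saved as Latin-1; the sample bytes are illustrative and are not copied from Modules/_heapqmodule.c, and `try_decode` is a hypothetical helper. In Latin-1, ç is the single byte 0xe7, which UTF-8 treats as the start of a multi-byte sequence, so the ASCII byte that follows it is rejected.

```python
# Illustrative bytes only: "François" encoded as Latin-1, where ç is 0xe7.
latin1_bytes = "François".encode("latin-1")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# UTF-8 decoding fails at the 0xe7 byte (offset 4 in this sample).
text, error = try_decode(latin1_bytes, "utf-8")
print(error)

# Latin-1 decoding succeeds because every byte maps to some character.
text, error = try_decode(latin1_bytes, "latin-1")
print(ascii(text))
```

The same bytes are therefore not "non-ASCII text that UTF-8 cannot handle" but simply text stored in a different encoding than the one used to read it.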
| msg115517 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-03 22:14 | |
Fixed encoding error in r84472 through r84474. This bug should be reassessed and retitled. If untabify fails because a file has an incorrect encoding, is it really a problem in untabify? This is a developer’s tool, so getting a traceback here seems okay to me. Alexander, please close if you agree. |
|||
| msg115527 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-09-03 22:47 | |
> If untabify fails because a file has an incorrect encoding, is it really a problem in untabify? This is a developer’s tool, so getting a traceback here seems okay to me. I disagree. I think we should use this opportunity to clarify the preferred encoding for C language source files in Python and make untabify produce a meaningful diagnostic in case of encoding errors. As a matter of policy, I see two possibilities: 1. Restrict C sources to 7-bit ASCII. (A pedantic reading of the ANSI C standard would probably suggest an even more restricted character set, but practically, I don't think 7-bit ASCII in C comments is likely to cause problems for any tools.) 2. Require UTF-8 encoding for non-ASCII characters. Given that this is the default for Python source code, it is likely that tools that are used for Python development can handle UTF-8. My vote is for #1. Display of non-ASCII characters is still not universally supported and they are likely to be clobbered when diffs are copied in e-mails etc. |
|||
| msg115534 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-03 22:58 | |
I agree about the need to define the encoding for comments. My vote goes to #2, since I wouldn’t want to see names of authors/contributors mangled in the source. I would reconsider if a specification explicitly forbade that. I repeat that the title of this bug is misleading: untabify does not fail with non-ASCII bytes, it failed because of invalid bytes. |
|||
| msg115540 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-09-03 23:18 | |
> I wouldn’t want to see names of authors/contributors mangled in the source. This is a reason to write names in ASCII. While Latin-1 is a grey area because most of its characters look familiar to English-speaking developers, I don't think you will easily recognize my name if I write it in Cyrillic and even if you do, chances are you would not be able to search for it. On the other hand, everyone who uses e-mail is likely to have a preferred ASCII spelling of his/her name. |
|||
| msg115548 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-04 00:25 | |
>> I wouldn’t want to see names of authors/contributors mangled in the source. > This is a reason to write names in ASCII. Oh, sorry, by "mangled" I meant "forced into ASCII". I was not speaking about mojibake. > While Latin-1 is a grey area because most of [its] characters look familiar to English-speaking developers, I don’t think there is an argument for Latin-1. Also, Latin-1 does not have characters but bytes, which are displayed as characters by good editors, like UTF-8 bytes are. The discussion is about ASCII versus UTF-8 in my opinion, let Latin-1 rest in peace. > I don't think you will easily recognize my name if I write it in Cyrillic and even if you do, chances are you would not be able to search for it. Not so good an example, since I’ve seen your name in the thread about Misc/ACKS sorting and could recognize it, but I get your idea :) To search, I would use the "search for word under cursor" functionality. > On the other hand, everyone who uses e-mail is likely to have a preferred ASCII spelling of his/her name. Well, some languages have rules to handle constrained environments, like German, which may use oe for ö, or Italian E' for È, but for example in French there is no such workaround. Leaving accents out of words is a spelling error, nothing more or less. When I’m forced to change my name because of broken old tools I really feel the programmers behind the tool could do better. (I happen to have an ASCII-compatible nickname, which I prefer using to the ASCII-maimed version of my name where I can.) I feel 2010 is very late to accept that we live in a wide world and that people should be able to just use their names with computer systems. By the way, you still haven’t retitled this bug to address my other remark :) |
|||
| msg115571 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010-09-04 13:19 | |
Other C files converted from latin-1 to utf-8 with r84485. |
|||
| msg115824 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-09-07 23:44 | |
From IRC: Me: UTF-8 was not strictly valid in ANSI C comments, so it is a bug in untabify to assume UTF-8 in C files. Merwok: Works for me. I am lowering the priority because it looks like untabify does not fail on the current code base. I'll follow up on python-dev to find out whether ASCII or UTF-8 should be enforced by untabify. |
|||
| msg115828 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-08 00:08 | |
Why would it be the job of untabify to report invalid non-ASCII characters in C files? |
|||
| msg115830 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-09-08 00:29 | |
On Tue, Sep 7, 2010 at 8:08 PM, Éric Araujo wrote: > Why would it be the job of untabify to report invalid non-ASCII characters in C files? Since untabify works by loading C code as text, it has to assume some encoding. Failing with an uncaught decode error (as it currently does on non-UTF-8 source) is not very user friendly. For example, the diagnostic does not report the position of the offending character and does not explain how to fix the source. |
|||
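
One way to produce the friendlier diagnostic asked for above is to decode the raw bytes explicitly and translate the failing byte offset into a line and column. This is a minimal sketch under stated assumptions: `locate_decode_error` is a hypothetical helper, not part of untabify.py, UTF-8 is assumed to be the expected encoding, and the sample bytes are made up.

```python
def locate_decode_error(data, encoding="utf-8"):
    """Return (line, column, byte) of the first undecodable byte, or None.

    Hypothetical helper for illustration; line and column are 1-based
    and counted in bytes, not characters.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        offset = exc.start
        line = data.count(b"\n", 0, offset) + 1
        line_start = data.rfind(b"\n", 0, offset) + 1
        return line, offset - line_start + 1, data[offset]
    return None


# Made-up C source with a Latin-1 encoded ç (0xe7) in a comment.
sample = b"/* ok */\n/* Fran\xe7ois */\n"
location = locate_decode_error(sample)
if location is not None:
    line, column, byte = location
    print(f"line {line}, column {column}: byte 0x{byte:02x} is not valid UTF-8; "
          "re-save the file as UTF-8 or use an ASCII spelling")
```

Reporting the location and a suggested fix addresses both complaints in the message: where the offending byte is and what to do about it.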
| msg115831 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-08 00:31 | |
My real question was: Shouldn’t this be a VCS hook instead of untabify’s job? (or in addition to untabify if you insist) |
|||
| msg115837 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-09-08 01:11 | |
On Tue, Sep 7, 2010 at 8:31 PM, Éric Araujo wrote: > My real question was: Shouldn’t this be a VCS hook instead of untabify’s job? (or in addition to untabify if you insist) Yes, a VCS hook makes sense (and may almost eliminate the need to handle invalid bytestreams in untabify). The hard question is still the same, though: are non-ASCII characters allowed in Python C code? My answer is "no". |
|||
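
If the ASCII-only policy were adopted, the check a VCS hook would need is small. The sketch below is an assumption about what such a check could look like, not an existing hook: `find_non_ascii` is a hypothetical name, and how the function would be wired into a particular version control system is left out.

```python
def find_non_ascii(data):
    """Return a list of (line, column, byte) for every byte above 0x7f.

    Hypothetical check for illustration; positions are 1-based and
    counted in bytes.
    """
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for column, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((lineno, column, byte))
    return hits


# Made-up file contents: one pure-ASCII, one with a UTF-8 encoded ç.
clean = b"int x; /* plain ASCII */\n"
dirty = b"int x;\n/* \xc3\xa7 */\n"
print(find_non_ascii(clean))
print(find_non_ascii(dirty))
```

Because the check works on raw bytes, it never has to guess an encoding, which sidesteps the decode error that started this report.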
| msg115838 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-09-08 01:13 | |
I agree with your reply (that’s what I meant with "works for me", the question about untabify vs. hooks only occurred to me after our IRC exchange). |
|||
| msg122923 - (view) | Author: Alexander Belopolsky (belopolsky) * (Python committer) | Date: 2010-11-30 17:32 | |
Committed revision 86893, which makes untabify.py respect the encoding cookie in the files it processes. I don't think there is anything else that needs to be done here. |
|||
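
The closing message does not show the change itself. As a rough illustration of what respecting an encoding cookie can look like, the sketch below uses the standard library's `tokenize.detect_encoding`, which reads a PEP 263 cookie or a UTF-8 BOM from the first two lines and otherwise falls back to UTF-8. Whether revision 86893 uses this exact call is an assumption; `read_respecting_cookie` and `demo` are hypothetical names, and the file contents are made up.

```python
import os
import tempfile
import tokenize


def read_respecting_cookie(path):
    """Read a file as text using the encoding declared in its cookie."""
    with open(path, "rb") as raw:
        # detect_encoding returns the encoding name and the lines it consumed.
        encoding, _ = tokenize.detect_encoding(raw.readline)
    with open(path, encoding=encoding) as text_file:
        return encoding, text_file.read()


def demo():
    """Write a Latin-1 file with a cookie to a temp path and read it back."""
    source = b"# -*- coding: latin-1 -*-\nname = 'Fran\xe7ois'\n"
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
        tmp.write(source)
    try:
        return read_respecting_cookie(tmp.name)
    finally:
        os.unlink(tmp.name)


encoding, text = demo()
print(encoding)
```

A file without a cookie is still read as UTF-8 under this approach, so C sources containing bytes that are not valid UTF-8 would continue to raise a decode error unless they declare an encoding.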
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:05 | admin | set | github: 53807 |
| 2010-12-30 22:14:16 | georg.brandl | unlink | issue7962 dependencies |
| 2010-11-30 17:32:22 | belopolsky | set | status: open -> closed resolution: fixed messages: + msg122923 stage: resolved |
| 2010-09-08 01:13:13 | eric.araujo | set | messages: + msg115838 |
| 2010-09-08 01:11:42 | belopolsky | set | messages: + msg115837 |
| 2010-09-08 00:31:53 | eric.araujo | set | messages: + msg115831 |
| 2010-09-08 00:29:52 | belopolsky | set | messages: + msg115830 |
| 2010-09-08 00:08:29 | eric.araujo | set | messages: + msg115828 |
| 2010-09-07 23:44:29 | belopolsky | set | priority: normal -> low assignee: belopolsky messages: + msg115824 |
| 2010-09-04 13:19:45 | flox | set | nosy: + flox messages: + msg115571 components: + Unicode |
| 2010-09-04 00:25:48 | eric.araujo | set | messages: + msg115548 |
| 2010-09-03 23:18:08 | belopolsky | set | messages: + msg115540 |
| 2010-09-03 22:58:57 | eric.araujo | set | messages: + msg115534 |
| 2010-09-03 22:47:05 | belopolsky | set | messages: + msg115527 |
| 2010-09-03 22:14:10 | eric.araujo | set | nosy: + pitrou messages: + msg115517 |
| 2010-08-25 23:26:23 | eric.araujo | set | messages: + msg114948 |
| 2010-08-18 06:01:53 | Claudiu.Popa | set | nosy: + Claudiu.Popa messages: + msg114198 |
| 2010-08-13 23:08:48 | belopolsky | set | nosy: + eric.araujo |
| 2010-08-13 23:06:43 | belopolsky | link | issue7962 dependencies |
| 2010-08-13 23:06:12 | belopolsky | create | |