Created on 2009-09-24 13:54 by fenner, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Messages (10) | |||
|---|---|---|---|
| msg93074 - (view) | Author: Bill Fenner (fenner) | Date: 2009-09-24 13:54 | |
In Python 2.5, shlex handled unicode input fine:
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']
In Python 2.6, shlex turns unicode input into UCS-4 output, thus utterly confusing execl:
Python 2.6 (r26:66714, Jun 8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']
Even weirder, the two return strings have different byte order (see 'H\x00\x00\x00' vs. '\x00\x00\x00W'!)
|
|||
| msg93075 - (view) | Author: Bill Fenner (fenner) | Date: 2009-09-24 14:00 | |
A colleague pointed out that the bad behavior was introduced in 2.5.2:
Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']
|
|||
| msg93079 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009-09-24 16:12 | |
I'll take the opposite point of view: the bad behavior was introduced with 2.5.1 (issue1548891, r52302), and reverted for 2.5.2 because "it broke backwards compatibility with arbitrary read buffers" (issue1730114, r53831).
The difference is in cStringIO:
>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'
The byte order is not different in the two strings: u" " becomes " \x00\x00\x00", and the three zeros after the space are copied into the second item.
|
|||
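Editorial sketch of the explanation in msg93079, written for Python 3 since cStringIO no longer exists there. It assumes a little-endian UCS-4 build like the ones in the reports above and uses the `utf-32-le` codec as a stand-in for the raw unicode buffer that cStringIO exposed; the variable names are illustrative only.

```python
# Stand-in for the raw 4-bytes-per-character buffer that cStringIO returned
# on a UCS-4 build (assumption: little-endian, as on the reported machines).
text = "Hello, World!"
raw = text.encode("utf-32-le")

# Splitting on the single byte 0x20 mimics a byte-oriented tokenizer: the
# space character is b" \x00\x00\x00", so its three trailing zero bytes
# stay attached to the front of the second piece.
pieces = raw.split(b" ")

first, second = pieces
print(repr(first))   # starts with b"H\x00\x00\x00"
print(repr(second))  # starts with b"\x00\x00\x00W"
```

Both pieces have the same byte order; the second one merely looks shifted because it begins with the leftover bytes of the space.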
| msg93080 - (view) | Author: Bill Fenner (fenner) | Date: 2009-09-24 17:21 | |
so, just to be clear, your position is that the output of shlex.split( u'Hello, World!' ) is *supposed* to be ['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']? |
|||
| msg93082 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-09-24 17:34 | |
Hm, while the StringIO behaviour supposedly cannot be changed for backwards-compatibility reasons, we can probably improve shlex behaviour with unicode strings. |
|||
| msg93083 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009-09-24 17:48 | |
(Presented this way, "my opinion" becomes difficult to stand...
OTOH the docs say that the module does not support Unicode, so it's not strictly a bug)
http://docs.python.org/library/shlex.html
Yes, shlex could be improved and encode unicode strings to ascii.
|
|||
| msg93084 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2009-09-24 18:17 | |
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> (Presented this way, "my opinion" becomes difficult to stand...
> OTOH the docs say that the module does not support Unicode, so it's not
> strictly a bug)
> http://docs.python.org/library/shlex.html
>
> Yes, shlex could be improved and encode unicode strings to ascii.
I'd suggest converting Unicode input to a string using an optional encoding parameter which defaults to 'utf-8' (most shells nowadays default to UTF-8).
This is only a compromise, though, albeit a practical one. POSIX has the notion of a portable character set:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3
which is pretty much the same as ASCII. Any ASCII-compatible encoding is then allowed via variable-length encodings (see further down on that page).
|
|||
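Editorial sketch of the suggestion in msg93084 (encode with an optional `encoding` parameter defaulting to 'utf-8'). This is not an API that shlex provides: `split_to_bytes` is a hypothetical wrapper name, and the Python 3 branch is an assumption added so the example runs on current interpreters, where shlex accepts text directly.

```python
import shlex
import sys


def split_to_bytes(text, encoding="utf-8"):
    """Hypothetical helper: split shell-like text into byte-string tokens."""
    if sys.version_info[0] == 2:
        # Python 2: encode unicode input first so shlex (and the cStringIO
        # object it uses internally) only ever sees a plain byte string.
        if not isinstance(text, str):
            text = text.encode(encoding)
        return shlex.split(text)
    # Python 3: shlex handles text natively, so split first and encode
    # each token afterwards.
    return [token.encode(encoding) for token in shlex.split(text)]


print(split_to_bytes(u"Hello, World!"))
```

The tokens come back as byte strings in the chosen encoding, so a caller can pick whatever encoding the target shell or program expects, for example `split_to_bytes(u"caf\xe9 au lait", encoding="latin-1")`.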
| msg93085 - (view) | Author: Bill Fenner (fenner) | Date: 2009-09-24 18:24 | |
Sorry, I didn't read the web documentation, only the module documentation, which doesn't mention Unicode. I'd agree that since it's a documented behavior, this bug can become:
- an RFE for shlex to handle Unicode
- meanwhile, if there will be any releases before that happens, an RFE for the module documentation to mention the lack of Unicode support
|
|||
| msg112705 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2010-08-03 22:08 | |
The discussion pretty much says this was a feature request, which is obsolete for 2.x. Not an issue for 3.x:
>>> import shlex
>>> shlex.split('Hello, World!' )
['Hello,', 'World!']
|
|||
| msg146200 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011-10-22 23:12 | |
$ ./python
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']
This was fixed indirectly by a StringIO fix in 27ae7d4e1983, for #1548891.
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:53 | admin | set | github: 51237 |
| 2011-10-22 23:12:11 | eric.araujo | set | nosy: + eric.araujo messages: + msg146200 |
| 2010-08-03 22:08:52 | terry.reedy | set | status: open -> closed nosy: + terry.reedy messages: + msg112705 type: enhancement resolution: out of date |
| 2009-09-25 05:41:20 | ezio.melotti | set | priority: normal nosy: + ezio.melotti |
| 2009-09-24 18:24:57 | fenner | set | messages: + msg93085 title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output |
| 2009-09-24 18:17:58 | lemburg | set | nosy: + lemburg title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output with varying byte order messages: + msg93084 |
| 2009-09-24 17:48:47 | amaury.forgeotdarc | set | resolution: wont fix -> (no value) messages: + msg93083 |
| 2009-09-24 17:34:39 | pitrou | set | nosy: + pitrou messages: + msg93082 |
| 2009-09-24 17:21:18 | fenner | set | status: pending -> open messages: + msg93080 |
| 2009-09-24 16:12:15 | amaury.forgeotdarc | set | status: open -> pending nosy: + amaury.forgeotdarc messages: + msg93079 resolution: wont fix |
| 2009-09-24 14:00:34 | fenner | set | messages: + msg93075 versions: + Python 2.5 |
| 2009-09-24 13:54:07 | fenner | create | |