This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2014年06月05日 11:02 by ncoghlan, last changed 2022年04月11日 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| issue21667_clarify_str_specification.rst | ncoghlan, 2014年06月05日 11:29 | Add indexing note, remove inaccurate references to "characters" | review | |
| Messages (20) | |||
|---|---|---|---|
| msg219793 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:02 | |
Based on the recent python-dev thread, I propose the following "CPython implementation detail" note in the "Strings" entry of https://docs.python.org/3/reference/datamodel.html#objects-values-and-types "CPython currently guarantees O(1) access to arbitrary code points when indexing and slicing a string. Python implementations are required to index and slice strings as arrays of code points, but are not required to guarantee O(1) access to arbitrary locations within the string. This allows implementations to use variable width encodings for their internal string representation." |
|||
| msg219794 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014年06月05日 11:05 | |
str[a:b] returns a substring (characters), not an array of code points (numbers). |
|||
| msg219795 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:06 | |
Guido, I think we need your call on whether or not to add a note about string indexing algorithmic complexity to the language reference, and to approve the exact wording of such a note (my proposed wording is in my initial comment on this issue). |
|||
| msg219796 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:08 | |
No, Python doesn't expose Unicode characters in its data model at all, except in those cases where a code point happens to correspond directly with a character. A length 1 str instance represents a Unicode code point, not a Unicode character. |
|||
| msg219797 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:09 | |
Although, you're right, that section of the data model docs misuses the word "character" to mean something other than what it means in the Unicode spec :( |
|||
| msg219798 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014年06月05日 11:10 | |
"Python implementations are required to ..." By the way, Python < 3.3 doesn't implement this requirement :-) |
|||
| msg219799 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:17 | |
Saying that ord() and chr() switch between characters and code points is just plain wrong, since characters may be represented as multiple code points. We may also want to explicitly note that the Unicode normalisation is implementation dependendent, and that CPython doesn't normalise implicitly except where specifically noted (i.e. during lexical analysis). |
|||
| msg219800 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:18 | |
Right, narrow builds have long been broken - that's a large part of why this is now the requirement :) |
|||
| msg219801 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:29 | |
Patch attached that also addresses the characters vs code points confusion. |
|||
| msg219802 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 11:29 | |
I ducked the Unicode normalisation question for now, since that's a *different* can of worms :) |
|||
| msg219805 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年06月05日 12:02 | |
Two things:
- I don't think it's very helpful to use the term "code point" without explaining or introducing it ("character" at least can be understood intuitively)
- The mention of slicing is ambiguous: is slicing suppoded to be O(1)? how is indexing related to slicing?
|
|||
| msg219809 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月05日 12:34 | |
If someone doesn't understand what "Unicode code point" means, that's going to be the least of their problems when it comes to implementing a conformant Python implementation. We could link to http://unicode.org/glossary/#code_point, but that doesn't really add much beyond "value from 0 to 0x10FFFF". If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc). The main advantage of using the more formal "code point" over the informal "character" is that it discourages people from assuming they know what they are (with the usual mistaken assumption being that Unicode code points correspond directly to glyphs the way ASCII and Extended ASCII printable characters correspond to their glyphs). The rest of the paragraph then provides the mechanical details of the meaningful interpretations of them in Python (as length 1 strings and as numbers in a particular range) and the operations for translating between those two formats (chr and ord). Fair point about the slicing - it may be better to just talk about indexing. |
|||
| msg219810 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年06月05日 12:37 | |
Not sure what implementing a conformant Python implementation has to do with this; the language specification should be readable by any interested programmers, IMO. > If you try to dive into the formal Unicode spec instead, you end up > in a twisty maze of definitions of things that are all closely > related, but generally not the same thing (code positions, code units, > code spaces, abstract characters, glyphs, graphemes, etc). That makes all the less useful to use the "proper term" instead of the more intuitive alternative :-) |
|||
| msg219811 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年06月05日 12:45 | |
Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc. |
|||
| msg219812 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014年06月05日 12:49 | |
> Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc. That would be awesome :-) Please open a new issue for that. We can use for example these data: https://wiki.python.org/moin/TimeComplexity Such data should be close to the definition of the type, or even close to the method. Maybe directly in the documentation of each method? |
|||
| msg219824 - (view) | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2014年06月05日 16:32 | |
I don't want the O(1) property explicitly denounced in the reference manual. It's fine if the manual is silent on this -- maybe someone can prove that it isn't a problem based on benchmarks of an alternate implementation, but until then, I'm skeptical -- after all we have a bunch of important APIs (indexing, slicing, find()/index(), the re module) that use integer indexes, and some important algorithms/patterns are based off this behavior. E.g. searching a string for something, returning the position where it's found, and then continuing the search from that position. Even if the basic search uses find() or regex matching, there's still a position being returned and accepted, and if it took O(N) time to find that position in the representation, the whole algorithm could degenerate from O(N) to O(N**2). I am fine with the changes related to code points. For the pedants amongst us, surrogates are also code points. A surrogate pair is two code points that encode a single code point. Fortunately we don't have to deal with those any more outside codecs. |
|||
| msg219936 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年06月07日 13:22 | |
New changeset 6ffb6909c439 by Nick Coghlan in branch '3.4': Issue #21667: Clarify string data model description http://hg.python.org/cpython/rev/6ffb6909c439 New changeset 7c120e77d6f7 by Nick Coghlan in branch 'default': Merge issue #21667 from 3.4 http://hg.python.org/cpython/rev/7c120e77d6f7 |
|||
| msg219937 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年06月07日 13:26 | |
I've merged the character->code point clarifications, without the implementation detail section. For the time being, that leaves "doesn't provide O(1) indexing of strings" as the kind of discrepancy that often makes an appearance in "differences from the CPython reference implementation" section that many alternative implementations include. |
|||
| msg220208 - (view) | Author: Jim Jewett (Jim.Jewett) * (Python triager) | Date: 2014年06月10日 23:02 | |
I think the new wording is an improvement, but keeping the changes minimal left it in an awkward in-between state. Proposal: A string is a sequence of Unicode code points. Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined. Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``. The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse. :meth:`str.encode` provides a concrete representation (as :class:`bytes` in the given text encoding) suitable for transport and communication with non-Python utilities. :meth:`bytes.decode` decodes such byte sequences into text strings. |
|||
| msg220211 - (view) | Author: Jim Jewett (Jim.Jewett) * (Python triager) | Date: 2014年06月10日 23:20 | |
And even my rewrite showed path dependency; a slight further improvement is to re-order encoding ahead of bytes. I also added a paragraph that I hope answers the speed issue. Proposal: A string is a sequence of Unicode code points. Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined. Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``. The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse. :meth:`str.encode` provides a concrete representation (in the given text encoding) as a :class:`bytes` object suitable for transport and communication with non-Python utilities. :meth:`bytes.decode` decodes such byte sequences into text strings. .. impl-detail:: There are no methods exposing the internal representation of code points within a string. While the C-API provides some additional constraints on CPython, other implementations are free to use any representation that treats code points (as opposed to either code units or some normalized form of characters) as the unit of measure. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:58:04 | admin | set | github: 65866 |
| 2014年06月10日 23:20:46 | Jim.Jewett | set | messages: + msg220211 |
| 2014年06月10日 23:02:39 | Jim.Jewett | set | nosy:
+ Jim.Jewett messages: + msg220208 |
| 2014年06月07日 13:26:31 | ncoghlan | set | status: open -> closed type: enhancement messages: + msg219937 resolution: later stage: resolved |
| 2014年06月07日 13:22:28 | python-dev | set | nosy:
+ python-dev messages: + msg219936 |
| 2014年06月05日 16:32:42 | gvanrossum | set | messages: + msg219824 |
| 2014年06月05日 12:51:05 | Rosuav | set | nosy:
+ Rosuav |
| 2014年06月05日 12:49:03 | vstinner | set | messages: + msg219812 |
| 2014年06月05日 12:45:52 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg219811 |
| 2014年06月05日 12:37:41 | pitrou | set | messages: + msg219810 |
| 2014年06月05日 12:34:08 | ncoghlan | set | messages: + msg219809 |
| 2014年06月05日 12:02:59 | pitrou | set | nosy:
+ pitrou messages: + msg219805 |
| 2014年06月05日 11:29:45 | ncoghlan | set | messages: + msg219802 |
| 2014年06月05日 11:29:06 | ncoghlan | set | files:
+ issue21667_clarify_str_specification.rst messages: + msg219801 |
| 2014年06月05日 11:18:43 | ncoghlan | set | messages: + msg219800 |
| 2014年06月05日 11:17:54 | ncoghlan | set | messages: + msg219799 |
| 2014年06月05日 11:10:40 | vstinner | set | messages: + msg219798 |
| 2014年06月05日 11:09:49 | ncoghlan | set | messages: + msg219797 |
| 2014年06月05日 11:08:38 | ncoghlan | set | messages: + msg219796 |
| 2014年06月05日 11:06:08 | ncoghlan | set | nosy:
+ gvanrossum messages: + msg219795 |
| 2014年06月05日 11:05:12 | vstinner | set | nosy:
+ vstinner messages: + msg219794 |
| 2014年06月05日 11:02:22 | ncoghlan | create | |