This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: update PEP 393 (match implementation)
Type:
Stage: patch review
Components: Documentation
Versions: Python 3.3

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Jim.Jewett, docs@python, ezio.melotti, jcea, loewis, vstinner
Priority: normal
Keywords: patch

Created on 2011-12-15 04:25 by Jim.Jewett, last changed 2022-04-11 14:57 by admin.

Files
File name | Uploaded | Description
pep-0393.txt.patch | Jim.Jewett, 2011-12-15 04:25 | updated PEP 393, patch format
pep-0393.txt | Jim.Jewett, 2011-12-15 04:27 | updated PEP 393, updated version only
pep-0393.txt | Jim.Jewett, 2011-12-15 21:15 | updated to reflect feedback
pep-0393.txt | Jim.Jewett, 2011-12-16 00:34 | replacement text
pep-0393v20111215.patch | Jim.Jewett, 2011-12-16 00:38 | diff of latest against current hg
pep-0393.txt | Jim.Jewett, 2011-12-16 13:50 | updated to reflect Martin's answers
pep-0393_20111216.txt.patch | Jim.Jewett, 2011-12-16 13:52 | diff of latest against current hg
Messages (9)
msg149497 - Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2011-12-15 04:25
The implementation has a larger state.kind.
Clarified wording on wstr_length and surrogate pairs.
Clarified that the canonical "data" format doesn't always have a data pointer.
Mentioned that calling PyUnicode_READY would finalize a string, so that it couldn't be resized.
Changed section head "Other macros" to "Finalization macro" and removed the non-existent PyUnicode_CONVERT_BYTES (there is a similarly named private macro).
msg149558 - Author: STINNER Victor (vstinner) (Python committer) Date: 2011-12-15 14:03
Various comments on PEP 393 and your patch.
"For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out."
and
"For compatibility, redundant representations may be computed."
I never understood this statement: in most cases, PyUnicode_READY() replaces the Py_UNICODE* (wchar_t*) representation by a compact representation.
PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do reallocate a Py_UNICODE* string for a ready string, but I don't think that it is a usual use case. 
PyUnicode_AS_UNICODE() & friends are usually only used to build strings. So this issue should be documented in a section different than the Abstract, maybe in a Limitations section.
So even if a third party module uses the legacy Unicode API, PEP 393 will still optimize the memory usage thanks to implicit calls to PyUnicode_READY() (done everywhere in Python source code).
In the current code, the most common case where a string has two representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode() is used to encode arguments for the Windows Unicode API, and PyUnicode_AsUnicode() keeps the result in the wstr attribute.
Note: there is also the utf8 attribute which may contain a third representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old _PyUnicode_AsString()) is called.
"Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length)."
They can also be created by PyUnicode_FromUnicode().
"Resizing a Unicode string remains possible until it is finalized, generally by calling PyUnicode_READY."
I changed PyUnicode_Resize(): it is now *always* possible to resize a string. The change was required because some decoders overallocate the string, and then resize after decoding the input.
The sentence can be simply removed.
+ + 000 => str is not initialized (data are in wstr)
+ + 001 => 1 byte (Latin-1)
+ + 010 => 2 byte (UCS-2)
+ + 100 => 4 byte (UCS-4)
+ + Other values are reserved at this time.
I don't like binary numbers; I would prefer decimal numbers here. Binary was maybe useful when we used bit masks, but we now use the C "unsigned int field:bit_size;" trick for a nicer API. With the new values, it is even easier to remember them:
 1 byte <=> kind=1
 2 bytes <=> kind=2
 4 bytes <=> kind=4
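As a rough illustration of the decimal kind values, the sketch below picks the narrowest kind that can hold a string's largest code point. `canonical_kind` is a hypothetical helper written for this comment, not a CPython API; it only mirrors the 1/2/4 mapping listed above.

```python
def canonical_kind(text):
    """Return 1, 2 or 4: bytes per character needed for the widest code point.

    Hypothetical helper for illustration only; not part of CPython.
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:      # fits in 1 byte (Latin-1)
        return 1
    if highest <= 0xFFFF:    # fits in 2 bytes (UCS-2)
        return 2
    return 4                 # needs 4 bytes (UCS-4)


for sample in ("abc", "caf\u00e9", "\u0100", "\U0001F600"):
    print(ascii(sample), canonical_kind(sample))
```

The return value doubles as the per-character size in bytes, which is the practical advantage of decimal values over the old bit patterns.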
"[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString, which is removed"
_PyUnicode_AsString() does still exist and is still heavily used (66 calls). It is not documented as deprecated in What's New in Python 3.3 (but it is a private function, so nobody uses it, right?).
"This section summarizes the API additions."
PyUnicode_IS_ASCII() is missing.
PyUnicode_CHARACTER_SIZE() has been removed (use kind directly).
UCS4 utility functions:
Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr} have been removed.
"The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects: ...
... PyUnicode_WriteChar ...."
PyUnicode_WriteChar() allows modifying an immutable object, which is something specific to CPython. Well, the function now raises an error if the string is no longer modifiable (e.g. more than 1 reference to the string, the hash has already been computed, etc.), but I don't know if it should be added to the stable ABI.
"PyUnicode_AsUnicodeAndSize"
This function was added to Python 3.3 and is immediately deprecated. Why add a function only to deprecate it? Were PyUnicode_AsUnicode() and PyUnicode_GET_SIZE() not enough?
"Deprecations, Removals, and Incompatibilities"
Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp
--
A very important point is not well explained: a ("final") string must be in its canonical representation. This means, for example, that a UCS2 string must contain at least one character greater than U+00FF. Not only some optimizations but also some core methods of the Unicode type rely on the canonical representation.
I tried to list all properties of Unicode objects in the definition of the PyASCIIObject structure, and I implemented checks in _PyUnicode_CheckConsistency(). This function is only available in debug mode.
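The effect of the canonical representation can be observed from Python, at least on CPython 3.3 or later. The sketch below assumes that `sys.getsizeof` reflects the compact storage of a str, and estimates the bytes used per character from the size difference between two lengths. `bytes_per_char` is a name invented here, and the results are CPython-specific.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing strings of length n and 2n.

    Assumes CPython 3.3+; `ch` must be a single character.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ("a", "\u00ff", "\u0100", "\U00010000"):
    print(ascii(ch), bytes_per_char(ch))

# Slicing away the only wide character should give a narrow string again,
# because the result is stored in its canonical (smallest) form.
wide = "\u0100" + "a" * 1000
narrow = wide[1:]
print(sys.getsizeof(narrow) == sys.getsizeof("a" * 1000))
```

If the last comparison ever printed False, a string would be held in a wider form than its content requires, which is the kind of inconsistency the debug-mode checks are meant to catch.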
msg149577 - Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2011-12-15 21:20
Updated to resolve most of Victor's concerns, but this meant enough changes that I'm not sure it quite counts as editorial only.
A few questions that I couldn't answer:
(1) Upon string creation, do we want to *promise* to discard the UTF-8 and wstr, so that the caller can memory manage?
(2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.
(3) I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize". Perhaps rename it with a leading underscore? Though I'm not sure it is really needed at all.
(4) I tried to reword the "for compatibility" ... "redundant" part ... but I'm not sure I resolved it.
msg149579 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2011-12-15 22:45
> PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(),
> ... do reallocate a Py_UNICODE* string for a ready string, but I
> don't think that it is a usual use case.
Define "usual". There were certainly plenty of occurrences of that
in the Python code base, and I believe that extension modules also
use it, provided they care about the content of string objects at all.
> PyUnicode_AS_UNICODE() &
> friends are usually only used to build strings.
No. They are also used to inspect them.
> So even if a third party module uses the legacy Unicode API, the PEP
> 393 will still optimize the memory usage thanks to implicit calls to
> PyUnicode_READY() (done everywhere in Python source code).
... unless they inspect a given Unicode string, in which case it
will use twice the memory (or 1.5x).
> "Resizing a Unicode string remains possible until it is finalized,
> generally by calling PyUnicode_READY."
> 
> I changed PyUnicode_Resize(): it is now *always* possible to resize a
> string. The change was required because some decoders overallocate
> the string, and then resize after decoding the input.
> 
> The sentence can be simply removed.
Well, I meant the resizing of strings that doesn't move the object
in memory (i.e. unicode_resize). You (apparently) changed its signature
to take PyUnicode_Object** (instead of PyUnicode_Object*). It's probably
irrelevant since that's a unicodeobject.c-internal function, anyway.
> "PyUnicode_AsUnicodeAndSize"
> 
> This function was added to Python 3.3 and is immediately deprecated.
> Why add a function only to deprecate it? Were PyUnicode_AsUnicode()
> and PyUnicode_GET_SIZE() not enough?
If it was not in 3.2, we should certainly remove it right away.
msg149580 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2011-12-15 22:50
> (1) Upon string creation, do we want to *promise* to discard the UTF-8 and wstr, so that the caller can memory manage?
I don't understand the question. Assuming "discards" means "releases"
here, then there is no API which releases memory during creation of
the string object - let alone that there is any promise to do so. I'm
also not aware of any candidate buffer that you might want to release.
> (2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.
That's very well possible. What's the question?
> (3) I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize". Perhaps rename it with a leading underscore? Though I'm not sure it is really needed at all.
Nobody noticed that it is born-deprecated. If it really is, it should be
removed before the release.
msg149584 - Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2011-12-16 00:34
>> So even if a third party module uses the legacy Unicode API, the PEP
>> 393 will still optimize the memory usage thanks to implicit calls to
>> PyUnicode_READY() (done everywhere in Python source code).
> ... unless they inspect a given Unicode string, in which case it
> will use twice the memory (or 1.5x).
Why is the utf-8 representation not cached when it is generated for ParseTuple et alia?
It seems like these parameters are likely to either be re-used as parameters (in which case caching makes sense) or not re-used at all (in which case, the whole string can go away).
> Well, I meant the resizing of strings that doesn't move the object
> in memory (i.e. unicode_resize).
This may easily fail because the new size can't be found at that location; wouldn't it be better to just encourage proper sizing in the first place?
>> (1) Upon string creation, do we want to *promise* to discard
>> the UTF-8 and wstr, so that the caller can memory manage?
> I don't understand the question. Assuming "discards" means
> "releases" here, then there is no API which releases memory
> during creation of the string object - let alone that there is
> any promise to do so. I'm also not aware of any candidate buffer
> that you might want to release.
When a string is created from a wchar_t array, who is responsible for releasing the original wchar_t array? As I read it now, Python doesn't release the buffer, and the caller can't because maybe Python just pointed to it as memory shared with the canonical representation. 
>> (2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp 
>> seemed to be there in the code I was looking at.
> That's very well possible. What's the question?
Victor listed them as missing. I now suspect he meant "missing from the PEP list of deprecated functions and macros", and I just misunderstood.
msg149594 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2011-12-16 05:41
> Why is the utf-8 representation not cached when it is generated for
> ParseTuple et alia?
It is.
> When a string is created from a wchar_t array, who is responsible for
> releasing the original wchar_t array?
The caller.
> As I read it now, Python
> doesn't release the buffer, and the caller can't because maybe Python
> just pointed to it as memory shared with the canonical
> representation.
But Python won't; it will always make a copy for itself.
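The copy semantics described here can be sketched from Python with ctypes standing in for a C caller. This is only an illustration under that assumption: `buf` plays the role of the caller-owned wchar_t array, and `ctypes.wstring_at` builds a str from it. The names `buf` and `text` are chosen for this example.

```python
import ctypes

# A caller-owned wchar_t buffer holding "hello" plus a terminating NUL.
buf = ctypes.create_unicode_buffer("hello")

# Build a str from the first five wchar_t values in the buffer.
text = ctypes.wstring_at(ctypes.addressof(buf), 5)

# Overwrite the caller's buffer. If the str shared memory with the buffer,
# this change would show up in `text`; with a private copy it does not.
buf[0] = "j"
print(buf.value, text)
```

Because the str keeps its own copy, the caller remains free to modify or release the original buffer afterwards.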
msg149623 - Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2011-12-16 13:50
>> Why is the utf-8 representation not cached when it is generated for
>> ParseTuple et alia?
My error -- I read something backwards.
>> When a string is created from a wchar_t array, who is responsible for
>> releasing the original wchar_t array?
> The caller.
OK, I'll document that.
>> As I read it now, Python
>> doesn't release the buffer, and the caller can't because maybe Python
>> just pointed to it as memory shared with the canonical
>> representation.
> But Python won't; it will always make a copy for itself.
I thought I found an example each way, but it is possible that the shared version was something Python had already copied. If not, I'll raise that as a separate issue to get the code changed.
(Note that I may not be able to look at this again until after Christmas, so I'm likely to go silent for a while.)
msg184148 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-03-14 08:19
What's the status of this?
History
Date | User | Action | Args
2022-04-11 14:57:24 | admin | set | github: 57813
2013-03-14 08:19:59 | ezio.melotti | set | nosy: + ezio.melotti; messages: + msg184148
2011-12-16 16:02:33 | jcea | set | nosy: + jcea
2011-12-16 13:52:38 | Jim.Jewett | set | files: + pep-0393_20111216.txt.patch
2011-12-16 13:50:19 | Jim.Jewett | set | files: + pep-0393.txt; messages: + msg149623
2011-12-16 05:41:31 | loewis | set | messages: + msg149594
2011-12-16 00:38:51 | Jim.Jewett | set | files: + pep-0393v20111215.patch
2011-12-16 00:34:30 | Jim.Jewett | set | files: + pep-0393.txt; messages: + msg149584
2011-12-15 22:50:05 | loewis | set | messages: + msg149580
2011-12-15 22:45:26 | loewis | set | messages: + msg149579
2011-12-15 21:20:48 | Jim.Jewett | set | messages: + msg149577
2011-12-15 21:15:33 | Jim.Jewett | set | files: + pep-0393.txt
2011-12-15 14:03:42 | vstinner | set | messages: + msg149558
2011-12-15 09:58:41 | pitrou | set | nosy: + loewis, vstinner; stage: patch review
2011-12-15 04:27:24 | Jim.Jewett | set | files: + pep-0393.txt; versions: + Python 3.3; nosy: + docs@python; assignee: docs@python; components: + Documentation
2011-12-15 04:25:46 | Jim.Jewett | create |
