Created on 2008-06-12 22:13 by benjamin.peterson, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| maxunicode.patch | schuppenies, 2008-06-13 19:41 | Patch against 2.6 trunk, revision 64230 |
| Py_UNICODE.patch | schuppenies, 2008-06-13 19:42 | Patch against 2.6 trunk, revision 64230 |
| Py_UNICODE_SIZEOF.patch | schuppenies, 2008-06-15 16:45 | Patch against 2.6 trunk, revision 64296 |
| Messages (23) | |||
|---|---|---|---|
| msg68102 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-06-12 22:13 | |
test test_sys failed -- Traceback (most recent call last):
  File "/temp/python/trunk/Lib/test/test_sys.py", line 549, in test_specialtypes
    size2=basicsize + sys.getsizeof(str(s)))
  File "/temp/python/trunk/Lib/test/test_sys.py", line 429, in check_sizeof
    self.assertEqual(result, size2, msg + str(size2))
AssertionError: wrong size for <type 'unicode'>: got 28, expected 50.5109328552
|
|||
| msg68104 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-06-12 22:19 | |
Georg recommended that you expose Py_UNICODE_SIZE in _testcapi, since the size is not consistent across all platforms. |
|||
| msg68138 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-13 09:04 | |
Are there any buildbots running with the "--enable-unicode=ucs4" option? Just curious. |
|||
| msg68141 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2008-06-13 09:21 | |
I'm sure there wasn't any a few months ago. |
|||
| msg68159 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-06-13 13:59 | |
Do you really need to expose Py_UNICODE_SIZE? There is already sys.maxunicode, unless I'm missing something. |
|||
| msg68160 | Author: Georg Brandl (georg.brandl) * (Python committer) | Date: 2008-06-13 14:09 | |
It is true that sys.maxunicode reflects whether the build is using UCS-2 or UCS-4; however, the size of Py_UNICODE is not fixed by that (see unicodeobject.h). Though I don't think we have platforms that actually *do* use sizes other than 2 or 4, so we can of course be sloppy. |
|||
| msg68177 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-13 19:42 | |
sys.maxunicode is well defined to be either 0xFFFF for UCS-2 or 0x10FFFF for UCS-4 (see PyUnicode_GetMax). Py_UNICODE_SIZE is set in pyconfig.h to be either 2 or 4 during configuration. When it is >= 4, Py_UNICODE_WIDE is set, which in turn influences sys.maxunicode. Thus, it is currently possible to derive Py_UNICODE_SIZE from sys.maxunicode, but only through some indirection.
So here are 2 possible patches: one which exposes Py_UNICODE_SIZE via _testcapi and one which assumes that sys.maxunicode reflects the UCS-X setting. Since I am a fairly new Python developer, and given the new 4-eyes-per-commit policy for the beta phase, please decide which patch should be applied. |
|||
| msg68178 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-06-13 19:50 | |
Personally, I prefer the one with _testcapi.Py_UNICODE_SIZE because it is safe against future changes, but wait for someone else's opinion. |
|||
| msg68179 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2008-06-13 19:51 | |
It's actually very easy: Py_UNICODE is a 2-byte value for UCS-2 builds and a 4-byte value for UCS-4 builds of Python.
print ((sys.maxunicode < 66000) and 'UCS2' or 'UCS4')
tells you which one you have.
Note that you should *not* use the exact value of 0x10FFFF for UCS-4: it's possible (though not likely) that the Unicode consortium decides to add more planes to the Universal Character Set. The above comparison is good enough to detect the number of bytes in a single code point, though. |
|||
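The comparison from msg68179 can be wrapped in a small helper. This is only a sketch: the function name `unicode_build_label` is hypothetical, and the 66000 threshold is taken from the message above rather than from any official constant. On Python 3.3 and later, sys.maxunicode is always 0x10FFFF, so the check is only informative on older builds.

```python
import sys


def unicode_build_label(maxunicode=sys.maxunicode):
    """Return 'UCS2' or 'UCS4' based on the largest supported code point.

    Hypothetical helper: it compares against a loose threshold (66000)
    instead of the exact value 0x10FFFF, as suggested in msg68179.
    """
    return 'UCS2' if maxunicode < 66000 else 'UCS4'


if __name__ == '__main__':
    print(unicode_build_label())
```

Passing the value as a parameter keeps the comparison testable without needing both kinds of build at hand.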
| msg68180 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2008-06-13 19:54 | |
BTW: Here's another trick you can use:
print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
(for Py2.x)
|
|||
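For readers trying the 'unicode-internal' trick on a current interpreter: that codec was deprecated in Python 3.3 and removed in 3.8, so the one-liner raises LookupError there. The sketch below tries the codec first and otherwise falls back to the size of wchar_t via ctypes. The helper name `py_unicode_size` is hypothetical, and the fallback rests on the assumption that Py_UNICODE is a typedef of wchar_t, which holds for Python 3.3 and later but not for the 2.6 and 3.0 builds discussed in this issue.

```python
import ctypes


def py_unicode_size():
    """Best-effort size in bytes of one Py_UNICODE unit (hypothetical helper)."""
    try:
        # Works on Python 2.x and on 3.3-3.7 (with a DeprecationWarning on 3.x).
        return len(u'\0'.encode('unicode-internal'))
    except LookupError:
        # Codec is gone (Python 3.8+). Assumption: Py_UNICODE == wchar_t.
        return ctypes.sizeof(ctypes.c_wchar)


if __name__ == '__main__':
    print('sizeof(Py_UNICODE) =', py_unicode_size())
```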
| msg68181 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-06-13 19:56 | |
Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end up being more than 4 if the native int type is itself larger than 32 bits; although the latter is probably quite rare (64-bit platforms are usually either LP64 or LLP64). However, Py_UNICODE.patch is wrong in that it uses Py_UNICODE_SIZE rather than sizeof(Py_UNICODE). Py_UNICODE_SIZE itself is always either 2 or 4. |
|||
| msg68182 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2008-06-13 20:18 | |
On 2008-06-13 21:56, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
> up being more than 4 if the native int type is itself larger than 32
> bits; although the latter is probably quite rare (64-bit platforms are
> usually either LP64 or LLP64).
AFAIK, only Crays have this problem, but apart from that: I'd consider it a bug if sizeof(Py_UCS4) != 4.
|
|||
| msg68183 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-06-13 20:32 | |
On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
> AFAIK, only Crays have this problem, but apart from that: I'd consider
> it a bug if sizeof(Py_UCS4) != 4.
Perhaps a #error can be added to that effect? Something like (untested):
    #if SIZEOF_INT == 4
    typedef unsigned int Py_UCS4;
    #elif SIZEOF_LONG == 4
    typedef unsigned long Py_UCS4;
    #else
    #error Could not find a 4-byte integer type for Py_UCS4, aborting
    #endif
(of course we could also try harder to find an appropriate type, but I'm no specialist in C integer variations)
|
|||
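The selection logic in the proposed #if / #elif / #error chain can be mirrored at runtime with ctypes, which is handy for seeing what it would pick on a given platform. This is an illustration only, not the preprocessor code itself: the names `CANDIDATES` and `find_four_byte_type` are hypothetical, and ctypes reports the sizes for the C compiler that built the running interpreter.

```python
import ctypes

# Candidate C types, in the same order as the proposed #if / #elif chain.
CANDIDATES = (
    ("unsigned int", ctypes.c_uint),
    ("unsigned long", ctypes.c_ulong),
)


def find_four_byte_type(candidates=CANDIDATES):
    """Return the name of the first candidate type that is exactly 4 bytes.

    Raises RuntimeError when no candidate matches, which plays the role
    of the #error branch in the C sketch.
    """
    for c_name, c_type in candidates:
        if ctypes.sizeof(c_type) == 4:
            return c_name
    raise RuntimeError("Could not find a 4-byte integer type for Py_UCS4")


if __name__ == '__main__':
    print(find_four_byte_type())
```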
| msg68184 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-13 21:01 | |
I think you're right that sizeof(Py_UNICODE) is the correct value to
use. But could you please explain to me how PY_UNICODE_TYPE is set? I
cannot find it.
Also, len(u'\0'.encode('unicode-internal')) does not work for Py3.0.
Any suggestion on how this information can be retrieved in py3k?
|
|||
| msg68185 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-06-13 21:21 | |
I believe PY_UNICODE_TYPE is set by configure in pyconfig.h. |
|||
| msg68186 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-13 21:59 | |
Found it, thanks. Wrong use of grep :| |
|||
| msg68231 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-15 13:18 | |
If I understand configure correctly, PY_UNICODE_TYPE is only set when a type matching the size of $unicode_size is found. And this is set to either 2 or 4. Thus, sizeof(Py_UNICODE) should always return 2 or 4. If you agree, I would suggest using the method proposed by Marc in msg68179. |
|||
| msg68234 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-06-15 13:39 | |
On Sunday 15 June 2008 at 13:18 +0000, Robert Schuppenies wrote:
> If I understand configure correctly, PY_UNICODE_TYPE is only set when
> a type matching the size of $unicode_size is found. And this is set to
> either 2 or 4.
But if PY_UNICODE_TYPE is not set in configure, unicodeobject.h tries to settle on a default value, which turns out to be Py_UCS4 in UCS4 builds:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l86
And Py_UCS4 itself will be larger than 4 bytes if the platform's int size is larger than that:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l119
So if you want to be 100% correct, you should use sizeof(PY_UNICODE_TYPE) (or sizeof(Py_UNICODE), which is the same). If you don't want to, sys.maxunicode is sufficient :-)
|
|||
| msg68242 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-15 16:45 | |
Correct is good, so here is a patch which exposes the size of Py_UNICODE via _testcapi. |
|||
| msg68251 | Author: Georg Brandl (georg.brandl) * (Python committer) | Date: 2008-06-15 20:49 | |
Looks good to me. |
|||
| msg68265 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2008-06-16 09:57 | |
On 2008-06-13 22:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
>> AFAIK, only Crays have this problem, but apart from that: I'd consider
>> it a bug if sizeof(Py_UCS4) != 4.
>
> Perhaps a #error can be added to that effect?
> Something like (untested):
>
> #if SIZEOF_INT == 4
> typedef unsigned int Py_UCS4;
> #elif SIZEOF_LONG == 4
> typedef unsigned long Py_UCS4;
> #else
> #error Could not find a 4-byte integer type for Py_UCS4, aborting
> #endif
Sounds good !
> (of course we could also try harder to find an appropriate type, but I'm
> no specialist in C integer variations)
Python should really try to use uint32_t as fallback solution for UCS4 where available (and uint16_t for UCS2). We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to configure:
http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types
and could then use typedef uint32_t Py_UCS4 and typedef uint16_t Py_UCS2.
Note that the code for supporting UCS2/UCS4 is not really all that clean. It was a quick sprint between Martin and Fredrik and appears to be only half-done, e.g. there currently is no Py_UCS2.
|
|||
| msg68271 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2008-06-16 16:21 | |
On 2008-06-13 21:54, Marc-Andre Lemburg wrote:
> BTW: Here's another trick you can use:
>
> print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
>
> (for Py2.x)
... and for Py3.x:
print(len(u'\0'.encode('unicode-internal')))
There's really no need to drop to C to get at sizeof(Py_UNICODE).
|
|||
| msg68312 | Author: Robert Schuppenies (schuppenies) * (Python committer) | Date: 2008-06-17 10:34 | |
I followed Marc's advice and checked in a corrected test. I also opened a new issue to address the fallback solution for UCS4 and UCS2 (see issue3130). |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:35 | admin | set | github: 47348 |
| 2009-04-27 01:10:42 | ajaksu2 | link | issue3130 dependencies |
| 2008-06-17 10:34:08 | schuppenies | set | status: open -> closed; resolution: fixed; messages: + msg68312 |
| 2008-06-16 16:21:42 | lemburg | set | messages: + msg68271 |
| 2008-06-16 09:57:19 | lemburg | set | messages: + msg68265 |
| 2008-06-15 20:49:59 | georg.brandl | set | messages: + msg68251 |
| 2008-06-15 16:45:56 | schuppenies | set | files: + Py_UNICODE_SIZEOF.patch; messages: + msg68242 |
| 2008-06-15 13:39:09 | pitrou | set | messages: + msg68234 |
| 2008-06-15 13:18:53 | schuppenies | set | messages: + msg68231 |
| 2008-06-13 21:59:36 | schuppenies | set | messages: + msg68186 |
| 2008-06-13 21:21:37 | benjamin.peterson | set | messages: + msg68185 |
| 2008-06-13 21:01:08 | schuppenies | set | messages: + msg68184 |
| 2008-06-13 20:32:41 | pitrou | set | messages: + msg68183 |
| 2008-06-13 20:18:22 | lemburg | set | messages: + msg68182 |
| 2008-06-13 19:56:47 | pitrou | set | messages: + msg68181 |
| 2008-06-13 19:54:53 | lemburg | set | messages: + msg68180 |
| 2008-06-13 19:51:41 | lemburg | set | nosy: + lemburg; messages: + msg68179 |
| 2008-06-13 19:50:41 | benjamin.peterson | set | messages: + msg68178 |
| 2008-06-13 19:42:05 | schuppenies | set | files: + Py_UNICODE.patch; messages: + msg68177 |
| 2008-06-13 19:41:27 | schuppenies | set | files: + maxunicode.patch; keywords: + patch |
| 2008-06-13 14:09:09 | georg.brandl | set | nosy: + georg.brandl; messages: + msg68160 |
| 2008-06-13 13:59:54 | pitrou | set | nosy: + pitrou; messages: + msg68159 |
| 2008-06-13 09:21:18 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg68141 |
| 2008-06-13 09:04:52 | schuppenies | set | messages: + msg68138 |
| 2008-06-12 22:19:43 | benjamin.peterson | set | messages: + msg68104 |
| 2008-06-12 22:13:16 | benjamin.peterson | create | |