This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: sys.sizeof test fails with wide unicode
Type: behavior
Stage:
Components: Interpreter Core
Versions: Python 3.0, Python 2.6

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: schuppenies
Nosy List: amaury.forgeotdarc, benjamin.peterson, georg.brandl, lemburg, pitrou, schuppenies
Priority: critical
Keywords: patch

Created on 2008-06-12 22:13 by benjamin.peterson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
maxunicode.patch | schuppenies, 2008-06-13 19:41 | Patch against 2.6 trunk, revision 64230
Py_UNICODE.patch | schuppenies, 2008-06-13 19:42 | Patch against 2.6 trunk, revision 64230
Py_UNICODE_SIZEOF.patch | schuppenies, 2008-06-15 16:45 | Patch against 2.6 trunk, revision 64296
Messages (23)
msg68102 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-12 22:13
test test_sys failed -- Traceback (most recent call last):
  File "/temp/python/trunk/Lib/test/test_sys.py", line 549, in test_specialtypes
    size2=basicsize + sys.getsizeof(str(s)))
  File "/temp/python/trunk/Lib/test/test_sys.py", line 429, in check_sizeof
    self.assertEqual(result, size2, msg + str(size2))
AssertionError: wrong size for <type 'unicode'>: got 28, expected 50.5109328552
msg68104 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-12 22:19
It was recommended by Georg that you expose Py_UNICODE_SIZE in the
_testcapi, since the size is not consistent across all platforms.
msg68138 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 09:04
Are there any buildbots running with the "--enable-unicode=ucs4" option?
Just curious.
msg68141 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-13 09:21
I'm sure there wasn't any a few months ago.
msg68159 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 13:59
Do you really need to expose Py_UNICODE_SIZE? There is already
sys.maxunicode, unless I'm missing something.
msg68160 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-13 14:09
It is true that sys.maxunicode reflects whether the build is using UCS-2
or UCS-4; however, the size of Py_UNICODE is not fixed by that, look at
unicodeobject.h.
(Though I don't think we have platforms that actually *do* use sizes
other than 2 or 4, so we can of course be sloppy.)
msg68177 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 19:42
sys.maxunicode is well defined to be either 0xFFFF for UCS-2
or 0x10FFFF for UCS-4 (see PyUnicode_GetMax).
Py_UNICODE_SIZE is set in pyconfig.h to be either 2 or 4 during
configuration. When >= 4, Py_UNICODE_WIDE is set which again influences
sys.maxunicode.
Thus, it currently is possible to derive Py_UNICODE_SIZE from
sys.maxunicode. But it takes some indirections.
So here are 2 possible patches, one which exposes Py_UNICODE_SIZE via
_testcapi and one which assumes that sys.maxunicode reflects UCS-X
settings. Since I am a fairly new Python developer, and given the new
4-eyes-per-commit policy for the beta phase, please decide which patch
should be applied.
msg68178 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-13 19:50
Personally, I prefer the one with _testcapi.Py_UNICODE_SIZE because it
is safe against future changes, but wait for someone else's opinion.
msg68179 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 19:51
It's actually very easy:
Py_UNICODE is a 2-byte value for UCS-2 builds and 4 byte value for UCS-4
builds of Python.
print ((sys.maxunicode < 66000) and 'UCS2' or 'UCS4')
tells you which one you have.
Note that you should *not* use the exact value of 0x10FFFF for UCS-4 -
it's possible that the Unicode consortium decides to add more planes to
the Universal Character Set... (though not likely).
The above comparison is good enough to detect the number of bytes in a
single code point, though.
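The threshold check above can be sketched as a small helper. This is only an illustration: the function name `code_unit_width` is hypothetical and not part of any patch attached to this issue. Note also that on Python 3.3 and later `sys.maxunicode` is always 0x10FFFF, so the helper reports 4 there regardless of the build.

```python
import sys


def code_unit_width(maxunicode=sys.maxunicode):
    """Guess the width in bytes of one code unit from sys.maxunicode.

    Hypothetical helper. It compares against a threshold instead of the
    exact value 0x10FFFF, following the advice in the message above.
    """
    return 2 if maxunicode < 66000 else 4


# Report the build type in the same spirit as the one-liner above.
print("UCS%d" % code_unit_width())
```

Passing the value as a parameter makes both branches easy to exercise, for example `code_unit_width(0xFFFF)` for a narrow build.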
msg68180 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 19:54
BTW: Here's another trick you can use:
print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
(for Py2.x)
msg68181 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 19:56
Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
up being more than 4 if the native int type is itself larger than 32
bits; although the latter is probably quite rare (64-bit platforms are
usually either LP64 or LLP64).
However, Py_UNICODE.patch is wrong in that it uses Py_UNICODE_SIZE
rather than sizeof(Py_UNICODE). Py_UNICODE_SIZE itself is always either
2 or 4.
msg68182 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 20:18
On 2008-06-13 21:56, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
> up being more than 4 if the native int type is itself larger than 32
> bits; although the latter is probably quite rare (64-bit platforms are
> usually either LP64 or LLP64).
AFAIK, only Crays have this problem, but apart from that: I'd consider
it a bug if sizeof(Py_UCS4) != 4.
msg68183 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 20:32
On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
> AFAIK, only Crays have this problem, but apart from that: I'd consider
> it a bug if sizeof(Py_UCS4) != 4.
Perhaps a #error can be added to that effect?
Something like (untested):
#if SIZEOF_INT == 4 
typedef unsigned int Py_UCS4; 
#elif SIZEOF_LONG == 4
typedef unsigned long Py_UCS4; 
#else
#error Could not find a 4-byte integer type for Py_UCS4, aborting
#endif
(of course we could also try harder to find an appropriate type, but I'm
no specialist in C integer variations)
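The selection logic of the preprocessor chain above can be mirrored at runtime with `ctypes`, which makes it easy to see which branch a given platform would take. This is a rough sketch for illustration only: the function name `pick_ucs4_type` is hypothetical, and the real decision would be made at compile time in unicodeobject.h, not in Python.

```python
import ctypes


def pick_ucs4_type():
    """Return the first native unsigned type that is exactly 4 bytes wide.

    Hypothetical helper mirroring the #if / #elif / #else chain:
    unsigned int is tried first, then unsigned long, otherwise fail.
    """
    for candidate in (ctypes.c_uint, ctypes.c_ulong):
        if ctypes.sizeof(candidate) == 4:
            return candidate
    # Corresponds to the #error branch.
    raise RuntimeError("Could not find a 4-byte integer type for Py_UCS4")


ucs4_type = pick_ucs4_type()
print(ucs4_type.__name__, ctypes.sizeof(ucs4_type))
```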
msg68184 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 21:01
I think you're right that sizeof(Py_UNICODE) is the correct value to
use. But could you please explain to me how PY_UNICODE_TYPE is set, I
cannot find it.
Also, len(u'\0'.encode('unicode-internal')) does not work for Py3.0.
Any suggestion how this information can be retrieved in py3k?
msg68185 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-13 21:21
I believe PY_UNICODE_TYPE is set by configure in pyconfig.h.
msg68186 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 21:59
Found it, thanks. Wrong use of grep :|
msg68231 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-15 13:18
If I understand configure correctly, PY_UNICODE_TYPE is only set when
a type matching the size of $unicode_size is found. And this is set to
either 2 or 4. Thus, sizeof(Py_UNICODE) should always return 2 or 4.
If you agree, I would suggest using the method proposed by Marc in
msg68179.
msg68234 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-15 13:39
On Sunday 15 June 2008 at 13:18 +0000, Robert Schuppenies wrote:
> If I understand configure correctly, PY_UNICODE_TYPE is only set when
> a type matching the size of $unicode_size is found. And this is set to
> either 2 or 4.
But if PY_UNICODE_TYPE is not set in configure, unicodeobject.h tries to
settle on a default value. Which turns out to be Py_UCS4 in UCS4 builds:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l86
And Py_UCS4 itself will be larger than 4 bytes if the platform's int
size is larger than that:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l119
So if you want to be 100% correct, you should use
sizeof(PY_UNICODE_TYPE) (or sizeof(Py_UNICODE), which is the same). If
you don't want to, sys.maxunicode is sufficient :-)
msg68242 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-15 16:45
Correct is good, so here is a patch which exposes the size of
Py_UNICODE via _testcapi.
msg68251 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-15 20:49
Looks good to me.
msg68265 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-16 09:57
On 2008-06-13 22:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
>> AFAIK, only Crays have this problem, but apart from that: I'd consider
>> it a bug if sizeof(Py_UCS4) != 4.
> 
> Perhaps a #error can be added to that effect?
> Something like (untested):
> 
> #if SIZEOF_INT == 4 
> typedef unsigned int Py_UCS4; 
> #elif SIZEOF_LONG == 4
> typedef unsigned long Py_UCS4; 
> #else
> #error Could not find a 4-byte integer type for Py_UCS4, aborting
> #endif
Sounds good !
> (of course we could also try harder to find an appropriate type, but I'm
> no specialist in C integer variations)
Python should really try to use uint32_t as fallback solution for
UCS4 where available (and uint16_t for UCS2).
We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to
configure:
http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types
and could then use
typedef uint32_t Py_UCS4
and
typedef uint16_t Py_UCS2
Note that the code for supporting UCS2/UCS4 is not really all that
clean. It was a quick sprint between Martin and Fredrik and appears
to be only half-done... e.g. there currently is no Py_UCS2.
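To illustrate why fixed-width types are attractive as a fallback, the sketch below compares the sizes of the native C integer types with those of the fixed-width ones as seen through `ctypes`. It is an illustration only; the dictionary names are hypothetical, and the actual fix would live in configure and unicodeobject.h.

```python
import ctypes

# Native C types: their sizes follow the platform's data model.
native_sizes = {
    "unsigned short": ctypes.sizeof(ctypes.c_ushort),
    "unsigned int": ctypes.sizeof(ctypes.c_uint),
    "unsigned long": ctypes.sizeof(ctypes.c_ulong),
}

# Fixed-width types: the same size on every platform that provides them.
fixed_sizes = {
    "uint16_t": ctypes.sizeof(ctypes.c_uint16),
    "uint32_t": ctypes.sizeof(ctypes.c_uint32),
}

for name, size in sorted(native_sizes.items()):
    print("%-14s %d bytes (platform dependent)" % (name, size))
for name, size in sorted(fixed_sizes.items()):
    print("%-14s %d bytes (fixed)" % (name, size))
```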
msg68271 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-16 16:21
On 2008-06-13 21:54, Marc-Andre Lemburg wrote:
> BTW: Here's another trick you can use:
> 
> print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
> 
> (for Py2.x)
... and for Py3.x:
print(len(u'\0'.encode('unicode-internal')))
There's really no need to drop to C to get at sizeof(Py_UNICODE).
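For readers trying this on a current interpreter: the 'unicode-internal' codec was deprecated in Python 3.3 and removed in Python 3.8, so the trick only works on older versions. The sketch below wraps it defensively; the function name `py_unicode_size` is hypothetical, and returning None when the codec is missing is an assumption made for this example.

```python
import warnings


def py_unicode_size():
    """Return the byte length of one internally encoded character.

    Hypothetical helper. Returns None when the 'unicode-internal' codec
    is not available (it was removed in Python 3.8).
    """
    try:
        with warnings.catch_warnings():
            # Python 3.3 to 3.7 emit a DeprecationWarning for this codec.
            warnings.simplefilter("ignore")
            return len("\0".encode("unicode-internal"))
    except LookupError:
        return None


print("sizeof(Py_UNICODE) =", py_unicode_size())
```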
msg68312 - Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-17 10:34
I followed Marc's advice and checked in a corrected test.
Besides, I opened a new issue to address the fallback solution for
UCS4 and UCS2 (see issue3130).
History
Date | User | Action | Args
2022-04-11 14:56:35 | admin | set | github: 47348
2009-04-27 01:10:42 | ajaksu2 | link | issue3130 dependencies
2008-06-17 10:34:08 | schuppenies | set | status: open -> closed; resolution: fixed; messages: + msg68312
2008-06-16 16:21:42 | lemburg | set | messages: + msg68271
2008-06-16 09:57:19 | lemburg | set | messages: + msg68265
2008-06-15 20:49:59 | georg.brandl | set | messages: + msg68251
2008-06-15 16:45:56 | schuppenies | set | files: + Py_UNICODE_SIZEOF.patch; messages: + msg68242
2008-06-15 13:39:09 | pitrou | set | messages: + msg68234
2008-06-15 13:18:53 | schuppenies | set | messages: + msg68231
2008-06-13 21:59:36 | schuppenies | set | messages: + msg68186
2008-06-13 21:21:37 | benjamin.peterson | set | messages: + msg68185
2008-06-13 21:01:08 | schuppenies | set | messages: + msg68184
2008-06-13 20:32:41 | pitrou | set | messages: + msg68183
2008-06-13 20:18:22 | lemburg | set | messages: + msg68182
2008-06-13 19:56:47 | pitrou | set | messages: + msg68181
2008-06-13 19:54:53 | lemburg | set | messages: + msg68180
2008-06-13 19:51:41 | lemburg | set | nosy: + lemburg; messages: + msg68179
2008-06-13 19:50:41 | benjamin.peterson | set | messages: + msg68178
2008-06-13 19:42:05 | schuppenies | set | files: + Py_UNICODE.patch; messages: + msg68177
2008-06-13 19:41:27 | schuppenies | set | files: + maxunicode.patch; keywords: + patch
2008-06-13 14:09:09 | georg.brandl | set | nosy: + georg.brandl; messages: + msg68160
2008-06-13 13:59:54 | pitrou | set | nosy: + pitrou; messages: + msg68159
2008-06-13 09:21:18 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg68141
2008-06-13 09:04:52 | schuppenies | set | messages: + msg68138
2008-06-12 22:19:43 | benjamin.peterson | set | messages: + msg68104
2008-06-12 22:13:16 | benjamin.peterson | create |
