[Python-Dev] PEP 393 review
Victor Stinner
victor.stinner at haypocalc.com
Tue Aug 30 00:20:46 CEST 2011
On Monday 29 August 2011 21:34:48, you wrote:
> >> Those haven't been ported to the new API, yet. Consider, for example,
> >> d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test;
> >> with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this
> >> is a 25% speedup for PEP 393.
> >
> > If I understand correctly, performance now depends heavily on which
> > characters are used? A pure ASCII string is faster than a string with
> > characters from the ISO-8859-1 charset?
> How did you infer that from the above paragraph??? ASCII and Latin-1 are
> mostly identical in terms of performance - the ASCII decoder should be
> slightly slower than the Latin-1 decoder, since the ASCII decoder needs
> to check for errors, whereas the Latin-1 decoder will never be
> confronted with errors.
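For reference, I do understand the difference you describe. Roughly sketched
in C (illustrative only, not the actual CPython decoder loops):

#include <stddef.h>
#include <string.h>

/* ASCII: every input byte must be checked (valid ASCII is < 0x80),
   so the copy loop carries a branch and can fail. */
static ptrdiff_t
decode_ascii(const unsigned char *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] >= 0x80)
            return -1;              /* error: not ASCII */
        out[i] = in[i];
    }
    return (ptrdiff_t)n;
}

/* Latin-1: all 256 byte values map 1:1 to U+0000..U+00FF, so no
   error is possible and the decode is a plain copy. */
static ptrdiff_t
decode_latin1(const unsigned char *in, size_t n, unsigned char *out)
{
    memcpy(out, in, n);
    return (ptrdiff_t)n;
}
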
But I'm not comparing the ASCII and ISO-8859-1 decoders. I was asking whether
decoding b'abc' from ISO-8859-1 is faster than decoding b'ab\xff' from
ISO-8859-1, and if so: why?
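In concrete terms, the two calls I am comparing are these (a sketch using the
public PyUnicode_DecodeLatin1(); it assumes an initialized interpreter, and
the timing loop is left out):

#include <Python.h>

/* Same length, same decoder; the only difference is that one input
   is pure ASCII and the other contains a non-ASCII byte. */
void
compare_inputs(void)
{
    PyObject *ascii_only = PyUnicode_DecodeLatin1("abc", 3, "strict");
    PyObject *non_ascii  = PyUnicode_DecodeLatin1("ab\xff", 3, "strict");
    /* Question: is building ascii_only measurably faster than
       building non_ascii, and if so, where does the time go? */
    Py_XDECREF(ascii_only);
    Py_XDECREF(non_ascii);
}
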
Your patch replaces PyUnicode_New(size, 255) ... memcpy() with
PyUnicode_FromUCS1(). I don't understand how that makes Python faster:
PyUnicode_FromUCS1() first scans the input string for the maximum code
point.
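As I understand it, that scan looks roughly like this (a sketch;
scan_max_char_ucs1() is a hypothetical helper, not a real API):

#include <stddef.h>

/* For UCS1 input the only decision is ASCII (<= 0x7F) versus
   Latin-1 (<= 0xFF), so the scan can stop at the first byte >= 0x80. */
static unsigned char
scan_max_char_ucs1(const unsigned char *s, size_t n)
{
    unsigned char maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] > maxchar) {
            maxchar = s[i];
            if (maxchar >= 0x80)
                break;      /* already know the answer is Latin-1 */
        }
    }
    return maxchar;
}

That is one pass more than PyUnicode_New() + memcpy(), so by itself it should
cost more, not less.
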
I suppose the main difference is that, when all characters of the string are
ASCII, the string decoded from ISO-8859-1 stores its data so that it doubles
as the UTF-8 encoded form (a shared pointer). In that case, encoding the
string to UTF-8 costs nothing: we already have the result.
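Roughly like this (a hypothetical layout for illustration, not the real
PyASCIIObject / PyCompactUnicodeObject structs):

#include <stddef.h>

/* Hypothetical string layout, for illustration only. */
typedef struct {
    unsigned char *data;    /* canonical 1-byte (UCS1) representation */
    size_t         length;
    char          *utf8;    /* cached UTF-8; may alias `data` */
    size_t         utf8_length;
} str_sketch;

/* Pure-ASCII bytes are already valid UTF-8, so the cached UTF-8
   pointer can simply alias the canonical data; non-ASCII Latin-1
   strings need a real UTF-8 encoding pass later. */
static void
cache_utf8_if_ascii(str_sketch *s, int is_ascii)
{
    if (is_ascii) {
        s->utf8 = (char *)s->data;
        s->utf8_length = s->length;
    }
    else {
        s->utf8 = NULL;     /* encode lazily on first request */
        s->utf8_length = 0;
    }
}
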
Am I correct?
Victor