homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Even faster UTF-8 decoding
Type: performance Stage: patch review
Components: Interpreter Core, Unicode Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012年05月26日 09:11 by serhiy.storchaka, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
decode_utf8_signed_byte.patch serhiy.storchaka, 2012年05月26日 09:11 Two's complement representation trick review
decodebench.py serhiy.storchaka, 2012年05月26日 09:13 Benchmark script
bench-diff.py serhiy.storchaka, 2012年05月26日 09:14 Benchmark results comparator
decode_utf8_range_check.patch serhiy.storchaka, 2012年05月27日 15:49 Simple range check review
decode_utf8_signed_byte-2.patch serhiy.storchaka, 2012年06月22日 22:31 review
Messages (17)
msg161652 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年05月26日 09:11
As strange as it may seem, but using a simple trick was made UTF-8 decoding even more speed up.
Here are the benchmark results.
On 32-bit Linux, AMD Athlon 64 X2:
 vanilla patched
utf-8 'A'*10000 2061 (+3%) 2115
utf-8 '\x80'*10000 383 (-7%) 355
utf-8 '\x80'+'A'*9999 1273 (+1%) 1290
utf-8 '\u0100'*10000 382 (+47%) 562
utf-8 '\u0100'+'A'*9999 1239 (+1%) 1253
utf-8 '\u0100'+'\x80'*9999 383 (+47%) 562
utf-8 '\u8000'*10000 434 (-6%) 409
utf-8 '\u8000'+'A'*9999 1245 (+1%) 1256
utf-8 '\u8000'+'\x80'*9999 382 (+47%) 560
utf-8 '\u8000'+'\u0100'*9999 383 (+44%) 553
utf-8 '\U00010000'*10000 358 (+4%) 373
utf-8 '\U00010000'+'A'*9999 1171 (+0%) 1176
utf-8 '\U00010000'+'\x80'*9999 381 (+44%) 548
utf-8 '\U00010000'+'\u0100'*9999 381 (+44%) 548
utf-8 '\U00010000'+'\u8000'*9999 404 (+0%) 406
On 32-bit Linux, Intel Atom N570:
 vanilla patched
utf-8 'A'*10000 623 (+0%) 626
utf-8 '\x80'*10000 145 (+15%) 167
utf-8 '\x80'+'A'*9999 354 (+2%) 362
utf-8 '\u0100'*10000 164 (+10%) 181
utf-8 '\u0100'+'A'*9999 343 (-0%) 342
utf-8 '\u0100'+'\x80'*9999 164 (+11%) 182
utf-8 '\u8000'*10000 175 (+5%) 183
utf-8 '\u8000'+'A'*9999 349 (+0%) 349
utf-8 '\u8000'+'\x80'*9999 164 (+11%) 182
utf-8 '\u8000'+'\u0100'*9999 164 (+10%) 181
utf-8 '\U00010000'*10000 152 (+11%) 168
utf-8 '\U00010000'+'A'*9999 313 (+0%) 313
utf-8 '\U00010000'+'\x80'*9999 161 (+11%) 179
utf-8 '\U00010000'+'\u0100'*9999 161 (+11%) 179
utf-8 '\U00010000'+'\u8000'*9999 160 (+11%) 177
msg161653 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年05月26日 09:19
I see a slight increase under 64-bit Linux with gcc 4.5.2, too:
 vanilla patched
utf-8 'A'*10000 7857 (+4%)	8210
utf-8 'A'*9999+'\x80' 5392 (+8%)	5843
utf-8 'A'*9999+'\u0100' 2119 (+3%)	2173
utf-8 'A'*9999+'\u8000' 2121 (+2%)	2172
utf-8 'A'*9999+'\U00010000' 2248 (+2%)	2293
utf-8 '\x80'*10000 1015 (+1%)	1021
utf-8 '\x80'+'A'*9999 2747 (+5%)	2877
utf-8 '\x80'*9999+'\u0100' 868 (+0%)	869
utf-8 '\x80'*9999+'\u8000' 857 (+2%)	870
utf-8 '\x80'*9999+'\U00010000' 877 (+0%)	881
utf-8 '\u0100'*10000 1016 (+16%)	1181
utf-8 '\u0100'+'A'*9999 2506 (+3%)	2592
utf-8 '\u0100'+'\x80'*9999 1015 (+16%)	1179
utf-8 '\u0100'*9999+'\u8000' 1015 (+16%)	1182
utf-8 '\u0100'*9999+'\U00010000' 875 (+13%)	992
utf-8 '\u8000'*10000 836 (+18%)	985
utf-8 '\u8000'+'A'*9999 2508 (+3%)	2588
utf-8 '\u8000'+'\x80'*9999 1015 (+16%)	1182
utf-8 '\u8000'+'\u0100'*9999 1014 (+17%)	1182
utf-8 '\u8000'*9999+'\U00010000' 767 (+17%)	894
utf-8 '\U00010000'*10000 730 (+0%)	732
utf-8 '\U00010000'+'A'*9999 2542 (+2%)	2599
utf-8 '\U00010000'+'\x80'*9999 1013 (+17%)	1182
utf-8 '\U00010000'+'\u0100'*9999 1013 (+17%)	1181
utf-8 '\U00010000'+'\u8000'*9999 727 (+0%)	728
msg161654 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年05月26日 09:20
It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?
msg161656 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年05月26日 09:52
> It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?
Yes, the patch depends on two facts -- 8-bit bytes and a two's
complement representation of integers. That's why I call it a trick.
However, today CPython will not work on other platforms. However, we can
wrap macro definition in #if/#else/#end and provide the traditional form
(but I don't remember how to test a two's complement representation in
compile time).
msg161657 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012年05月26日 10:02
The C standard says, in 6.3.1.3/3
Otherwise [*], the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
[*]: the value cannot be exactly converted, and the target type is not unsigned.
We shouldn't be using unsigned->signed conversions where the source value is out of range for the signed type.
msg161711 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年05月27日 15:49
Yes, this is an implementation-dependent behavior (and on the supported platforms it is implemented well in a certain way).
However, if the continuation byte check to do the simplest way ((ch) >= 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on AMD Athlon.
 vanilla patched
utf-8 'A'*10000 2061 (-2%) 2018
utf-8 '\x80'*10000 383 (+9%) 416
utf-8 '\x80'+'A'*9999 1273 (+3%) 1315
utf-8 '\u0100'*10000 382 (+46%) 558
utf-8 '\u0100'+'A'*9999 1239 (+0%) 1245
utf-8 '\u0100'+'\x80'*9999 383 (+46%) 558
utf-8 '\u8000'*10000 434 (-6%) 408
utf-8 '\u8000'+'A'*9999 1245 (+0%) 1245
utf-8 '\u8000'+'\x80'*9999 382 (+46%) 556
utf-8 '\u8000'+'\u0100'*9999 383 (+45%) 556
utf-8 '\U00010000'*10000 358 (+0%) 359
utf-8 '\U00010000'+'A'*9999 1171 (-0%) 1170
utf-8 '\U00010000'+'\x80'*9999 381 (+30%) 495
utf-8 '\U00010000'+'\u0100'*9999 381 (+30%) 495
utf-8 '\U00010000'+'\u8000'*9999 404 (-5%) 385
On Intel Atom the results did not change or become a little better.
 vanilla patched
utf-8 'A'*10000 623 (+3%) 642
utf-8 '\x80'*10000 145 (+9%) 158
utf-8 '\x80'+'A'*9999 354 (+4%) 367
utf-8 '\u0100'*10000 164 (+0%) 164
utf-8 '\u0100'+'A'*9999 343 (+2%) 351
utf-8 '\u0100'+'\x80'*9999 164 (+1%) 165
utf-8 '\u8000'*10000 175 (-2%) 171
utf-8 '\u8000'+'A'*9999 349 (+3%) 359
utf-8 '\u8000'+'\x80'*9999 164 (+0%) 164
utf-8 '\u8000'+'\u0100'*9999 164 (+0%) 164
utf-8 '\U00010000'*10000 152 (-1%) 150
utf-8 '\U00010000'+'A'*9999 313 (+2%) 319
utf-8 '\U00010000'+'\x80'*9999 161 (+1%) 162
utf-8 '\U00010000'+'\u0100'*9999 161 (+1%) 162
utf-8 '\U00010000'+'\u8000'*9999 160 (-2%) 156
msg161715 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年05月27日 18:37
> However, if the continuation byte check to do the simplest way ((ch) >= 
> 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on 
> AMD Athlon.
Doesn't produce any significant speedup on Intel Core i5-2500.
msg161759 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012年05月28日 06:42
> It seems the patch relies on a two's complement representation of
> integers. Mark, do you think that's ok?
(1) Relying on two's complement integers seems fine to me: we're already relying on it in other places in Python (e.g., bitwise operations for ints in Python 2.x); it seems unlikely Python's going to run into current or future hardware that uses anything else; and any special-case code for ones' complement or sign-magnitude is going to be essentially unused and awkward to test, so it's probably better not to have it in the codebase at all.
In an ideal world, I guess we'd add some configure-time tests for two's complement so that in the unlikely event that Python *does* meet a non two's complement platform the build fails early with a clear message rather than the tests failing in strange ways.
(2) The bit that Martin identifies: relying on conversion from unsigned to signed doing a wraparound modulo 2**<suitable n> is a bit more troubling; it's something that I try to avoid where possible, but that's not always easy. I don't recall ever having encountered this causing problems in practice, though---it feels like a leftover from a non-two's complement world where processors would have a hard time giving wraparound semantics, so the C standard didn't want to require it. Again, if we're going to rely on this, it would probably make sense to have some configure-time checks; and it would be better not to rely on it at all without a really good reason.
msg163501 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年06月22日 22:31
Here is a patch that uses some sort of autodetection.
msg163583 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年06月23日 11:30
Any chance to commit the patch before final feature freeze?
msg163586 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年06月23日 11:36
> Any chance to commit the patch before final feature freeze?
I'll defer to Mark :-)
msg163589 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012年06月23日 11:43
Okay, will look at this this afternoon.
msg163607 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012年06月23日 13:33
I'm happy to apply the 'decode_utf8_range_check.patch'; I'll do that unless there are objections. The code is clearer than the original, and if we get a speedup into the bargain then I don't see a reason not to apply this.
I'm less comfortable with either the original patch, or the most recent one (decode_utf8_signed_byte-2.patch).
msg163611 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012年06月23日 13:51
Serhiy, does this patch also fix #8271?
If so, can you also include the tests I wrote for it?
msg163672 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年06月23日 20:45
New changeset 3214c9ebcf5e by Mark Dickinson in branch 'default':
Issue #14923: Optimize continuation-byte check in UTF-8 decoding. Patch by Serhiy Storchaka.
http://hg.python.org/cpython/rev/3214c9ebcf5e 
msg163673 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012年06月23日 20:47
Patch applied. Closing.
Ezio: the patch is pure optimization, with no change in semantics; I don't see how it could fix #8271.
msg163675 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年06月23日 21:10
> Serhiy, does this patch also fix #8271?
No, this patch not change behavior. But updated patch for issue 8271 now
contains this patch (I hope this will help merge).
> If so, can you also include the tests I wrote for it?
Your tests included in patch for issue 8271.
History
Date User Action Args
2022年04月11日 14:57:30adminsetgithub: 59128
2012年06月23日 21:10:43serhiy.storchakasetmessages: + msg163675
2012年06月23日 20:47:20mark.dickinsonsetstatus: open -> closed
resolution: fixed
messages: + msg163673
2012年06月23日 20:45:26python-devsetmessages: + msg163672
2012年06月23日 13:51:00ezio.melottisetmessages: + msg163611
2012年06月23日 13:33:24mark.dickinsonsetmessages: + msg163607
2012年06月23日 11:43:32mark.dickinsonsetassignee: mark.dickinson
messages: + msg163589
2012年06月23日 11:36:11pitrousetmessages: + msg163586
2012年06月23日 11:30:38serhiy.storchakasetmessages: + msg163583
2012年06月22日 22:31:57serhiy.storchakasetfiles: + decode_utf8_signed_byte-2.patch

messages: + msg163501
2012年05月28日 06:42:50mark.dickinsonsetmessages: + msg161759
2012年05月27日 18:37:04pitrousetmessages: + msg161715
2012年05月27日 15:49:48serhiy.storchakasetfiles: + decode_utf8_range_check.patch

messages: + msg161711
2012年05月26日 10:02:09loewissetmessages: + msg161657
2012年05月26日 09:52:09serhiy.storchakasetmessages: + msg161656
2012年05月26日 09:20:52pitrousetstage: commit review -> patch review
2012年05月26日 09:20:38pitrousetmessages: + msg161654
stage: commit review
2012年05月26日 09:19:46pitrousetmessages: + msg161653
2012年05月26日 09:14:53serhiy.storchakasetfiles: + bench-diff.py
2012年05月26日 09:13:51serhiy.storchakasetfiles: + decodebench.py
2012年05月26日 09:11:07serhiy.storchakacreate

AltStyle によって変換されたページ (->オリジナル) /