This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Faster UTF-16 encoding
Type: performance
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, asvetlov, ezio.melotti, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-06-07 13:56 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name | Uploaded by | Date
encode-utf16.patch | serhiy.storchaka | 2012-06-07 13:56
encode-utf16-2.patch | serhiy.storchaka | 2012-06-15 19:35
Messages (11)
msg162473 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-06-07 13:56
As a companion to issue14624, here is a patch that speeds up UTF-16 encoding several times over. In addition, it fixes an unsafe integer overflow check.
Here are the benchmark results. See the benchmark tools in the https://bitbucket.org/storchaka/cpython-stuff repository.
On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:
Py2.7 Py3.2 Py3.3 patched benchmark
457 (+575%) 458 (+573%) 1077 (+186%) 3083 encode utf-16le 'A'*10000
457 (+579%) 493 (+529%) 1084 (+186%) 3102 encode utf-16le '\x80'*10000
489 (+534%) 458 (+577%) 1081 (+187%) 3102 encode utf-16le '\x80'+'A'*9999
457 (+1261%) 493 (+1161%) 1116 (+457%) 6219 encode utf-16le '\u0100'*10000
489 (+1266%) 458 (+1358%) 1126 (+493%) 6678 encode utf-16le '\u0100'+'A'*9999
489 (+1263%) 458 (+1355%) 1129 (+490%) 6666 encode utf-16le '\u0100'+'\x80'*9999
457 (+1240%) 493 (+1142%) 1118 (+448%) 6125 encode utf-16le '\u8000'*10000
489 (+1271%) 458 (+1363%) 1127 (+495%) 6702 encode utf-16le '\u8000'+'A'*9999
489 (+1271%) 458 (+1364%) 1129 (+494%) 6705 encode utf-16le '\u8000'+'\x80'*9999
489 (+1135%) 458 (+1218%) 1136 (+432%) 6038 encode utf-16le '\u8000'+'\u0100'*9999
498 (+128%) 505 (+125%) 630 (+80%) 1137 encode utf-16le '\U00010000'*10000
489 (+35%) 458 (+44%) 360 (+83%) 659 encode utf-16le '\U00010000'+'A'*9999
489 (+35%) 458 (+44%) 359 (+84%) 660 encode utf-16le '\U00010000'+'\x80'*9999
489 (+36%) 458 (+45%) 361 (+84%) 663 encode utf-16le '\U00010000'+'\u0100'*9999
489 (+36%) 458 (+45%) 361 (+84%) 663 encode utf-16le '\U00010000'+'\u8000'*9999
447 (+507%) 493 (+450%) 1086 (+150%) 2712 encode utf-16be 'A'*10000
447 (+513%) 493 (+456%) 1080 (+154%) 2739 encode utf-16be '\x80'*10000
489 (+458%) 458 (+496%) 1079 (+153%) 2729 encode utf-16be '\x80'+'A'*9999
447 (+498%) 494 (+441%) 1118 (+139%) 2672 encode utf-16be '\u0100'*10000
489 (+464%) 458 (+502%) 1128 (+144%) 2756 encode utf-16be '\u0100'+'A'*9999
489 (+463%) 458 (+502%) 1131 (+144%) 2755 encode utf-16be '\u0100'+'\x80'*9999
447 (+500%) 493 (+444%) 1119 (+139%) 2680 encode utf-16be '\u8000'*10000
489 (+463%) 458 (+502%) 1126 (+145%) 2755 encode utf-16be '\u8000'+'A'*9999
489 (+464%) 458 (+502%) 1129 (+144%) 2757 encode utf-16be '\u8000'+'\x80'*9999
489 (+479%) 458 (+518%) 1137 (+149%) 2829 encode utf-16be '\u8000'+'\u0100'*9999
499 (+102%) 506 (+99%) 630 (+60%) 1009 encode utf-16be '\U00010000'*10000
489 (+6%) 458 (+13%) 360 (+44%) 519 encode utf-16be '\U00010000'+'A'*9999
489 (+6%) 458 (+13%) 359 (+44%) 518 encode utf-16be '\U00010000'+'\x80'*9999
489 (+6%) 458 (+13%) 361 (+44%) 519 encode utf-16be '\U00010000'+'\u0100'*9999
489 (+6%) 458 (+13%) 361 (+44%) 519 encode utf-16be '\U00010000'+'\u8000'*9999
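For readers who want comparable numbers without the linked repository, here is a minimal sketch using the standard `timeit` module. It is not the tool used above: `encode_rate` is a hypothetical helper, and it reports encoded characters per second, which may differ from the unit of the tables (the thread does not state it).

```python
import timeit


def encode_rate(text, encoding, number=200, repeat=3):
    """Return the best observed rate, in encoded characters per second.

    Hypothetical helper for illustration; not part of the benchmark
    tools referenced in the thread.
    """
    timings = timeit.repeat(lambda: text.encode(encoding),
                            number=number, repeat=repeat)
    return len(text) * number / min(timings)


# A few of the test strings from the tables above.
CASES = {
    "'A'*10000": 'A' * 10000,
    "'\\u0100'*10000": '\u0100' * 10000,
    "'\\U00010000'*10000": '\U00010000' * 10000,
}

if __name__ == "__main__":
    for encoding in ('utf-16le', 'utf-16be'):
        for label, text in CASES.items():
            rate = encode_rate(text, encoding)
            print(f"{rate / 1e6:10.1f} Mchars/s  encode {encoding} {label}")
```

Taking the minimum over several repeats reduces the effect of scheduling noise, which matters when comparing builds.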
msg162701 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-06-13 09:37
Here are results under 64-bit Linux on a Core i5-2500K:
3.3 patched benchmark
3327 (+360%) 15304 encode utf-16le 'A'*10000
3314 (+335%) 14413 encode utf-16le '\x80'*10000
3315 (+578%) 22472 encode utf-16le '\x80'+'A'*9999
2390 (+668%) 18345 encode utf-16le '\u0100'*10000
2390 (+668%) 18364 encode utf-16le '\u0100'+'A'*9999
2324 (+684%) 18219 encode utf-16le '\u0100'+'\x80'*9999
2385 (+664%) 18227 encode utf-16le '\u8000'*10000
2390 (+669%) 18383 encode utf-16le '\u8000'+'A'*9999
2390 (+663%) 18232 encode utf-16le '\u8000'+'\x80'*9999
2385 (+601%) 16708 encode utf-16le '\u8000'+'\u0100'*9999
1601 (-4%) 1542 encode utf-16le '\U00010000'*10000
1209 (+20%) 1448 encode utf-16le '\U00010000'+'A'*9999
1210 (+20%) 1447 encode utf-16le '\U00010000'+'\x80'*9999
1209 (+20%) 1446 encode utf-16le '\U00010000'+'\u0100'*9999
1209 (+20%) 1446 encode utf-16le '\U00010000'+'\u8000'*9999
3237 (+562%) 21422 encode utf-16be 'A'*10000
3294 (+500%) 19779 encode utf-16be '\x80'*10000
3290 (+357%) 15036 encode utf-16be '\x80'+'A'*9999
2382 (+209%) 7354 encode utf-16be '\u0100'*10000
2381 (+208%) 7342 encode utf-16be '\u0100'+'A'*9999
2377 (+209%) 7347 encode utf-16be '\u0100'+'\x80'*9999
2382 (+207%) 7317 encode utf-16be '\u8000'*10000
2381 (+208%) 7343 encode utf-16be '\u8000'+'A'*9999
2376 (+209%) 7343 encode utf-16be '\u8000'+'\x80'*9999
2377 (+206%) 7281 encode utf-16be '\u8000'+'\u0100'*9999
1598 (-42%) 930 encode utf-16be '\U00010000'*10000
1208 (+19%) 1436 encode utf-16be '\U00010000'+'A'*9999
1208 (+19%) 1436 encode utf-16be '\U00010000'+'\x80'*9999
1205 (+19%) 1434 encode utf-16be '\U00010000'+'\u0100'*9999
1205 (+19%) 1433 encode utf-16be '\U00010000'+'\u8000'*9999
msg162822 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-06-14 20:29
Thank you, Antoine.
> 3327 (+360%) 15304 encode utf-16le 'A'*10000
> 3314 (+335%) 14413 encode utf-16le '\x80'*10000
> 3290 (+357%) 15036 encode utf-16be '\x80'+'A'*9999
This must be measurement fluctuation (on the order of 30-40%). The same code
is used for all UCS1 strings.
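As background for "UCS1", here is a small sketch of how Python 3.3+ (PEP 393) picks a storage width from the highest code point in a string. `storage_kind` is a hypothetical helper written for illustration; it mirrors the rule, not the C implementation.

```python
def storage_kind(text):
    """Return the PEP 393 storage width in bytes per character (1, 2 or 4).

    Hypothetical helper: the width depends only on the highest code point.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1   # UCS1 (Latin-1 range, includes pure ASCII)
    if highest < 0x10000:
        return 2   # UCS2 (Basic Multilingual Plane)
    return 4       # UCS4 (contains at least one non-BMP character)


if __name__ == "__main__":
    for sample in ('A' * 10, '\x80' + 'A' * 9, '\u0100' + 'A' * 9,
                   '\U00010000' + 'A' * 9):
        print(ascii(sample[:2]), storage_kind(sample))
```

By this rule, the three quoted benchmark strings are all UCS1, so the same encoder code should handle them.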
> 1598 (-42%) 930 encode utf-16be '\U00010000'*10000
This is most likely a fluctuation too. The code for non-BMP characters
differs from the code for other characters in a UCS4 string, but a 1.5x
difference is unlikely. Is this result reproducible?
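To illustrate why non-BMP characters need a separate code path, here is a pure-Python reference sketch of UTF-16-LE encoding. It is not the patch's C code: `encode_utf16le` is a hypothetical name, and lone surrogates in the input are not rejected the way the built-in codec rejects them.

```python
def encode_utf16le(text):
    """Encode text as UTF-16-LE, one code point at a time.

    Illustrative sketch only; unlike the built-in codec, it does not
    reject lone surrogates in the input.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: a single 16-bit code unit.
            out += cp.to_bytes(2, 'little')
        else:
            # Non-BMP character: split into a surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            out += high.to_bytes(2, 'little') + low.to_bytes(2, 'little')
    return bytes(out)


if __name__ == "__main__":
    sample = 'A\u0100\U00010000'
    print(encode_utf16le(sample) == sample.encode('utf-16le'))
```

A BMP character always produces two bytes, while a non-BMP character produces four and needs extra arithmetic, so the two cases cannot share one straight copy loop.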
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:
Py2.7 Py3.2 Py3.3 patched benchmark
273 (+229%) 274 (+227%) 333 (+169%) 897 encode utf-16le 'A'*10000
274 (+226%) 275 (+225%) 334 (+168%) 894 encode utf-16le '\x80'*10000
274 (+231%) 275 (+230%) 334 (+172%) 908 encode utf-16le '\x80'+'A'*9999
273 (+752%) 275 (+746%) 276 (+743%) 2326 encode utf-16le '\u0100'*10000
274 (+695%) 275 (+692%) 276 (+689%) 2177 encode utf-16le '\u0100'+'A'*9999
274 (+739%) 275 (+736%) 276 (+733%) 2300 encode utf-16le '\u0100'+'\x80'*9999
274 (+739%) 275 (+736%) 276 (+733%) 2298 encode utf-16le '\u8000'*10000
274 (+697%) 274 (+697%) 276 (+691%) 2184 encode utf-16le '\u8000'+'A'*9999
274 (+741%) 274 (+741%) 277 (+731%) 2303 encode utf-16le '\u8000'+'\x80'*9999
274 (+770%) 275 (+767%) 276 (+764%) 2384 encode utf-16le '\u8000'+'\u0100'*9999
279 (+51%) 279 (+51%) 217 (+94%) 422 encode utf-16le '\U00010000'*10000
274 (+6%) 274 (+6%) 162 (+79%) 290 encode utf-16le '\U00010000'+'A'*9999
274 (+6%) 274 (+6%) 162 (+79%) 290 encode utf-16le '\U00010000'+'\x80'*9999
273 (+5%) 275 (+5%) 162 (+78%) 288 encode utf-16le '\U00010000'+'\u0100'*9999
274 (+5%) 275 (+5%) 162 (+78%) 288 encode utf-16le '\U00010000'+'\u8000'*9999
274 (+152%) 275 (+151%) 334 (+107%) 690 encode utf-16be 'A'*10000
274 (+154%) 275 (+153%) 334 (+109%) 697 encode utf-16be '\x80'*10000
274 (+152%) 275 (+151%) 333 (+108%) 691 encode utf-16be '\x80'+'A'*9999
274 (+146%) 275 (+145%) 276 (+145%) 675 encode utf-16be '\u0100'*10000
274 (+146%) 275 (+145%) 276 (+145%) 675 encode utf-16be '\u0100'+'A'*9999
274 (+145%) 275 (+144%) 276 (+143%) 671 encode utf-16be '\u0100'+'\x80'*9999
274 (+145%) 275 (+144%) 276 (+143%) 672 encode utf-16be '\u8000'*10000
275 (+147%) 275 (+147%) 276 (+146%) 680 encode utf-16be '\u8000'+'A'*9999
274 (+146%) 275 (+145%) 276 (+144%) 674 encode utf-16be '\u8000'+'\x80'*9999
275 (+143%) 275 (+143%) 276 (+142%) 667 encode utf-16be '\u8000'+'\u0100'*9999
279 (+26%) 279 (+26%) 217 (+62%) 351 encode utf-16be '\U00010000'*10000
274 (-2%) 275 (-3%) 162 (+65%) 268 encode utf-16be '\U00010000'+'A'*9999
274 (-2%) 275 (-3%) 162 (+65%) 268 encode utf-16be '\U00010000'+'\x80'*9999
274 (-4%) 275 (-4%) 162 (+63%) 264 encode utf-16be '\U00010000'+'\u0100'*9999
274 (-3%) 275 (-4%) 162 (+64%) 265 encode utf-16be '\U00010000'+'\u8000'*9999
msg162924 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-06-15 17:34
Serhiy, the tests crash here in debug mode:
$ ./python -m test -v test_unicode
== CPython 3.3.0a4+ (default:b17c8005e08a+, Jun 15 2012, 19:28:56) [GCC 4.5.2]
== Linux-2.6.38.8-desktop-10.mga-x86_64-with-mandrake-1-Official little-endian
== /home/antoine/cpython/default/build/test_python_2567
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1)
[1/1] test_unicode
test_formatter_field_name_split (test.test_unicode.StringModuleTest) ... ok
test_formatter_parser (test.test_unicode.StringModuleTest) ... ok
test___contains__ (test.test_unicode.UnicodeTest) ... ok
test_additional_rsplit (test.test_unicode.UnicodeTest) ... ok
test_additional_split (test.test_unicode.UnicodeTest) ... ok
test_ascii (test.test_unicode.UnicodeTest) ... ok
test_aswidechar (test.test_unicode.UnicodeTest) ... ok
test_aswidecharstring (test.test_unicode.UnicodeTest) ... ok
test_bug1001011 (test.test_unicode.UnicodeTest) ... ok
test_bytes_comparison (test.test_unicode.UnicodeTest) ... ok
test_capitalize (test.test_unicode.UnicodeTest) ... ok
test_casefold (test.test_unicode.UnicodeTest) ... ok
test_center (test.test_unicode.UnicodeTest) ... ok
test_codecs (test.test_unicode.UnicodeTest) ... python: Objects/unicodeobject.c:5401: _PyUnicode_EncodeUTF16: Assertion `(Py_uintptr_t)(((((((((PyObject*)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)) ? (void) (0) : __assert_fail ("((((((PyObject*)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)", "Objects/unicodeobject.c", 5401, __PRETTY_FUNCTION__)), (((PyBytesObject *)(v))->ob_sval)) & 1 == 0' failed.
Fatal Python error: Aborted
Current thread 0x00007faa4980e700:
 File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1443 in test_codecs
 File "/home/antoine/cpython/default/Lib/unittest/case.py", line 385 in _executeTestPart
 File "/home/antoine/cpython/default/Lib/unittest/case.py", line 440 in run
 File "/home/antoine/cpython/default/Lib/unittest/case.py", line 492 in __call__
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
 File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
 File "/home/antoine/cpython/default/Lib/unittest/runner.py", line 168 in run
 File "/home/antoine/cpython/default/Lib/test/support.py", line 1383 in _run_suite
 File "/home/antoine/cpython/default/Lib/test/support.py", line 1417 in run_unittest
 File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1954 in test_main
 File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 1237 in runtest_inner
 File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 918 in runtest
 File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 710 in main
 File "/home/antoine/cpython/default/Lib/test/__main__.py", line 13 in <module>
 File "/home/antoine/cpython/default/Lib/runpy.py", line 75 in _run_code
 File "/home/antoine/cpython/default/Lib/runpy.py", line 162 in _run_module_as_main
Abandon
msg162929 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-06-15 19:35
> Serhiy, the tests crash here in debug mode:
My fault. It's an operator precedence issue in the assert expression. GCC
warns about it:
Objects/unicodeobject.c: In function ‘_PyUnicode_EncodeUTF16’:
Objects/unicodeobject.c:5401: warning: suggest parentheses around comparison in operand of ‘&’
Here is a fixed patch.
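The precedence problem can be sketched as follows. In C, `==` binds tighter than `&`, so the failing assertion's `ptr & 1 == 0` is read as `ptr & (1 == 0)`. Python orders these operators the other way round, so this sketch spells out the C grouping explicitly; both function names are hypothetical.

```python
def check_as_parsed_by_c(address):
    """Emulate how C groups `address & 1 == 0`: address & (1 == 0)."""
    # In C, `1 == 0` evaluates to 0, so the whole expression is always 0
    # (false) and the assertion fails for every address.
    return (address & int(1 == 0)) != 0


def check_as_intended(address):
    """The intended alignment test: is the address 2-byte aligned?"""
    return (address & 1) == 0


if __name__ == "__main__":
    for address in (0x1000, 0x1001):
        print(hex(address), check_as_parsed_by_c(address),
              check_as_intended(address))
```

This matches the reported behavior: the assertion aborts in debug builds regardless of the actual alignment of the buffer, and adding parentheses around the `&` operand restores the intended test.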
msg162930 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-06-15 20:18
New changeset acca141fda80 by Antoine Pitrou in branch 'default':
Issue #15026: utf-16 encoding is now significantly faster (up to 10x).
http://hg.python.org/cpython/rev/acca141fda80 
msg162931 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-06-15 20:19
Thank you for the quick turnaround! The patch is now pushed in 3.3.
msg162933 - Author: STINNER Victor (vstinner) (Python committer) Date: 2012-06-15 20:21
It would be nice to mention the improvement in the What's New in Python 3.3 doc (Optimizations section).
msg162934 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-06-15 20:25
New changeset 35667fc5f785 by Antoine Pitrou in branch 'default':
Mention the UTF-16 encoding speedup in the whatsnew (issue #15026).
http://hg.python.org/cpython/rev/35667fc5f785 
msg162960 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-06-16 08:43
Thank you for pushing. :-) Are you interested in a faster UTF-32 codec?
msg162961 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-06-16 09:03
> Thank you for pushing. :-) Are you interested in a faster UTF-32 codec?
Not much :) I know you posted issues on that, but I think UTF-32 is
quite low priority.
History
Date | User | Action | Args
2022-04-11 14:57:31 | admin | set | github: 59231
2012-06-16 09:03:30 | pitrou | set | messages: + msg162961
2012-06-16 08:43:11 | serhiy.storchaka | set | messages: + msg162960
2012-06-15 20:25:25 | python-dev | set | messages: + msg162934
2012-06-15 20:21:43 | vstinner | set | messages: + msg162933
2012-06-15 20:19:14 | pitrou | set | status: open -> closed; resolution: fixed; messages: + msg162931; stage: resolved
2012-06-15 20:18:32 | python-dev | set | nosy: + python-dev; messages: + msg162930
2012-06-15 19:35:12 | serhiy.storchaka | set | files: + encode-utf16-2.patch; messages: + msg162929
2012-06-15 17:34:47 | pitrou | set | messages: + msg162924
2012-06-14 20:29:52 | serhiy.storchaka | set | messages: + msg162822
2012-06-13 09:37:49 | pitrou | set | messages: + msg162701
2012-06-07 13:56:13 | serhiy.storchaka | create |
