
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Optimize UTF-8 decoder with error handlers
Type: performance Stage:
Components: Unicode Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2015-10-02 14:44 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name            Uploaded                      Description
utf8_decoder.patch   vstinner, 2015-10-03 00:01
bench.py             vstinner, 2015-10-04 08:21
Messages (6)
msg252117 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-02 14:44
Issue #24870 optimized the ASCII decoder with error handlers:
New changeset 3c430259873e by Victor Stinner in branch 'default':
Issue #24870: Optimize the ASCII decoder for error handlers: surrogateescape,
https://hg.python.org/cpython/rev/3c430259873e
We should also optimize the UTF-8 decoder with error handlers.
I will work on a patch in the next few days.
msg252181 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-03 00:01
Here is a first patch. It is written to keep the best performance for valid UTF-8 encoded strings, while speeding up the decoding of strings with a few undecodable bytes.
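For readers unfamiliar with the error handlers in question, the following minimal sketch shows what each one produces for input containing one undecodable byte. The sample bytes and the helper name `decode_with` are illustrative assumptions, not taken from the patch.

```python
# Illustrative only: how the error handlers discussed in this issue treat
# an undecodable byte. 0xff can never appear in valid UTF-8.
data = b"abc\xffdef"  # assumed sample input, not from the patch


def decode_with(handler):
    """Decode the sample bytes as UTF-8 with the named error handler."""
    return data.decode("utf-8", errors=handler)


HANDLERS = ("ignore", "replace", "surrogateescape", "backslashreplace")
results = {name: decode_with(name) for name in HANDLERS}

for name, text in results.items():
    print(f"{name:17} {text!r}")

# The default handler, strict, raises instead of producing a string.
strict_error = None
try:
    decode_with("strict")
except UnicodeDecodeError as exc:
    strict_error = exc
    print(f"{'strict':17} raises {type(exc).__name__}")
```

Note that surrogateescape is the only handler here that is lossless: encoding the result back with the same handler restores the original bytes.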
msg252264 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-04 08:30
Results of the microbenchmark on the UTF-8 decoder.
As expected, performance on valid UTF-8 is unchanged, which was an important goal for me.
Decoding with the error handlers optimized by the patch is *much* faster.
backslashreplace is still slow because I didn't optimize it.
Common platform:
Python unicode implementation: PEP 393
Timer: time.perf_counter
Platform: Linux-4.1.5-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer precision: 55 ns
Platform of campaign before:
SCM: hg revision=f51921883f50 tag=tip branch=default date="2015-10-04 01:19 -0400"
Python version: 3.6.0a0 (default:f51921883f50, Oct 4 2015, 10:19:37) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Date: 2015-10-04 10:19:44
Platform of campaign after:
SCM: hg revision=f51921883f50+ tag=tip branch=default date="2015-10-04 01:19 -0400"
Python version: 3.6.0a0 (default:f51921883f50+, Oct 4 2015, 10:14:05) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Date: 2015-10-04 10:18:55
---------------------+-------------+--------
valid UTF-8 (strict) | before      | after
---------------------+-------------+--------
100 x 10**1 bytes    | 297 ns (*)  | 297 ns
100 x 10**3 bytes    | 7.4 us (*)  | 7.44 us
100 x 10**2 bytes    | 929 ns (*)  | 924 ns
100 x 10**4 bytes    | 80.4 us (*) | 80.4 us
---------------------+-------------+--------
Total                | 89.1 us (*) | 89 us
---------------------+-------------+--------
------------------+-------------+---------------
ignore            | before      | after
------------------+-------------+---------------
100 x 10**1 bytes | 6.68 us (*) | 743 ns (-89%)
100 x 10**3 bytes | 561 us (*)  | 42.6 us (-92%)
100 x 10**2 bytes | 56.8 us (*) | 4.55 us (-92%)
100 x 10**4 bytes | 6.02 ms (*) | 425 us (-93%)
------------------+-------------+---------------
Total             | 6.65 ms (*) | 473 us (-93%)
------------------+-------------+---------------
------------------+-------------+---------------
replace           | before      | after
------------------+-------------+---------------
100 x 10**1 bytes | 7.61 us (*) | 890 ns (-88%)
100 x 10**3 bytes | 639 us (*)  | 50.3 us (-92%)
100 x 10**2 bytes | 64.8 us (*) | 5.37 us (-92%)
100 x 10**4 bytes | 7.09 ms (*) | 505 us (-93%)
------------------+-------------+---------------
Total             | 7.81 ms (*) | 561 us (-93%)
------------------+-------------+---------------
------------------+-------------+---------------
surrogateescape   | before      | after
------------------+-------------+---------------
100 x 10**1 bytes | 7.96 us (*) | 855 ns (-89%)
100 x 10**3 bytes | 674 us (*)  | 50.2 us (-93%)
100 x 10**2 bytes | 68.8 us (*) | 5.35 us (-92%)
100 x 10**4 bytes | 7.38 ms (*) | 504 us (-93%)
------------------+-------------+---------------
Total             | 8.13 ms (*) | 560 us (-93%)
------------------+-------------+---------------
------------------+-------------+--------
backslashreplace  | before      | after
------------------+-------------+--------
100 x 10**1 bytes | 7.66 us (*) | 7.89 us
100 x 10**3 bytes | 633 us (*)  | 633 us
100 x 10**2 bytes | 64.1 us (*) | 64.6 us
100 x 10**4 bytes | 6.9 ms (*)  | 6.93 ms
------------------+-------------+--------
Total             | 7.61 ms (*) | 7.64 ms
------------------+-------------+--------
---------------------+-------------+---------------
Summary              | before      | after
---------------------+-------------+---------------
valid UTF-8 (strict) | 89.1 us (*) | 89 us
ignore               | 6.65 ms (*) | 473 us (-93%)
replace              | 7.81 ms (*) | 561 us (-93%)
surrogateescape      | 8.13 ms (*) | 560 us (-93%)
backslashreplace     | 7.61 ms (*) | 7.64 ms
---------------------+-------------+---------------
Total                | 30.3 ms (*) | 9.32 ms (-69%)
---------------------+-------------+---------------
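A comparable measurement could be sketched with `timeit` as shown below. The contents of the attached bench.py are not reproduced in this thread, so everything here is an assumption for illustration: the input layout (one invalid byte in every 10), the helper names `make_input`, `bench` and `run`, and the repeat counts. Absolute timings will differ from the tables above.

```python
import timeit

HANDLERS = ("strict", "ignore", "replace", "surrogateescape", "backslashreplace")


def make_input(size, valid):
    """Build `size` bytes of test data.

    Assumption: valid input is pure ASCII; invalid input has one
    undecodable byte (0xff) in every 10 bytes.
    """
    if valid:
        return b"a" * size
    chunk = b"abcdefghi\xff"
    return (chunk * (size // len(chunk) + 1))[:size]


def bench(size, handler, number=200, repeat=3):
    """Return the best per-call decode time in seconds."""
    # strict would raise on invalid bytes, so it is timed on valid input.
    data = make_input(size, valid=(handler == "strict"))
    timer = timeit.Timer(lambda: data.decode("utf-8", handler))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run(sizes=(10, 100, 1000, 10000)):
    """Time every handler on every input size."""
    return {(h, s): bench(s, h) for h in HANDLERS for s in sizes}


if __name__ == "__main__":
    for (handler, size), seconds in run().items():
        print(f"{handler:17} {size:6d} bytes  {seconds * 1e6:10.3f} us")
```

Running the script under an interpreter built before and after the change, and comparing the per-handler times, would give a rough equivalent of the before/after columns.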
msg252319 - Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-05 11:44
New changeset 3152e4038d97 by Victor Stinner in branch 'default':
Issue #25301: The UTF-8 decoder is now up to 15 times as fast for error
https://hg.python.org/cpython/rev/3152e4038d97 
msg252320 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-05 11:44
I pushed my optimization. I'm closing the issue.
msg252321 - Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-05 11:49
New changeset 5b9ffea7e7c3 by Victor Stinner in branch 'default':
Issue #25301: Fix compatibility with ISO C90
https://hg.python.org/cpython/rev/5b9ffea7e7c3 
History
Date                 User        Action  Args
2022-04-11 14:58:22  admin       set     github: 69488
2015-10-05 11:49:36  python-dev  set     messages: + msg252321
2015-10-05 11:44:37  vstinner    set     status: open -> closed; resolution: fixed; messages: + msg252320
2015-10-05 11:44:03  python-dev  set     nosy: + python-dev; messages: + msg252319
2015-10-04 08:30:32  vstinner    set     messages: + msg252264
2015-10-04 08:21:20  vstinner    set     files: + bench.py
2015-10-03 00:01:15  vstinner    set     files: + utf8_decoder.patch; keywords: + patch; messages: + msg252181
2015-10-02 14:44:42  vstinner    create
