
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Faster compiling of charset regexpes
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: 19327 Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2013-10-21 12:01 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
re_mk_bitmap.patch serhiy.storchaka, 2013-10-21 12:01 review
re_optimize_charset.patch serhiy.storchaka, 2013-10-24 19:24 review
re_optimize_charset_2.patch serhiy.storchaka, 2013-10-25 21:02 review
Messages (6)
msg200755 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-21 12:01
Here is a patch which speeds up compiling of regular expressions with big charsets.
Microbenchmark:
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"
Unpatched (but with fixed issue19327): 119 msec per loop
Patched: 59.6 msec per loop
Compiling regular expressions with big charsets was the main cause of the slow import of the email.message module (issue11454).
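For readers who want to reproduce a comparable measurement without the internal sre_compile module, here is a minimal sketch that uses only the public re and timeit APIs. The helper name time_compile is hypothetical and not part of any patch in this issue. re.purge() is called before each compile so the pattern cache does not hide the cost; absolute numbers will differ from the ones quoted above.

```python
import re
import timeit

# Same pattern as the microbenchmark above: a character class holding
# every 255th code point from U+0100 up to U+FFFF.
BIG_CHARSET = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))


def time_compile(pattern, number=10):
    """Return average seconds per uncached compile (hypothetical helper)."""
    def run():
        re.purge()           # drop the compiled-pattern cache
        re.compile(pattern)  # forces a real compile
    return timeit.timeit(run, number=number) / number


if __name__ == '__main__':
    print('%.3f msec per compile' % (time_compile(BIG_CHARSET) * 1e3))
```

Because re.purge() clears the whole cache, this measures the public re.compile() path end to end rather than sre_compile.compile() alone.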
msg201166 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-24 19:24
Here is a more complex patch which optimizes charset compiling. It affects small charsets too. Big charsets now support the same optimizations as small charsets. The optimized bitmap can now be used even if the charset contains category items or non-BMP characters.
$ ./python -m timeit "from sre_compile import compile; r = '[0-9]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 457 usec per loop
Patched: 1000 loops, best of 3: 368 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[ \t\n\r\v\f]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 490 usec per loop
Patched: 1000 loops, best of 3: 413 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[0-9A-Za-z_]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 760 usec per loop
Patched: 1000 loops, best of 3: 527 usec per loop
$ ./python -m timeit "from sre_compile import compile; r = r'[^\ud800-\udfff]*'" "compile(r, 0)"
Unpatched: 100 loops, best of 3: 2.07 msec per loop
Patched: 1000 loops, best of 3: 1.44 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[\u0410-\u042f\u0430-\u043f\u0404\u0406\u0407\u0454\u0456\u0457\u0490\u0491]+'" "compile(r, 0)"
Unpatched: 100 loops, best of 3: 8.24 msec per loop
Patched: 100 loops, best of 3: 2.13 msec per loop
$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"
Unpatched: 10 loops, best of 3: 119 msec per loop
Patched: 10 loops, best of 3: 24.1 msec per loop
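To give a rough idea of what a charset bitmap means in this context, the sketch below packs a set of code points below 256 into a single integer used as a 256-bit map, so that membership becomes a bit test. This is an illustration only and not the code from the patch; the names make_bitmap and in_bitmap are hypothetical.

```python
def make_bitmap(codepoints):
    """Pack code points in range(256) into one int used as a 256-bit map."""
    bitmap = 0
    for cp in codepoints:
        if not 0 <= cp < 256:
            raise ValueError('code point %d does not fit a 256-bit map' % cp)
        bitmap |= 1 << cp
    return bitmap


def in_bitmap(bitmap, ch):
    """Return True if the character's bit is set in the map."""
    cp = ord(ch)
    return cp < 256 and bool((bitmap >> cp) & 1)


# Bitmap for the class [0-9A-Za-z_] from the benchmark above.
word = make_bitmap(
    list(range(ord('0'), ord('9') + 1))
    + list(range(ord('A'), ord('Z') + 1))
    + list(range(ord('a'), ord('z') + 1))
    + [ord('_')]
)
```

With this representation, in_bitmap(word, 'a') is a shift and a mask instead of a walk over a list of ranges, which is the general reason a bitmap is attractive for small code point ranges.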
msg201292 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-25 21:02
The updated patch addresses Antoine's comments. It also fixes one bug of my own.
msg201419 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-27 06:22
New changeset d5498d9d9bb0 by Serhiy Storchaka in branch 'default':
Issue #19329: Optimized compiling charsets in regular expressions.
http://hg.python.org/cpython/rev/d5498d9d9bb0 
msg201420 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-27 06:24
Thank you Antoine for your review.
msg230335 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-10-31 11:55
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d 
History
Date User Action Args
2022-04-11 14:57:52 admin set github: 63528
2014-10-31 11:55:20 python-dev set messages: + msg230335
2013-10-27 06:24:34 serhiy.storchaka set status: open -> closed
    resolution: fixed
    messages: + msg201420
    stage: patch review -> resolved
2013-10-27 06:22:02 python-dev set nosy: + python-dev
    messages: + msg201419
2013-10-25 21:02:01 serhiy.storchaka set files: + re_optimize_charset_2.patch
    messages: + msg201292
2013-10-24 19:24:58 serhiy.storchaka set files: + re_optimize_charset.patch
    messages: + msg201166
    title: Faster compiling of big charset regexpes -> Faster compiling of charset regexpes
2013-10-21 12:01:44 serhiy.storchaka set dependencies: + re doesn't work with big charsets
2013-10-21 12:01:18 serhiy.storchaka create