
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).
Type: resource usage Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, malin, mrabarnett, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2022-04-08 06:54 by malin, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 32411 open malin, 2022-04-08 07:02
Messages (1)
msg416960 - (view) Author: Ma Lin (malin) * Date: 2022-04-08 06:54
These changes reduce sizeof(match_context):
- 32-bit build: 36 bytes, no change.
- 64-bit build: 72 bytes -> 56 bytes.
sre uses a stack of `match_context` structs to simulate recursive calls; a smaller struct brings:
 - deeper recursion within the same stack budget
 - lower memory consumption
 - fewer memory reallocations
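The stack-of-contexts idea can be sketched in Python (a hypothetical illustration only; the real engine is C code in Modules/_sre.c, and the names below are invented for this sketch): each pending "call" becomes one small context record on an explicit stack, so shrinking the record lets the same memory budget hold more simulated frames.

```python
# Hypothetical sketch: flattening a recursive matcher into an explicit
# stack of small context records.  The real sre engine is far more
# involved; this only illustrates the one-record-per-call shape.

def match_star(text, literal):
    """Greedily match (literal)* at the start of text using an explicit stack."""
    stack = [0]              # each entry is a tiny "context": just a position here
    pos = 0
    while text.startswith(literal, pos):
        pos += len(literal)
        stack.append(pos)    # one record pushed per simulated recursive call
    # the deepest position reached is the greedy match length
    return stack[-1]

print(match_star("ababab!", "ab"))   # -> 6
```

With a real engine, each record also carries pattern position, jump target, and similar fields, which is exactly what the patch shrinks.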
Here is a test: if the stack size is limited to 1 GiB, the maximum usable value of n is:
 re.match(r'(ab)*', n * 'ab') # need to save MARKs
 72 bytes: n = 11,184,808
 64 bytes: n = 12,201,609
 56 bytes: n = 13,421,770
 re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
 72 bytes: n = 13,421,770
 64 bytes: n = 14,913,078
 56 bytes: n = 16,777,213
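At a small scale, the difference between the two test patterns is visible from the match object: the capturing version saves MARKs (group boundaries, with the group holding the last repetition), while the non-capturing version has no groups at all.

```python
import re

n = 1000
s = n * 'ab'

m1 = re.match(r'(ab)*', s)     # capturing: the engine saves MARKs per repetition
m2 = re.match(r'(?:ab)*', s)   # non-capturing: no MARKs to save

print(m1.end(), m1.group(1))   # -> 2000 ab  (group 1 holds the last repetition)
print(m2.end(), m2.re.groups)  # -> 2000 0
```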
 
1,073,741,823 capturing groups should be enough for almost all users.
If it were limited to 16,383 (a 2-byte integer), the context size could be reduced further, but some program-generated patterns may have more capturing groups than that.
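Patterns with very large group counts typically come from code, not from hand-written regexes. A sketch of how such a pattern arises, and how the compiled pattern reports its group count (the 5,000 here is an arbitrary illustration, far below the proposed cap):

```python
import re

# A program-generated pattern with many capturing groups; the compiled
# pattern exposes the count via its .groups attribute.
n_groups = 5000
pattern = re.compile('(a)' * n_groups)
print(pattern.groups)            # -> 5000

m = pattern.match('a' * n_groups)
print(m.group(n_groups))         # -> a
```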
1️⃣ Performance:
Before
 regex_dna: Mean +- std dev: 149 ms +- 1 ms
 regex_effbot: Mean +- std dev: 2.22 ms +- 0.02 ms
 regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
 my benchmark[1]: 13.9 sec +- 0.0 sec
Commit 1. limit the maximum capture group to 1,073,741,823
 regex_dna: Mean +- std dev: 150 ms +- 1 ms
 regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
 regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
 my benchmark: 13.8 sec +- 0.0 sec
Commit 2. further reduce sizeof(SRE(match_context))
 regex_dna: Mean +- std dev: 150 ms +- 1 ms
 regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
 regex_v8: Mean +- std dev: 22.2 ms +- 0.1 ms
 my benchmark: 13.8 sec +- 0.1 sec
If the types of toplevel/jump were further changed from int to char, sizeof(match_context) in 32-bit builds would be reduced from 36 to 32 bytes (still 56 in 64-bit builds). But it's slower on 64-bit builds, so I didn't adopt it:
 regex_dna: Mean +- std dev: 150 ms +- 1 ms
 regex_effbot: Mean +- std dev: 2.18 ms +- 0.01 ms
 regex_v8: Mean +- std dev: 22.4 ms +- 0.1 ms
 my benchmark: 14.1 sec +- 0.0 sec
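The figures above come from the pyperformance suite and the harness linked at [1]; a rough stand-in can be sketched with the stdlib `timeit` module (illustrative only, not the author's methodology, and the pattern/data sizes are arbitrary):

```python
import re
import timeit

# Rough micro-benchmark in the spirit of the numbers above: time repeated
# matches of a MARK-saving pattern over a largish subject string.
pattern = re.compile(r'(ab)*')
data = 100_000 * 'ab'

elapsed = timeit.timeit(lambda: pattern.match(data), number=20)
print(f'{elapsed:.3f} s for 20 matches')
```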
2️⃣ The type of match_context.count is Py_ssize_t:
 - If it were changed to a 4-byte integer, some engine code would need to be modified.
 - If it is kept as Py_ssize_t, SRE_MAXREPEAT may be >= 4 GiB in future versions.
   Currently SRE_MAXREPEAT can't be >= 4 GiB.
So the type of match_context.count is unchanged.
[1] My re benchmark; it uses 16 patterns to process 100 MiB of text data:
https://github.com/animalize/re_benchmarks 
History
Date User Action Args
2022-04-11 14:59:58 admin set github: 91412
2022-04-08 07:02:49 malin set keywords: + patch
 stage: patch review
 pull_requests: + pull_request30437
2022-04-08 06:54:08 malin create
