Issue 21165: Optimize str.translate() for replacement with substrings and non-ASCII strings

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65364

classification

Title:	Optimize str.translate() for replacement with substrings and non-ASCII strings
Type:	performance
Stage:	patch review
Components:
Versions:	Python 3.5

process

Dependencies:
Superseder:
Status:	closed
Resolution:	out of date
Assigned To:
Nosy List:	ezio.melotti, josh.r, serhiy.storchaka, sir-sigurd, vstinner
Priority:	normal
Keywords:

Created on 2014-04-07 09:10 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
translate_script.py	vstinner, 2014-04-07 09:10

Messages (6)
msg215677	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-04-07 09:10
In issue #21118, I optimized str.translate() in Python 3.5 for ASCII 1:1 mapping and ASCII deletion. My optimization is not used if a character is replaced with a string (e.g. "abc".translate({ord('a'): "xxx"})) or for non-ASCII strings. translate_script.py is a simple benchmark for 1:1 mapping. It should be extended to also benchmark replacement strings.
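A benchmark along the lines described above could look like the following. This is only a sketch and not the attached translate_script.py; the names `TEXT`, `TABLES`, and `bench` are made up for illustration, and the timings will vary by machine and Python version.

```python
import timeit

# Sample input: a pure ASCII string, long enough for timing to be meaningful.
TEXT = "abcdefghij" * 1000

# Three kinds of translation table (all names here are illustrative):
TABLES = {
    # 1:1 ASCII mapping, the case optimized in issue #21118
    "1:1 ASCII": str.maketrans("abc", "xyz"),
    # 1:n mapping, one character replaced with a longer string
    "1:n": {ord("a"): "xxx"},
    # ASCII character replaced with a non-ASCII character
    "non-ASCII": {ord("a"): "\u00e9"},
}


def bench(table, number=200, repeat=3):
    """Return the best total time (seconds) for `number` translate() calls."""
    timer = timeit.Timer(lambda: TEXT.translate(table))
    return min(timer.repeat(repeat=repeat, number=number))


def main():
    for name, table in TABLES.items():
        print(f"{name:10s} {bench(table):.6f} s")


if __name__ == "__main__":
    main()
```

Comparing the three rows shows whether the 1:n and non-ASCII cases are noticeably slower than the 1:1 ASCII case on a given build.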
msg215681	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2014-04-07 09:24
codecs.charmap_build() (PyUnicode_BuildEncodingMap()) creates a C array ("a three-level trie") for fast lookup. It is used with codecs.charmap_encode() for 8-bit encodings. We may reuse it.
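For readers unfamiliar with these functions, the sketch below shows how the standard 8-bit codecs use them today. It borrows the decoding table from `encodings.cp1252` purely as an example input; it does not show how the lookup structure might be reused by str.translate(), which remains an open idea in this issue.

```python
import codecs
import encodings.cp1252

# The cp1252 codec module exposes its 256-character decoding table:
# index = byte value, character = decoded code point.
decoding_table = encodings.cp1252.decoding_table

# charmap_build() turns that table into a compact lookup object
# (code point -> byte) used for encoding.
encoding_map = codecs.charmap_build(decoding_table)

# charmap_encode() encodes text through the lookup object and returns
# a (bytes, number_of_characters_consumed) tuple.
encoded, consumed = codecs.charmap_encode("\u20ac5", "strict", encoding_map)
print(encoded, consumed)
```

The euro sign (U+20AC) maps to byte 0x80 in cp1252, so this is a case where the lookup has to handle a code point outside the latin-1 range.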
msg218233	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2014-05-10 19:43
Aren't there similar benchmarks in the benchmarks repo? If not, would it be reasonable to add this there?
msg253002	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2015-10-14 16:20
This issue was more of a reminder for myself, and I'm no longer interested in implementing this optimization. The method is already fast enough.
msg253026	Author: Josh Rosenberg (josh.r) * (Python triager)	Date: 2015-10-15 02:06
I actually have a patch (still requires a little cleanup) that makes translations for non-ASCII and 1-n translations substantially faster. I've been delaying posting it largely because it makes significant changes to str.maketrans so it returns a special mapping that can be used far more efficiently than Python dicts. The effects of this are:
1. str.maketrans takes a little longer to run (when mappings are defined outside the latin-1 range, it takes about 6x as much time), and technically, the runtime is unbounded. I'm using "Perfect Hashing" to make a chaining free lookup table, but this involves randomly generating the parameters until they produce a collision free set of mappings; the number of rounds of generation is probabilistically very small (IIRC, for pathological cases, you'd still have a >50% chance of success for any random set of parameters, so the odds of failing to map after more than a dozen or so attempts is infinitesimal).
2. The resulting object, while it obeys the contract for collections.abc.Mapping, is not a dict, nor is it mutable, which would be a backwards incompatible change. Under the current design, the mapping uses ~2x the space as the old dict (largely because it actually stores the dict internally to preserve references and simplify basic lookups).
In exchange for the longer time to do str.maketrans and the slightly higher memory, it provides:
1. Improved runtime for ASCII->Unicode (and vice-versa) of roughly 15-20x.
2. Similar improvements for 1-n translations (regardless of whether non-ASCII is involved).
3. In general, much more consistent translation performance; the variance based on the contents of the mapping and the contents of the string is much lower, making it behave more like the old Py2 str.translate (and Py3 bytes.translate); translation is almost always faster than any other approach, instead of being a pessimization.
I don't know how to float changes that would make fairly substantial changes to existing APIs though, so I'm not sure how to proceed. I'd like translation to be beneficial (the optimization made in #21118 didn't actually improve my use case of stripping diacritics to convert to ASCII equivalent characters from latin-1 and related characters), but I have no good solutions that don't mess around with the API (I'd considered trying to internally cache "compiled" translation tables like the re module does, but the tables are mutable dicts, so caching can't be based on identity, and can't use the dicts as keys, which makes it difficult).
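The patch itself is not attached to this issue, so the following is only a rough pure-Python illustration of the general idea: pick random hash parameters until the keys land in distinct slots, then expose the result as a read-only mapping. Everything here (`PerfectTable`, `_PRIME`, the quadratic table size) is an assumption made for the sketch and is not taken from the patch. Using a table of n*n slots for n keys is the textbook way to get at least a 50% success chance per attempt.

```python
import random
from collections.abc import Mapping

# Assumption for this sketch: a prime larger than any code point (max 0x10FFFF).
_PRIME = (1 << 31) - 1


class PerfectTable(Mapping):
    """Hypothetical immutable translation table with collision-free lookup."""

    def __init__(self, mapping, rng=None):
        rng = rng or random.Random()
        items = list(mapping.items())
        self._len = len(items)
        # n*n slots: a random universal hash is collision-free with
        # probability >= 1/2, so few retries are expected.
        self._size = max(1, self._len ** 2)
        while True:
            self._a = rng.randrange(1, _PRIME)
            self._b = rng.randrange(0, _PRIME)
            slots = [None] * self._size
            for key, value in items:
                index = self._index(key)
                if slots[index] is not None:
                    break  # collision: retry with new parameters
                slots[index] = (key, value)
            else:
                self._slots = slots
                return

    def _index(self, key):
        return ((self._a * key + self._b) % _PRIME) % self._size

    def __getitem__(self, key):
        entry = self._slots[self._index(key)]
        if entry is None or entry[0] != key:
            # str.translate() leaves the character unchanged on LookupError.
            raise KeyError(key)
        return entry[1]

    def __iter__(self):
        return (entry[0] for entry in self._slots if entry is not None)

    def __len__(self):
        return self._len


# Example use case from the message above: stripping diacritics (1:1 and 1:n).
table = PerfectTable({ord("é"): "e", ord("ü"): "u", ord("ß"): "ss"})
print("Grüße, café".translate(table))
```

Because str.translate() only needs `__getitem__`, an object like this can be passed in place of a dict. A pure-Python version will not be faster than a dict; any speedup would have to come from a C implementation that recognizes the table type.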
msg326788	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2018-10-01 10:26
> I actually have a patch (...)
Please open a new issue, since this one is closed. You can reference this issue from your new issue.

History
Date	User	Action	Args
2022-04-11 14:58:01	admin	set	github: 65364
2018-10-01 10:26:58	vstinner	set	messages: + msg326788
2018-09-30 17:30:19	sir-sigurd	set	nosy: + sir-sigurd
2015-10-15 02:06:04	josh.r	set	messages: + msg253026
2015-10-14 16:20:20	vstinner	set	status: open -> closed resolution: out of date messages: + msg253002
2014-05-10 19:43:26	ezio.melotti	set	nosy: + ezio.melotti messages: + msg218233 stage: patch review
2014-04-07 21:00:01	josh.r	set	nosy: + josh.r
2014-04-07 09:24:43	vstinner	set	messages: + msg215681
2014-04-07 09:10:44	vstinner	create
