Issue 17694: Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61894

classification

Title:	Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation
Type:	Stage:	resolved
Components:	Versions:	Python 3.4

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	python-dev, serhiy.storchaka, vladistan, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-04-10 23:53 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue17694.patch	vladistan, 2013-04-13 21:41	Patch to fix the issue
benchmark.py	vladistan, 2013-04-13 21:42	Benchmark module
writer_minlen.patch	vstinner, 2013-04-14 01:38	review

Messages (10)
msg186537	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-10 23:53
The _PyUnicodeWriter API is used in many functions to create Unicode strings, especially decoders. Performance is not optimal: it is not possible to specify the minimum length of the buffer when overallocation is disabled. Fixing this may help #17693, for example.
msg186857	Author: Vladimir Korolev (vladistan) *	Date: 2013-04-13 21:41
We triaged this issue at the CPython hackathon in Boston. Here is a patch for the issue. We only tested on Mac OS X 10.8.3, which has a zoned allocator, so the memory profile is exactly the same with or without this patch. The running time seems to be slightly better with the patch: the benchmark we used runs for about 5.6 seconds with the patch vs. 5.9 seconds without it. We ran the benchmark multiple times and the results seem to be consistent.

Here are the results of the benchmarking and memory profile testing:

	With Fix	Without Fix
Mem	1535 nodes (6377296 bytes)	1535 nodes (6378144 bytes)
Time	5.68 sec	5.9 sec

The memory profile was measured with the Mac OS X 'heap' command. The timings come from the attached benchmark module. The original benchmark module was taken from http://bugs.python.org/file25558/benchmark.py and modified to test this issue.
msg186858	Author: Vladimir Korolev (vladistan) *	Date: 2013-04-13 21:42
For some reason I can't figure out how to attach multiple files, so here is the benchmark module.
msg186859	Author: Vladimir Korolev (vladistan) *	Date: 2013-04-13 21:49
I'd like to note that the actual patch was written by Adam.Duston (http://bugs.python.org/user17706). I just verified the results, measured the time/memory performance, and submitted the patch.
msg186875	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-14 01:38
The attached patch changes the _PyUnicodeWriter_Init() API: it now takes only one argument (the writer). Minimum length and overallocation must be configured using attributes. The problem with the old API was that it was not possible to configure minimum length and overallocation separately.

Disable overallocation in CJK decoders: only set the minimum length.

Other changes:

* Add a min_char attribute to _PyUnicodeWriter. It is currently unused. Using _PyUnicodeWriter_Prepare(writer, 0, min_char) is different because it allocates the buffer immediately, and calling _PyUnicodeWriter_Prepare() with size=0 is not supported (it does not widen the buffer if maxchar is bigger).
* unicode_decode_call_errorhandler_writer() only enables overallocation if the replaced string is longer than 1 character.
* PyUnicode_DecodeRawUnicodeEscape() and _PyUnicode_DecodeUnicodeInternal() set the minimum length instead of preallocating the whole buffer. This avoids the need to widen the buffer if the first written character is the biggest character. It also avoids a useless memory allocation if the decoder fails before the first write.
* _PyUnicode_DecodeUnicodeInternal() checks for integer overflow when computing the minimum length.
* _PyUnicodeWriter_Update() is now responsible for setting size to zero if readonly is set.

The goal is to delay the first allocation until the first real write, so that the maximum character and the buffer size can be chosen correctly. If the buffer is allocated before the first write, even the first write must widen and/or enlarge the buffer.
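The effect of delaying the first allocation can be illustrated with a toy model in Python. This is only a sketch: ToyWriter, its prepare() and write() methods, the allocations counter, and the 25% growth factor are hypothetical names and values chosen for illustration. It is not the C implementation of _PyUnicodeWriter; it only counts how often a buffer would have to be allocated, enlarged, or widened.

```python
class ToyWriter:
    """Toy model of a growable text buffer. NOT the real _PyUnicodeWriter."""

    def __init__(self, min_length=0, overallocate=False):
        self.min_length = min_length      # lower bound applied when the buffer grows
        self.overallocate = overallocate  # add slack on growth (hypothetical 25%)
        self.buffer = None                # allocated lazily
        self.size = 0                     # capacity, in characters
        self.pos = 0                      # number of characters written
        self.maxchar = 0                  # widest code point the buffer can hold
        self.allocations = 0              # allocations + enlargements + widenings

    def prepare(self, length, maxchar):
        """Make room for `length` more characters no wider than `maxchar`."""
        needed = self.pos + length
        new_size = self.size
        if needed > self.size:
            new_size = needed
            if self.overallocate:
                new_size += new_size // 4
            new_size = max(new_size, self.min_length)
        if new_size != self.size or maxchar > self.maxchar:
            # Enlarging and widening both copy the data into a new buffer.
            old = self.buffer or []
            self.buffer = old[:self.pos] + [None] * (new_size - self.pos)
            self.size = new_size
            self.maxchar = max(self.maxchar, maxchar)
            self.allocations += 1

    def write(self, text):
        if not text:
            return
        self.prepare(len(text), max(map(ord, text)))
        for ch in text:
            self.buffer[self.pos] = ch
            self.pos += 1

    def finish(self):
        return "".join(self.buffer[:self.pos]) if self.buffer else ""


def run(chunks, eager_ascii=False, **options):
    """Write all chunks and return (result, number of allocations)."""
    writer = ToyWriter(**options)
    if eager_ascii:
        # Allocate up front, before the widest character is known.
        writer.prepare(sum(len(c) for c in chunks), 127)
    for chunk in chunks:
        writer.write(chunk)
    return writer.finish(), writer.allocations


chunks = ["\u20ac", "a", "b", "c"]
print(run(chunks))                    # no minimum length: grows on every write
print(run(chunks, eager_ascii=True))  # eager allocation: first write must widen
print(run(chunks, min_length=4))      # minimum length only: a single allocation
```

In this model, setting only min_length lets the first write pick both the size and the character width, which is the behaviour the message above argues for.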
msg186876	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-14 01:41
I don't see how issue17694.patch can speed up Python, because min_length is zero when overallocation is disabled. The difference may be noise in the benchmark script.
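One way to judge whether a gap such as 5.68 vs. 5.9 seconds is noise is to repeat each measurement and compare the gap with the run-to-run spread. The following is a minimal sketch using the standard timeit module; the helper name compare() and the "gap no larger than spread" heuristic are assumptions for illustration and are not taken from the attached benchmark.py.

```python
import timeit


def compare(func_a, func_b, repeat=5, number=1000):
    """Time two callables several times and report gap vs. spread."""
    times_a = timeit.repeat(func_a, repeat=repeat, number=number)
    times_b = timeit.repeat(func_b, repeat=repeat, number=number)
    # The minimum is usually the least disturbed measurement.
    gap = abs(min(times_a) - min(times_b))
    # Spread: how much a single callable varies between runs.
    spread = max(max(times_a) - min(times_a), max(times_b) - min(times_b))
    return {
        "min_a": min(times_a),
        "min_b": min(times_b),
        "gap": gap,
        "spread": spread,
        # Crude heuristic (an assumption): a gap within the spread may be noise.
        "maybe_noise": gap <= spread,
    }


data = b"abc" * 1000


def decode():
    return data.decode("cp1252")


# Both sides run the same code here, so any gap reported is pure noise.
print(compare(decode, decode))
```

A real comparison would run the same benchmark under the patched and unpatched interpreter builds rather than two callables in one process.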
msg186877	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-14 01:54
PyUnicode_DecodeUnicodeEscape() should set writer.min_length instead of using _PyUnicodeWriter_Prepare(), but the following assertion fails (because writer.size is zero by default):

    assert(writer.pos < writer.size || (writer.pos == writer.size && c == '\n'));

I don't understand this assertion, so I don't know how to modify it.
msg186878	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-14 01:59
PyUnicode_DecodeCharmap() still uses _PyUnicodeWriter_Prepare() (even with my patch). It may be interesting to benchmark min_length vs. prepare. If min_length is not slower, it should be used instead of prepare to avoid widening the buffer when the first written character is non-ASCII, for example b'\x80'.decode('cp1252').
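The cp1252 example can be checked from Python. The snippet below is a small illustration; the helper first_char_is_wide() is a hypothetical name, not part of CPython. It shows that a single input byte decodes to a character outside the Latin-1 range, so a buffer sized for narrow characters before the first write would need widening.

```python
def first_char_is_wide(data, encoding):
    """Return True if the first decoded character is above U+00FF."""
    text = data.decode(encoding)
    return bool(text) and ord(text[0]) > 0xFF


text = b"\x80".decode("cp1252")
print(hex(ord(text)))                          # code point of the euro sign
print(first_char_is_wide(b"\x80", "cp1252"))   # one byte in, wide character out
print(first_char_is_wide(b"abc", "cp1252"))    # pure ASCII input stays narrow
```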
msg187207	Author: Roundup Robot (python-dev) (Python triager)	Date: 2013-04-17 21:05
New changeset edf029fc9591 by Victor Stinner in branch 'default':
Close #17694: Add minimum length to _PyUnicodeWriter
http://hg.python.org/cpython/rev/edf029fc9591
msg187211	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2013-04-17 22:30
The commit changes the default value of min_length when overallocation is enabled: it does not use at least 100 characters anymore. It did not directly introduce a bug, but the regression comes from 7ed9993d53b4 (use _PyUnicodeWriter for Unicode decoders).

The following commits should fix these issues.

changeset:   83435:94d1c3bdb79c
tag:         tip
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Thu Apr 18 00:25:28 2013 +0200
files:       Objects/unicodeobject.c
description:
Fix bug in Unicode decoders related to _PyUnicodeWriter

Bug introduced by changesets 7ed9993d53b4 and edf029fc9591.

changeset:   83434:7eb52460c999
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Wed Apr 17 23:58:16 2013 +0200
files:       Objects/unicodeobject.c
description:
Fix typo in unicode_decode_call_errorhandler_writer()

Bug introduced by changeset 7ed9993d53b4.

History
Date	User	Action	Args
2022-04-11 14:57:44	admin	set	github: 61894
2013-04-17 22:30:32	vstinner	set	messages: + msg187211
2013-04-17 21:05:38	python-dev	set	status: open -> closed nosy: + python-dev messages: + msg187207 resolution: fixed stage: resolved
2013-04-14 01:59:33	vstinner	set	messages: + msg186878
2013-04-14 01:54:06	vstinner	set	messages: + msg186877
2013-04-14 01:41:55	vstinner	set	messages: + msg186876
2013-04-14 01:38:55	vstinner	set	files: + writer_minlen.patch messages: + msg186875
2013-04-13 21:49:14	vladistan	set	messages: + msg186859
2013-04-13 21:42:58	vladistan	set	files: + benchmark.py messages: + msg186858
2013-04-13 21:41:17	vladistan	set	files: + issue17694.patch nosy: + vladistan messages: + msg186857 keywords: + patch
2013-04-10 23:53:21	vstinner	create
