This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010年01月03日 17:09 by flox, last changed 2022年04月11日 14:56 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
benchmark_split_replace.diff | flox, 2010年01月04日 15:25 | benchmark results | ||
issue7622_test_splitlines.diff | flox, 2010年01月05日 07:36 | |||
stringlib_split_replace_v4c.diff | flox, 2010年01月05日 21:43 | Patch, apply to trunk | ||
stringlib_split_replace_py3k.diff | flox, 2010年01月05日 21:50 | Patch, apply to py3k |
Messages (24) | |||
---|---|---|---|
msg97168 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月03日 17:09 | |
Content of the patch: - removed code duplication between bytearray/string/unicode - new header "stringlib/split.h" with common methods: stringlib_split/_rsplit/_splitlines - added "maxcount" argument to "stringlib_count" - better performance for split/replace unicode methods Benchmark coming soon... |
|||
msg97172 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月03日 17:51 | |
The "split.h" file is missing from your patch. |
|||
msg97173 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月03日 18:00 | |
You're right. Oups. |
|||
msg97174 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月03日 18:13 | |
The patch looks wrong for bytearrays. They are mutable, so you shouldn't return the original object as an optimization. Here is the current (unpatched) behaviour: >>> a = bytearray(b"abc") >>> b, = a.split() >>> b is a False On the other hand, you aren't doing this optimization at all in the general case of stringlib_split() and stringlib_rsplit(), while it could be done. |
|||
msg97184 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月03日 23:47 | |
Mutable methods split() splitlines() and partition() fixed. And added optimization for all immutables methods. |
|||
msg97194 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 07:35 | |
added "Makefile.pre.in". |
|||
msg97197 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 08:59 | |
There's some reference leaking somewhere... Will investigate. ~ $ ./python Lib/test/regrtest.py -R 2:3: test_unicode test_unicode leaked [7, 7, 7] references, sum=21 |
|||
msg97204 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 12:21 | |
Refleak fixed in PyUnicode_Splitlines. |
|||
msg97208 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年01月04日 13:02 | |
A few comments on coding style: * please keep the existing argument formats as they are, e.g. count = countstring(self_s, self_len, from_s, from_len, 0, self_len, FORWARD, maxcount); or /* helper macro to fixup start/end slice values */ -#define FIX_START_END(obj) \ - if (start < 0) \ - start += (obj)->length; \ - if (start < 0) \ - start = 0; \ - if (end > (obj)->length) \ - end = (obj)->length; \ - if (end < 0) \ - end += (obj)->length; \ - if (end < 0) \ - end = 0; +#define ADJUST_INDICES(start, end, len) \ + if (end > len) end = len; \ + else if (end < 0) { end += len; if (end < 0) end = 0; } \ + if (start < 0) { start += len; if (start < 0) start = 0; } and use similar formatting for the replacement function calls/macros * make sure that the name of a symbol matches the value, e.g. #define LONG_BITMASK (LONG_BIT-1) #define BLOOM(mask, ch) ((mask & (1 << ((ch) & LONG_BITMASK)))) LONG_BITMASK has a value of 0x1f (31) - that's a single byte, not a long value. In this case, 0x1f is an implementation detail of the simplified Bloom filter used for set membership tests in the Unicode implementation. When adjusting the value to be platform dependent, please check that the implementation does work for platforms that have more than 31 bits available for (signed) longs. Note that you don't need to expose that value separately if you stick to using BLOOM() directly. * use BLOOM() macro in fastsearch.c * when declaring variables with initial values, keep these on separate lines, e.g. don't use this style: Py_ssize_t i, j, count=0; PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)), *sub; instead, write: Py_ssize_t i, j; Py_ssize_t count=0; PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)) PyObject *sub; * always place variable declarations at the top of a function, not into the function body: +stringlib_split( + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len, + const STRINGLIB_CHAR* sep, Py_ssize_t sep_len, Py_ssize_t maxcount + ) +{ + if (sep_len == 0) { + PyErr_SetString(PyExc_ValueError, "empty separator"); + return NULL; + } + else if (sep_len == 1) + return stringlib_split_char(str_obj, str, str_len, sep[0], maxcount); + + Py_ssize_t i, j, pos, count=0; + PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)), *sub; + if (list == NULL) + return NULL; * function declarations should not put parameters on new lines: +stringlib_splitlines( + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len, + int keepends + ) +{ instead use this style: -static -PyObject *rsplit_substring(PyUnicodeObject *self, - PyObject *list, - PyUnicodeObject *substring, - Py_ssize_t maxcount) -{ |
|||
msg97211 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 13:54 | |
> A few comments on coding style: Thank you for your remarks. I will update the patch accordingly. > * make sure that the name of a symbol matches the value, e.g. > > #define LONG_BITMASK (LONG_BIT-1) > #define BLOOM(mask, ch) ((mask & (1 << ((ch) & LONG_BITMASK)))) > > LONG_BITMASK has a value of 0x1f (31) - that's a single byte, not > a long value. In this case, 0x1f is an implementation detail of > the simplified Bloom filter used for set membership tests in the > Unicode implementation. > > When adjusting the value to be platform dependent, please check > that the implementation does work for platforms that have > more than 31 bits available for (signed) longs. > > Note that you don't need to expose that value separately if > you stick to using BLOOM() directly. Since the same value is used to build the mask, I assume it's better to keep the value around (or use (LONG_BIT-1) directly?). mask |= (1 << (ptr[i] & LONG_BITMASK)); s/LONG_BITMASK/BLOOM_BITMASK/ is not confusing? |
|||
msg97212 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 14:46 | |
> * function declarations should not put parameters on new lines: > > +stringlib_splitlines( > + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len, > + int keepends > + ) > +{ I copied the style of "stringlib/partition.h" for this part. Should I update style of "partition.h" too? |
|||
msg97213 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月04日 14:54 | |
> I copied the style of "stringlib/partition.h" for this part. > Should I update style of "partition.h" too? No, it's ok for stringlib to have its own consistent style and there's no reason to change it IMO. More interesting would be benchmark results showing how much this improves the various methods :-) |
|||
msg97214 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 15:15 | |
Patch updated: * coding style * added macros BLOOM_ADD to unicodeobject.c and fastsearch.h (and removed LONG_BITMASK) |
|||
msg97215 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 15:25 | |
And now, the figures. There's no gain for the string methods. Some unicode methods are faster: split/rsplit/replace: Most significant results: --- bench_slow.log Trunk +++ bench_fast.log Patched string unicode (ms) (ms) comment ========== late match, 100 characters -13.30 20.51 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100) -16.12 29.88 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100) +13.27 14.38 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100) +16.19 17.61 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100) ========== quick replace multiple character match -4.51 159.78 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*100) +3.67 7.30 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*100) ========== quick replace single character match -3.73 50.61 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*100) +3.72 7.18 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*100) (full benchmark diff is attached) And we save 1000 lines of code cumulated (stringobject.c/unicodeobject.c/bytearrayobject.c) |
|||
msg97216 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月04日 15:41 | |
I don't think you should remove such blocks: -/* - Local variables: - c-basic-offset: 4 - indent-tabs-mode: nil - End: -*/ There probably are people relying on them :-) |
|||
msg97218 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年01月04日 15:55 | |
Florent Xicluna wrote: > > > > Florent Xicluna <laxyf@yahoo.fr> added the comment: > > >> >> * function declarations should not put parameters on new lines: >> >> >> >> +stringlib_splitlines( >> >> + PyObject* str_obj, const STRINGLIB_CHAR* str, Py_ssize_t str_len, >> >> + int keepends >> >> + ) >> >> +{ > > > > I copied the style of "stringlib/partition.h" for this part. > > Should I update style of "partition.h" too? I'd prefer if you change the coding style to what we use elsewhere in Python C code. See http://www.python.org/dev/peps/pep-0007/ for more C coding style suggestions. |
|||
msg97219 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2010年01月04日 16:25 | |
I think we should use whatever style is currently being used in the code. If we want to go back through this code (or any other code) and PEP7-ify it, that should be a separate task. Alternately, we could PEP7-ify it first, then apply these changes. |
|||
msg97220 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年01月04日 16:53 | |
Eric Smith wrote: > > > > Eric Smith <eric@trueblade.com> added the comment: > > > > I think we should use whatever style is currently being used in the code. If we want to go back through this code (or any other code) and PEP7-ify it, that should be a separate task. > > > > Alternately, we could PEP7-ify it first, then apply these changes. For any new files added, PEP 7 should always be used. For PEP7-ifying the existing code, we could open a separate ticket or just apply the change as separate patch. |
|||
msg97224 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 21:52 | |
Fixed a problem with the splitlines optimization: use PyList_Append instead of PyList_SET_ITEM because there's no preallocated list in this case. |
|||
msg97232 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月04日 23:12 | |
The test case for the previous issue. |
|||
msg97267 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月05日 17:26 | |
This looks generally good. Can you produce a separate patch for py3k? stringobject.c has been replaced with bytesobject.c there. |
|||
msg97280 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月05日 21:43 | |
Slight update: * Objects/unicodeobject.c - moved STRINGLIB_ISLINEBREAK to unicodedefs.h - removed FROM_UNICODE: use STRINGLIB_IS_UNICODE instead * Objects/stringlib/find.h - use STRINGLIB_WANT_CONTAINS_OBJ in find.h (similar to current py3k impl.) |
|||
msg97281 - (view) | Author: Florent Xicluna (flox) * (Python committer) | Date: 2010年01月05日 21:50 | |
And the Py3k patch. (note: previous update v4b -> v4c minimize the differences between Py2 and Py3 implementations) |
|||
msg97698 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年01月13日 08:09 | |
Committed in r77461 (trunk), r77462 (py3k). Thank you very much! |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022年04月11日 14:56:56 | admin | set | github: 51871 |
2010年04月01日 11:26:43 | flox | link | issue1613130 superseder |
2010年01月13日 08:09:56 | pitrou | set | status: open -> closed resolution: fixed messages: + msg97698 |
2010年01月09日 23:51:24 | flox | link | issue7607 superseder |
2010年01月05日 21:50:26 | flox | set | files:
+ stringlib_split_replace_py3k.diff messages: + msg97281 versions: + Python 3.2 |
2010年01月05日 21:43:08 | flox | set | files:
+ stringlib_split_replace_v4c.diff messages: + msg97280 |
2010年01月05日 21:40:13 | flox | set | files: - stringlib_split_replace_v4b.diff |
2010年01月05日 17:26:23 | pitrou | set | messages: + msg97267 |
2010年01月05日 07:36:08 | flox | set | files: + issue7622_test_splitlines.diff |
2010年01月05日 07:35:39 | flox | set | files: - issue7622_test_splitlines.diff |
2010年01月04日 23:12:18 | flox | set | files:
+ issue7622_test_splitlines.diff messages: + msg97232 |
2010年01月04日 22:24:02 | flox | set | files: - stringlib_split_replace_v4.diff |
2010年01月04日 21:52:45 | flox | set | files:
+ stringlib_split_replace_v4b.diff messages: + msg97224 |
2010年01月04日 16:53:24 | lemburg | set | messages: + msg97220 |
2010年01月04日 16:25:58 | eric.smith | set | messages: + msg97219 |
2010年01月04日 15:55:29 | lemburg | set | messages: + msg97218 |
2010年01月04日 15:41:34 | pitrou | set | messages: + msg97216 |
2010年01月04日 15:25:50 | flox | set | files:
+ benchmark_split_replace.diff messages: + msg97215 |
2010年01月04日 15:15:50 | flox | set | files: - stringlib_split_replace_v3c.diff |
2010年01月04日 15:15:27 | flox | set | files:
+ stringlib_split_replace_v4.diff messages: + msg97214 |
2010年01月04日 14:54:36 | pitrou | set | messages: + msg97213 |
2010年01月04日 14:46:01 | flox | set | messages: + msg97212 |
2010年01月04日 13:54:52 | flox | set | messages: + msg97211 |
2010年01月04日 13:14:37 | eric.smith | set | nosy:
+ eric.smith |
2010年01月04日 13:03:02 | lemburg | set | nosy:
+ lemburg messages: + msg97208 components: + Unicode |
2010年01月04日 12:22:03 | flox | set | files: - stringlib_split_replace_v3b.diff |
2010年01月04日 12:21:56 | flox | set | files:
+ stringlib_split_replace_v3c.diff messages: + msg97204 stage: patch review |
2010年01月04日 08:59:15 | flox | set | messages: + msg97197 |
2010年01月04日 07:36:18 | flox | set | files: - stringlib_split_replace_v3.diff |
2010年01月04日 07:36:05 | flox | set | files:
+ stringlib_split_replace_v3b.diff messages: + msg97194 |
2010年01月03日 23:48:00 | flox | set | files: - stringlib_split_replace_v2.diff |
2010年01月03日 23:47:16 | flox | set | files:
+ stringlib_split_replace_v3.diff messages: + msg97184 |
2010年01月03日 21:53:46 | ezio.melotti | set | priority: normal nosy: + ezio.melotti |
2010年01月03日 18:13:42 | pitrou | set | messages: + msg97174 |
2010年01月03日 18:00:40 | flox | set | files:
+ stringlib_split_replace_v2.diff messages: + msg97173 |
2010年01月03日 17:59:17 | flox | set | files: - stringlib_split_replace.diff |
2010年01月03日 17:51:12 | pitrou | set | nosy:
+ pitrou messages: + msg97172 |
2010年01月03日 17:09:44 | flox | create |