This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011年10月11日 23:14 by pitrou, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| find_max_char.patch | pitrou, 2011年10月11日 23:14 | review | ||
| find_max_char2.patch | pitrou, 2011年10月11日 23:22 | review | ||
| find_max_char3.patch | pitrou, 2011年10月11日 23:32 | review | ||
| find_max_char4.patch | pitrou, 2011年10月11日 23:50 | review | ||
| find_max_char5.patch | vstinner, 2011年10月12日 21:10 | |||
| Messages (12) | |||
|---|---|---|---|
| msg145372 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月11日 23:14 | |
This patch optimizes scanning for the max character width in an unicode buffer. Micro-benchmarking some worst case situations: $ ./python -m timeit -s "x='é'+'x'*100000" "x[1:]" -> before: 10000 loops, best of 3: 74.9 usec per loop -> after: 100000 loops, best of 3: 15.5 usec per loop $ ./python -m timeit -s "x='\U00010000'+'x'*100000" "x[1:]" -> before: 10000 loops, best of 3: 138 usec per loop -> after: 10000 loops, best of 3: 71.3 usec per loop |
|||
| msg145373 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月11日 23:22 | |
I hadn't noticed the STRINGLIB_SIZEOF_CHAR constant. Reuse it instead of adding STRINGLIB_CHAR_SIZE. |
|||
| msg145374 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月11日 23:32 | |
Slightly cleaned up patch after Victor's comments in private. |
|||
| msg145375 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年10月11日 23:42 | |
find_max_char() returns 0x10000 instead of 0x10FFFF, which may be wrong (or at least, surprising). You may add a max_char variable using other macros like MAX_CHAR_ASCII, MAX_CHAR_UCS1, ..., which will be set at the same time than mask. Or restore your if (ret >= 0x10000) return 0x10ffff; else return ret;. Constants look inconsistent: + const STRINGLIB_CHAR *unrolled_end = begin + (n & ~ (Py_ssize_t) 7); + p += 4; You may use STRINGLIB_CHAR in: + Py_UCS4 bits = p[0] | p[1] | p[2] | p[3]; |
|||
| msg145376 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月11日 23:50 | |
Ok, updated patch. |
|||
| msg145377 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年10月12日 00:11 | |
> Ok, updated patch. "ret = ~mask + 1;" looks wrong: (~0xFFFFFF80+1) gives 128, not 127. I don't see why you need: + if (ret < 128) + return 127; + if (ret < 256) + return 255; #undef ASCII_CHAR_MASK should be #undef UCS1_ASCII_CHAR_MASK #error Invalid STRINGLIB_SIZEOF_CHAR (must be 1, 2 or 4) should be #error Invalid STRINGLIB_SIZEOF_CHAR (must be 2 or 4) Why do you need these forward declarations? It's maybe related to another patch? static PyObject * +unicode_fromascii(const unsigned char *s, Py_ssize_t size); +static PyObject * +_PyUnicode_FromUCS1(const unsigned char *s, Py_ssize_t size); +static PyObject * +_PyUnicode_FromUCS2(const Py_UCS2 *s, Py_ssize_t size); +static PyObject * +_PyUnicode_FromUCS4(const Py_UCS4 *s, Py_ssize_t size); (You kept Py_UCS4 for "Py_UCS4 bits", but Py_UCS4 is maybe just fine.) By the way, your function rocks :-) Use bit masks is a great idea, especially your "bits = p[0] | p[1] | p[2] | p[3]" "hack". |
|||
| msg145378 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月12日 00:24 | |
> > Ok, updated patch. > > "ret = ~mask + 1;" looks wrong: (~0xFFFFFF80+1) gives 128, not 127. That's on purpose, since the mask has just matched. If 0xFFFFFF80 matches, then the max char can't be 127, it has to be at least 128. > I don't see why you need: > > + if (ret < 128) > + return 127; > + if (ret < 256) > + return 255; Because otherwise you're returning 0xFFFF in the following line: if (ret < 0x10000) return 0xFFFF; > #undef ASCII_CHAR_MASK should be #undef UCS1_ASCII_CHAR_MASK Ha, funny that gcc doesn't say anything. I also forgot to #under MASK_*. > #error Invalid STRINGLIB_SIZEOF_CHAR (must be 1, 2 or 4) > should be > #error Invalid STRINGLIB_SIZEOF_CHAR (must be 2 or 4) Well, no. 1 is a legal value, even though it's handled elsewhere, so the error message is right. > Why do you need these forward declarations? It's maybe related to another patch? It's because I'm moving the stringlib #includes up in unicodeobject.c, and stringlib calls these functions. |
|||
| msg145427 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年10月12日 20:21 | |
Without the patch: python3.2 -m timeit 'x="é"+"x"*10000' 'x[1:]' 100000 loops, best of 3: 2.18 usec per loop python3.3 -m timeit 'x="é"+"x"*10000' 'x[1:]' 100000 loops, best of 3: 6.68 usec per loop |
|||
| msg145428 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年10月12日 20:22 | |
With find_max_char4.patch: python3.3 -m timeit 'x="é"+"x"*10000' 'x[1:]' 100000 loops, best of 3: 1.96 usec per loop |
|||
| msg145432 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年10月12日 21:10 | |
find_max_char5.patch: - don't use adjusted ~mask+1: "precompute" the right max_char - rename findwidth.h to find_max_char.h - add some #undef |
|||
| msg145437 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年10月12日 22:08 | |
New changeset 9c7d3207fc15 by Antoine Pitrou in branch 'default': Issue #13155: Optimize finding the optimal character width of an unicode string http://hg.python.org/cpython/rev/9c7d3207fc15 |
|||
| msg145438 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年10月12日 22:08 | |
> find_max_char5.patch: > - don't use adjusted ~mask+1: "precompute" the right max_char > - rename findwidth.h to find_max_char.h > - add some #undef Thank you, I've committed this version. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:22 | admin | set | github: 57364 |
| 2011年10月12日 22:08:58 | pitrou | set | status: open -> closed resolution: fixed messages: + msg145438 stage: resolved |
| 2011年10月12日 22:08:07 | python-dev | set | nosy:
+ python-dev messages: + msg145437 |
| 2011年10月12日 21:10:12 | vstinner | set | files:
+ find_max_char5.patch messages: + msg145432 |
| 2011年10月12日 20:22:36 | vstinner | set | messages: + msg145428 |
| 2011年10月12日 20:21:30 | vstinner | set | messages: + msg145427 |
| 2011年10月12日 00:24:31 | pitrou | set | messages: + msg145378 |
| 2011年10月12日 00:11:44 | vstinner | set | messages: + msg145377 |
| 2011年10月11日 23:50:38 | pitrou | set | files:
+ find_max_char4.patch messages: + msg145376 |
| 2011年10月11日 23:42:17 | vstinner | set | nosy:
+ vstinner messages: + msg145375 |
| 2011年10月11日 23:32:19 | pitrou | set | files:
+ find_max_char3.patch messages: + msg145374 |
| 2011年10月11日 23:22:36 | pitrou | set | files:
+ find_max_char2.patch messages: + msg145373 |
| 2011年10月11日 23:14:18 | pitrou | create | |