This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011年08月15日 10:06 by vstinner, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| unicode_macros.patch | vstinner, 2011年08月15日 10:06 | review | ||
| Messages (7) | |||
|---|---|---|---|
| msg142108 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年08月15日 10:06 | |
A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.
I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).
These 3 macros have to be checked, I wrote the first one:
#define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
#define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)
I added cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.
#define COMBINE_SURROGATES(ch1, ch2) \
(((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)
HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preproceed to fit in [0; 0xFFFF]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because "*p++" (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ in a macro using its argument more than once).
Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.
I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid an useless test on ch2 on narrow build.
I also added a IS_NONBMP macro just because I prefer macro over hardcoded constants.
|
|||
| msg142109 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年08月15日 10:10 | |
We may use the following unlikely macro for IS_SURROGATE, IS_HIGH_SURROGATE and IS_LOW_SURROGATE: #define likely(x) __builtin_expect(!!(x), 1) #define unlikely(x) __builtin_expect(!!(x), 0) I suppose that we should use microbenchmarks to validate these macros? Should I open a new issue for this idea? |
|||
| msg142111 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年08月15日 10:23 | |
This has been proposed already in #10542 (the issue also has patches). |
|||
| msg142125 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年08月15日 12:11 | |
> HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument > has been preproceed to fit in [0; 0xFFFF]. I added this requirement in > the comment of these macros. The macros should preprocess the argument themselves. It will make the code even simpler. Otherwise +1. |
|||
| msg142148 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年08月15日 21:44 | |
> This has been proposed already in #10542 (the issue also has patches). The two issues are different: this issue is only a refactoring, whereas #10542 adds a new "feature" (function/macro: Py_UNICODE_NEXT). |
|||
| msg142170 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年08月16日 08:19 | |
#10542 proposes the following macros to factor out common code: #define _Py_UNICODE_ISSURROGATE #define _Py_UNICODE_ISHIGHSURROGATE #define _Py_UNICODE_ISLOWSURROGATE #define _Py_UNICODE_JOIN_SURROGATES and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring): #define _Py_UNICODE_NEXT #define _Py_UNICODE_PUT_NEXT Your patch proposes the same 4 macros: #define IS_SURROGATE #define IS_HIGH_SURROGATE #define IS_LOW_SURROGATE #define COMBINE_SURROGATES + 3 additional macros: #define IS_NONBMP #define HIGH_SURROGATE #define LOW_SURROGATE So the two issue looks quite similar to me. |
|||
| msg142174 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2011年08月16日 08:31 | |
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > #10542 proposes the following macros to factor out common code: > #define _Py_UNICODE_ISSURROGATE > #define _Py_UNICODE_ISHIGHSURROGATE > #define _Py_UNICODE_ISLOWSURROGATE > #define _Py_UNICODE_JOIN_SURROGATES > and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring): > #define _Py_UNICODE_NEXT > #define _Py_UNICODE_PUT_NEXT > > Your patch proposes the same 4 macros: > #define IS_SURROGATE > #define IS_HIGH_SURROGATE > #define IS_LOW_SURROGATE > #define COMBINE_SURROGATES > + 3 additional macros: > #define IS_NONBMP > #define HIGH_SURROGATE > #define LOW_SURROGATE > > So the two issue looks quite similar to me. Can we please fold this issue into #10542. We've already had the discussion there. Thanks, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ 2011年10月04日: PyCon DE 2011, Leipzig, Germany 49 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:20 | admin | set | github: 56960 |
| 2011年08月16日 13:22:49 | benjamin.peterson | set | status: open -> closed resolution: duplicate superseder: Py_UNICODE_NEXT and other macros for surrogates |
| 2011年08月16日 08:31:38 | lemburg | set | messages: + msg142174 |
| 2011年08月16日 08:19:06 | ezio.melotti | set | messages: + msg142170 |
| 2011年08月15日 21:44:59 | vstinner | set | messages: + msg142148 |
| 2011年08月15日 21:39:51 | vstinner | set | nosy:
+ belopolsky |
| 2011年08月15日 12:11:28 | pitrou | set | messages: + msg142125 |
| 2011年08月15日 10:23:32 | ezio.melotti | set | messages: + msg142111 |
| 2011年08月15日 10:10:08 | vstinner | set | messages: + msg142109 |
| 2011年08月15日 10:06:24 | vstinner | create | |