homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Simplify Py_CHARMASK
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, lemburg, loewis, pitrou, skrah
Priority: normal Keywords: needs review, patch

Created on 2010年06月20日 13:12 by skrah, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py_charmask.patch skrah, 2010年06月20日 13:12
py_charmask2.patch skrah, 2010年07月17日 17:40
unicodeobject_py_charmask.patch skrah, 2010年07月19日 16:53
Messages (12)
msg108234 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年06月20日 13:12
This feature request is extracted from issue 9020, where Py_CHARMASK(c)
was used on EOF, producing different results depending on whether __CHAR_UNSIGNED__ is defined or not.
The preferred fix for issue 9020 is to check for EOF before using
Py_CHARMASK, so this is only loosely related.
I've looked through the source tree to determine how Py_CHARMASK
is meant to be used. It seems that it is only used in situations
where one would actually want to cast a char to an unsigned char,
like isspace((unsigned char)c).
Simplifications:
1) Python.h already enforces that an unsigned char is 8 bit wide. Thus,
 ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
 the same results.
2) There is no reason not to do the cast when __CHAR_UNSIGNED__ is
 defined (it will be a no-op).
msg108235 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年06月20日 13:27
> Thus,
> ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
> the same results.
If it's the case, it's probably optimized away by the compiler anyway.
> There is no reason not to do the cast when __CHAR_UNSIGNED__ is
> defined (it will be a no-op).
Agreed. It also reduces the opportunity for bugs :)
msg108513 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年06月24日 14:47
Antoine Pitrou <report@bugs.python.org> wrote:
> > Thus,
> > ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
> > the same results.
> 
> If it's the case, it's probably optimized away by the compiler anyway.
Yes, the asm output is the same. I was more concerned about readability.
Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974
UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.
> > There is no reason not to do the cast when __CHAR_UNSIGNED__ is
> > defined (it will be a no-op).
> 
> Agreed. It also reduces the opportunity for bugs :)
Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__.
What should the comment say about the intended argument? I've examined all
occurrences in the tree and almost always Py_CHARMASK is used with either
char arguments or int arguments that are directly derived from a char, so
no EOF issues arise.
Exceptions:
===========
Index: Objects/unicodeobject.c
===================================================================
--- Objects/unicodeobject.c (revision 82192)
+++ Objects/unicodeobject.c (working copy)
@@ -8417,6 +8417,7 @@
 else if (c >= '0' && c <= '9') {
 prec = c - '0';
 while (--fmtcnt >= 0) {
+ /* XXX: c and *fmt are Py_UNICODE */
 c = Py_CHARMASK(*fmt++);
 if (c < '0' || c > '9')
 break;
Index: Modules/_json.c
===================================================================
--- Modules/_json.c (revision 82192)
+++ Modules/_json.c (working copy)
@@ -603,6 +603,7 @@
 }
 }
 else {
+ /* XXX: c is Py_UNICODE */
 char c_char = Py_CHARMASK(c);
 chunk = PyString_FromStringAndSize(&c_char, 1);
 if (chunk == NULL) {
msg108514 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年06月24日 14:53
> In r61936 the cast was added.
Actually r61987 
msg108521 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010年06月24日 15:25
Stefan Krah wrote:
> 
> Stefan Krah <stefan-usenet@bytereef.org> added the comment:
> 
> Antoine Pitrou <report@bugs.python.org> wrote:
>>> Thus,
>>> ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
>>> the same results.
>>
>> If it's the case, it's probably optimized away by the compiler anyway.
> 
> Yes, the asm output is the same. I was more concerned about readability.
> 
> Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974
> UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.
> 
>>> There is no reason not to do the cast when __CHAR_UNSIGNED__ is
>>> defined (it will be a no-op).
>>
>> Agreed. It also reduces the opportunity for bugs :)
> 
> Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__.
> 
> What should the comment say about the intended argument? I've examined all
> occurrences in the tree and almost always Py_CHARMASK is used with either
> char arguments or int arguments that are directly derived from a char, so
> no EOF issues arise.
Why not just make the casts in those cases explicit instead of
using Py_CHARMASK ?
> Exceptions:
> ===========
> 
> Index: Objects/unicodeobject.c
> ===================================================================
> --- Objects/unicodeobject.c (revision 82192)
> +++ Objects/unicodeobject.c (working copy)
> @@ -8417,6 +8417,7 @@
> else if (c >= '0' && c <= '9') {
> prec = c - '0';
> while (--fmtcnt >= 0) {
> + /* XXX: c and *fmt are Py_UNICODE */
> c = Py_CHARMASK(*fmt++);
If both are Py_UNICODE, why do you need the cast at all ?
Note that the above use also appears to be wrong, since if *fmt
is Py_UNICODE, the following if() will also match if the higher
order bytes of the Py_UNICODE value are non-0.
> if (c < '0' || c > '9')
> break;
> 
> Index: Modules/_json.c
> ===================================================================
> --- Modules/_json.c (revision 82192)
> +++ Modules/_json.c (working copy)
> @@ -603,6 +603,7 @@
> }
> }
> else {
> + /* XXX: c is Py_UNICODE */
> char c_char = Py_CHARMASK(c);
In this case you should get a compiler warning due to the
different signedness of the arguments.
Same comment as above: the high order bytes of c are lost
in this conversion.
Why can't _json.c use one of the existing Unicode conversion
APIs ?
> chunk = PyString_FromStringAndSize(&c_char, 1);
> if (chunk == NULL) {
> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue9036>
> _______________________________________
> _______________________________________________
> Python-bugs-list mailing list
> Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com 
msg108528 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年06月24日 16:27
> Ok, let's say we use ((unsigned char)((c) & 0xff)) also for
> __CHAR_UNSIGNED__.
> 
> What should the comment say about the intended argument?
That it's either in [-128; 127] or in [0; 255] ?
> Index: Objects/unicodeobject.c
> ===================================================================
> --- Objects/unicodeobject.c (revision 82192)
> +++ Objects/unicodeobject.c (working copy)
> @@ -8417,6 +8417,7 @@
> else if (c >= '0' && c <= '9') {
> prec = c - '0';
> while (--fmtcnt >= 0) {
> + /* XXX: c and *fmt are Py_UNICODE */
> c = Py_CHARMASK(*fmt++);
This is a funny bug:
>>> u"%.1\u1032f" % (1./3)
u'0.333333333333'
> Index: Modules/_json.c
> ===================================================================
> --- Modules/_json.c (revision 82192)
> +++ Modules/_json.c (working copy)
> @@ -603,6 +603,7 @@
> }
> }
> else {
> + /* XXX: c is Py_UNICODE */
> char c_char = Py_CHARMASK(c);
This block can only be entered if c <= 0x7f (`has_unicode` is false), so it's not a problem.
msg110585 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年07月17日 17:40
[Marc-Andre]
> Why not just make the casts in those cases explicit instead of
> using Py_CHARMASK ?
I agree that this would be the cleanest solution. It's harder to get
someone to review a larger patch though.
Antoine, I understood that you would prefer to leave the mask. Could I
apply the second version of the patch?
It will probably have an immediate benefit. On the ARM buildbot char
is unsigned and gcc emits a whole page of spurious 'array subscript has
type 'char' warnings.
msg110728 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年07月19日 10:37
> Antoine, I understood that you would prefer to leave the mask. Could I
> apply the second version of the patch?
It looks ok to me.
msg110780 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年07月19日 16:53
Antoine, thanks! Committed in r82966, r82968, r82969 and r82970.
Could we fix the unicodeobject.c bug on the fly? I think the patch should
do, unless non-ascii digits are supported. But in that case several other
changes would be needed.
msg110783 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年07月19日 17:11
Eric, is the unicode patch ok for you?
msg110785 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010年07月19日 17:20
Yes, the unicode patch looks okay to me.
msg110891 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年07月20日 12:13
Antoine, Eric, thanks for looking at the patch. unicodeobject.c patch
committed in r82978, r82979, r82980 and r82984.
As Antoine pointed out, the _json.c issue is not a problem. Also, it is
not present in py3k. The reliable buildbots look ok, so I think we can
close this.
History
Date User Action Args
2022年04月11日 14:57:02adminsetgithub: 53282
2010年07月20日 12:13:36skrahsetstatus: open -> closed
resolution: fixed
messages: + msg110891

stage: patch review -> resolved
2010年07月19日 17:20:54eric.smithsetmessages: + msg110785
2010年07月19日 17:11:15pitrousetmessages: + msg110783
2010年07月19日 16:53:52skrahsetfiles: + unicodeobject_py_charmask.patch

messages: + msg110780
2010年07月19日 10:37:22pitrousetmessages: + msg110728
2010年07月17日 17:40:27skrahsetfiles: + py_charmask2.patch

messages: + msg110585
2010年06月24日 16:27:59pitrousetmessages: + msg108528
2010年06月24日 15:25:04lemburgsetnosy: + lemburg
messages: + msg108521
2010年06月24日 14:53:22skrahsetmessages: + msg108514
2010年06月24日 14:47:29skrahsetmessages: + msg108513
2010年06月20日 13:27:02pitrousetmessages: + msg108235
2010年06月20日 13:12:06skrahcreate

AltStyle によって変換されたページ (->オリジナル) /