
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: makeunicodedata.py does not support Unihan digit data
Type: Stage:
Components: Unicode Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, lemburg, loewis
Priority: normal Keywords:

Created on 2010-11-29 11:10 by lemburg, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (13)
msg122786 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 11:10
The script only patches numeric data into the table (field 8), but does not update the digit field (field 7).
As a result, ideographs used for Chinese digits are not recognized as digits and not evaluated by int(), long() and float():
 http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
>>> unicode('三', 'utf-8')
u'\u4e09'
>>> int(unicode('三', 'utf-8'))
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u4e09' in position 0: invalid decimal Unicode string
> <stdin>(1)<module>()
>>> import unicodedata
>>> unicodedata.digit(unicode('三', 'utf-8'))
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: not a digit
The code point refers to the digit 3.
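For readers on Python 3, where `str` is already Unicode and `long()` no longer exists, the report above can be reproduced roughly as follows. This is a sketch added for illustration, not part of the original message; the helper name `probe` is made up, and U+0663 is included only as a contrasting character that does carry decimal and digit values.

```python
import unicodedata


def probe(ch):
    """Return what int() and the unicodedata lookups say about one character.

    Hypothetical helper for illustration; a lookup that raises ValueError
    is reported as None.
    """
    result = {}
    for name, func in (("int", int),
                       ("decimal", unicodedata.decimal),
                       ("digit", unicodedata.digit),
                       ("numeric", unicodedata.numeric)):
        try:
            result[name] = func(ch)
        except ValueError:
            result[name] = None
    return result


# U+4E09 (CJK ideograph "three") versus U+0663 (ARABIC-INDIC DIGIT THREE)
print(probe("\u4e09"))
print(probe("\u0663"))
```

Only the `numeric` lookup succeeds for U+4E09, while all four succeed for U+0663.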
msg122809 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:15
The code point is also not listed as decimal digit (relevant for the int() decimal parsing):
>>> unicodedata.decimal(unicode('三', 'utf-8'))
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: not a decimal
This is the relevant part of the script:
    for line in open(unihan):
        if not line.startswith('U+'):
            continue
        code, tag, value = line.split(None, 3)[:3]
        if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                       'kOtherNumeric'):
            continue
        value = value.strip().replace(',', '')
        i = int(code[2:], 16)
        # Patch the numeric field
        if table[i] is not None:
            table[i][8] = value
The decimal column is not set for code points that have a kPrimaryNumeric value set. Position table[i][8] refers to the
numeric database entry, which correctly gives:
>>> unicodedata.numeric(unicode('三', 'utf-8'))
3.0
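The parsing step quoted above can be tried in isolation. The following is a minimal Python 3 sketch of the same loop, assuming a tab-separated Unihan layout; the function name `read_unihan_numeric` and the sample lines are made up for illustration, and the result is collected in a plain dict instead of being patched into the script's `table`.

```python
import io

NUMERIC_TAGS = ("kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric")


def read_unihan_numeric(lines):
    """Map code point -> numeric value string for the three numeric tags."""
    values = {}
    for line in lines:
        if not line.startswith("U+"):
            continue  # skip comments and blank lines
        code, tag, value = line.split(None, 3)[:3]
        if tag not in NUMERIC_TAGS:
            continue  # unrelated Unihan property
        values[int(code[2:], 16)] = value.strip().replace(",", "")
    return values


# Made-up sample lines in the layout the loop expects (not real file content).
sample = io.StringIO(
    "# comment line\n"
    "U+4E09\tkPrimaryNumeric\t3\n"
    "U+4E09\tkMandarin\tsan1\n"
    "U+58F9\tkAccountingNumeric\t1\n"
)
print(read_unihan_numeric(sample))
```

As in the original loop, only a numeric value is recorded per code point; nothing here touches a decimal or digit column.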
msg122811 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:16
Here's a quick overview of the fields that are set for U+4E09:
http://www.fileformat.info/info/unicode/char/4e09/index.htm 
msg122812 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 15:17
This is the definition of kPrimaryNumeric
http://ftp.lanet.lv/ftp/mirror/unicode/5.0.0/ucd/Unihan.html#kPrimaryNumeric 
msg122827 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 16:45
I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
I am also not sure whether this is a bug or a feature request. Martin?
msg122839 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 18:29
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I am adding #10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
> 
> I am also not sure whether this is a bug or a feature request. Martin?
I consider this a bug (which is why I added Python 2.7 to the list
of versions), since those code points need to be mapped to decimal
and digit as well (see the references I posted; and compare ).
Both Chinese and Japanese use the 4E00 ff. code points as decimal
code points.
msg122851 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 19:04
On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
>
> I consider this a bug (which is why I added Python 2.7 to the list
> of versions), since those code points need to be mapped to decimal
> and digit as well (see the references I posted; and compare ).
>
I don't disagree. However using Unicode 5.2.0 instead of the latest
6.0.0 may be considered a bug as well. The practical issue is whether
to maintain two separate versions of Tools/unicode for 3.x and 2.7 or
merge 3.x changes back to 2.7 and support 3.x using 2to3. Another
option is to simply use only 2.7 (or only 3.x) with Tools/unicode and
control the differences between 2.7 and 3.x using a command
line switch.
msg122859 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 19:52
> I am adding #10552 as a dependency because I think we should fix
> unicode data generation in 3.x before adding new features to the
> scripts.
> 
> I am also not sure whether this is a bug or a feature request.
> Martin?
I fail to see the relevance of gencodec to this issue (and, as
you see in my comment to #10552, I very much fail to see the relevance
of that issue, or of gencodec in the first place).
msg122862 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 20:10
This is not a bug, see
http://www.unicode.org/reports/tr44/#Numeric_Value
Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or columns (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see
 http://www.unicode.org/reports/tr44/#Numeric_Type_Han
Therefore, it is correct that digit() raises a ValueError for U+4e09.
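The column rule described above can be observed from Python through the three `unicodedata` lookups. The sketch below (Python 3) infers a label from which lookups succeed; the function name `numeric_type` is hypothetical, and the inference is an illustration of the rule rather than an official API for the Numeric_Type property.

```python
import unicodedata


def numeric_type(ch):
    """Guess Numeric_Type from which unicodedata lookups succeed.

    Decimal fills all three columns, Digit fills digit and numeric,
    Numeric fills only numeric, so the first successful lookup decides.
    """
    for label, lookup in (("Decimal", unicodedata.decimal),
                          ("Digit", unicodedata.digit),
                          ("Numeric", unicodedata.numeric)):
        try:
            lookup(ch)
        except ValueError:
            continue
        return label
    return None


# ASCII digit, superscript two, vulgar fraction one half, U+4E09, a letter
for ch in ("3", "\u00b2", "\u00bd", "\u4e09", "a"):
    print("U+%04X" % ord(ch), numeric_type(ch))
```

U+4E09 comes out as Numeric, in the same class as the fraction U+00BD, which matches the explanation that Unihan characters are never Decimal or Digit.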
msg122863 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 20:12
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>>
>> I consider this a bug (which is why I added Python 2.7 to the list
>> of versions), since those code points need to be mapped to decimal
>> and digit as well (see the references I posted; and compare ).
>>
> 
> I don't disagree. However using Unicode 5.2.0 instead of the latest
> 6.0.0 may be considered a bug as well. 
No, since we only ever change the UCD version once per Python
release.
Note that those standards don't have a version number just for the
fun of it. Each version is a standard of its own and only
patch level updates will go into it.
It's not a bug to stick to an older UCD version.
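Which UCD version a given interpreter ships with can be checked at runtime. A minimal sketch for Python 3, using only attributes of the standard `unicodedata` module; the printed value of the first line depends on the interpreter in use.

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
print(unicodedata.unidata_version)

# The module also exposes a frozen copy of the Unicode 3.2.0 database,
# with the same lookup functions as the main module.
print(unicodedata.ucd_3_2_0.unidata_version)
```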
> The practical issue is whether
> to maintain two separate versions of Tools/unicode for 3.x and 2.7 or
> merge 3.x changes back to 2.7 and support 3.x using 2to3. Another
> option is to simply use only 2.7 (or only 3.x) with Tools/unicode and
> control the differences between 2.7 and 3.x using a command
> line switch.
I'm not sure whether the effort is worth it. We don't run those
tools often enough to invest much time into keeping them in sync
between 2.x and 3.x.
msg122866 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-29 20:22
> I fail to see the relevance of gencodec to this issue ...
Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.
msg122867 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-29 20:42
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> This is not a bug, see
> 
> http://www.unicode.org/reports/tr44/#Numeric_Value
> 
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or columns (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see
> 
> http://www.unicode.org/reports/tr44/#Numeric_Type_Han
> 
> Therefore, it is correct that digit() raises a ValueError for U+4e09.
You're right. I guess this is a bug in the UCD or TR44/TR38 itself.
It looks like the numeric properties are not separated in the
Unihan database in the same way they are for the standard UCD.
Unihan separates based on usage context, whereas UCS takes
a parsing approach.
msg122868 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-29 20:42
> Thanks for the explanation. I wrongly assumed that "make all" is the
> way to regenerate both unicodedata and the encodings and that the two
> are interdependent.
Ah. I never use the Makefile.
History
Date                 User          Action  Args
2022-04-11 14:57:09  admin         set     github: 54784
2010-11-29 20:46:24  loewis        set     status: open -> closed; resolution: not a bug
2010-11-29 20:42:58  loewis        set     messages: + msg122868
2010-11-29 20:42:30  lemburg       set     messages: + msg122867
2010-11-29 20:22:31  belopolsky    set     dependencies: - Tools/unicode/gencodec.py error; messages: + msg122866
2010-11-29 20:12:50  lemburg       set     messages: + msg122863
2010-11-29 20:10:55  loewis        set     messages: + msg122862
2010-11-29 19:52:15  loewis        set     messages: + msg122859
2010-11-29 19:04:54  belopolsky    set     messages: + msg122851
2010-11-29 18:29:00  lemburg       set     messages: + msg122839
2010-11-29 16:49:02  ezio.melotti  set     nosy: + ezio.melotti
2010-11-29 16:45:33  belopolsky    set     nosy: + loewis, belopolsky; dependencies: + Tools/unicode/gencodec.py error; messages: + msg122827
2010-11-29 15:17:22  lemburg       set     messages: + msg122812
2010-11-29 15:16:14  lemburg       set     messages: + msg122811
2010-11-29 15:15:36  lemburg       set     messages: + msg122809
2010-11-29 11:10:54  lemburg       create
