Identify code points above U+10FFFF

Question 1

I'm working on a program that should reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent such a range of code points in my regular expression. I want to do something like this:

valid_character = re.compile(u'[\u0000-\u10FFFF]')

and then have anything that doesn't match be handled appropriately. However, \u only seems to consume the first four hex digits, namely 10FF. Is there another way to represent this code point range or handle this situation?

This site recommends u"\U0010FFFF" but that doesn't seem to work either.

>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
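
The TypeError above is characteristic of a "narrow" Python 2 build (one where sys.maxunicode is 0xFFFF): the literal is accepted, but it is stored as a UTF-16 surrogate pair, so the string has length 2 and ord() refuses it. On Python 3.3 and later the same call returns 1114111. A minimal sketch of how such a pair maps back to one code point, written for Python 3; the helper name combine_surrogates is made up for illustration:

```python
import sys

def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two ints) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10FFFF is stored as the pair DBFF DFFF on a narrow build.
top = combine_surrogates(0xDBFF, 0xDFFF)
print(hex(top))             # 0x10ffff
print(hex(sys.maxunicode))  # 0x10ffff on Python 3.3+, 0xffff on a narrow Python 2 build
```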

Question 2

What does your input look like? Python should, by definition, reject any Unicode "character" above U+10FFFF, since no such code points exist.

Question 3

It can't be specified with the \u or \U syntax, since characters above U+10FFFF are not valid Unicode. What is the encoding of your file? Provide a sample with the characters you need to filter.
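
This is easy to confirm from the interpreter. A short sketch, assuming Python 3 (where chr() covers the whole range; Python 2 uses unichr() instead); the names MAX_CODE_POINT and is_code_point are only for illustration. Writing the literal '\U00110000' in source code is rejected outright as a syntax error, so the check below goes through chr():

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode

def is_code_point(n):
    """Return True if the integer n lies within the Unicode code space."""
    return 0 <= n <= MAX_CODE_POINT

print(len(chr(MAX_CODE_POINT)))  # 1: the last code point is a single character

try:
    chr(MAX_CODE_POINT + 1)
except ValueError as exc:
    # chr() refuses anything past U+10FFFF
    print("rejected:", exc)
```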

Question 4

The original UTF-8 design allowed 5- and 6-byte sequences (RFC 3629 later capped UTF-8 at four bytes and U+10FFFF), so it is possible for someone to generate a file with illegal code points encoded that way.
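
As an illustration, here is how such a sequence looks to Python's strict decoder. The byte string below is a 5-byte form in the style of the original design (lead byte 0xF8 followed by four continuation bytes) for the value 0x200000; the helper name decodes_as_utf8 is only for this sketch, which assumes Python 3:

```python
def decodes_as_utf8(data):
    """Return True if data is valid UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

five_byte = b"\xf8\x88\x80\x80\x80"          # old-style 5-byte form of 0x200000
print(decodes_as_utf8(five_byte))            # False: 0xF8 is not a legal start byte
print(decodes_as_utf8(b"\xf4\x8f\xbf\xbf"))  # True: this is U+10FFFF
```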

Question 6

There are no Unicode characters and no Unicode code points beyond U+10FFFF, according to the definitions of the Unicode standard. You should rewrite the question.

Mark Tolonen · Accepted Answer · 2014-12-27 17:14:57Z

If you decode a file as UTF-8 and it violates the spec, Python raises a UnicodeDecodeError, so the answer to your question is "just open the file and decode it as UTF-8". The decoder rejects any sequence that would encode a code point above U+10FFFF.

Example:

>>> b'\xf4\x8f\xbf\xbf'.decode('utf8')
u'\U0010ffff'
# UTF-8 equivalent to \U00110000...
>>> len(b'\xf4\x90\x80\x80'.decode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
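
In a program you would normally catch the exception rather than let it propagate. A minimal sketch of that, assuming Python 3 and input that is already in memory as bytes; check_utf8 is a hypothetical helper name, and the start, end, and reason attributes it reads are standard on UnicodeDecodeError:

```python
def check_utf8(data):
    """Return (text, None) on success or (None, description) on invalid UTF-8."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]
        return None, "bytes %r at offset %d: %s" % (bad, exc.start, exc.reason)

good = b"ok \xf4\x8f\xbf\xbf"  # ends with U+10FFFF
bad = b"ok \xf4\x90\x80\x80"   # would be U+110000, one past the limit

print(check_utf8(good)[1])  # None
print(check_utf8(bad)[1])   # a message locating the offending bytes
```

The exact wording of the error message differs between Python versions, so it is safer to branch on the exception than on its text.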

This solves the OP's X-Y problem, so good call on that ... but it left me wondering how to construct OP's regex.
@Jongware, maybe something like in this answer. It finds valid UTF-8 sequences.
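
The linked answer is not reproduced here. As a rough sketch of the idea, a commonly cited byte-level pattern for well-formed UTF-8 looks like the following; treat it as an illustration, not as the pattern from that answer. It assumes Python 3.4+ (for fullmatch), and the names ONE_SEQUENCE, VALID_UTF8, and is_valid_utf8 are chosen for this example. On Python 3.3+ the OP's character class can also be written directly as '[\u0000-\U0010FFFF]', though it then matches every character a str can hold, so it filters nothing.

```python
import re

# One well-formed UTF-8 sequence: code points up to U+10FFFF,
# no surrogates, no overlong forms.
ONE_SEQUENCE = rb"""
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, straight
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16, up to U+10FFFF
"""
VALID_UTF8 = re.compile(rb"(?:" + ONE_SEQUENCE + rb")*", re.VERBOSE)

def is_valid_utf8(data):
    """Return True if the whole byte string is well-formed UTF-8."""
    return VALID_UTF8.fullmatch(data) is not None

print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))  # True: U+10FFFF
print(is_valid_utf8(b"\xf4\x90\x80\x80"))  # False: would be U+110000
```

For plain validation, decoding with the built-in codec as in the answer is simpler; a pattern like this is mainly useful when you need to locate valid runs inside otherwise mixed bytes.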
