I'm working on a program where I should reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent such a range of code points in my regular expression. I want to do something like this
valid_character = re.compile(u'[\u0000-\u10FFFF]')
and then have anything that doesn't match be handled appropriately. However, \u only seems to consume the first four hex digits, namely 10FF, and treats the remaining FF as literal characters. Is there another way to represent this code point range or handle this situation?
This site recommends u"\U0010FFFF" but that doesn't seem to work either.
>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
1 Answer
If you decode a file as UTF-8 and its contents violate the spec, Python raises an error, so the answer to your question is "just open the file and decode it as UTF-8". Byte sequences that would encode a code point above U+10FFFF are rejected with a UnicodeDecodeError.
Example:
>>> b'\xf4\x8f\xbf\xbf'.decode('utf8')
u'\U0010ffff'
# These bytes are what U+110000 would look like in UTF-8; decoding them fails...
>>> len(b'\xf4\x90\x80\x80'.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\dev\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
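If you want a yes/no check rather than a traceback, the same idea can be wrapped in a small helper. This is only a sketch: the name `is_valid_utf8` is made up for illustration, and it assumes you already have the raw bytes in hand.

```python
def is_valid_utf8(data):
    """Return True if `data` (a bytes object) decodes as strict UTF-8.

    `is_valid_utf8` is a hypothetical helper name, not part of the
    standard library.
    """
    try:
        # 'strict' is the default error handler, so malformed or
        # out-of-range sequences raise UnicodeDecodeError.
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The UTF-8 encoding of U+10FFFF, the last valid code point.
print(is_valid_utf8(b'\xf4\x8f\xbf\xbf'))
# The bytes that would encode U+110000, which is out of range.
print(is_valid_utf8(b'\xf4\x90\x80\x80'))
```

If you would rather keep going than reject the whole input, `decode` also accepts an `errors` argument such as `'replace'` or `'ignore'`.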
2 Comments
\u or \U syntax, since characters above U+10FFFF are not valid Unicode. What is the encoding of your file? Provide a sample with the characters you need to filter.
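To illustrate the difference between the two escapes: `\u` takes exactly four hex digits and `\U` takes exactly eight. The sketch below assumes Python 3 or a wide Python 2 build, where `sys.maxunicode` is 0x10FFFF. On a narrow build `u'\U0010FFFF'` is stored as a surrogate pair, which is why `ord()` complains about a string of length 2.

```python
import re
import sys

# \u consumes four hex digits, so this is U+10FF followed by the letters "FF".
four_digit = u'\u10FFFF'
# \U consumes eight hex digits and can name any code point up to U+10FFFF.
eight_digit = u'\U0010FFFF'

print(len(four_digit))
# True on Python 3 and on wide Python 2 builds; False on narrow builds.
print(sys.maxunicode == 0x10FFFF)

# Assumes a build where sys.maxunicode is 0x10FFFF; on such builds the
# character class covers every valid code point.
valid_character = re.compile(u'[\u0000-\U0010FFFF]')
print(bool(valid_character.match(eight_digit)))
```

Note that on such a build this pattern matches every character a string can hold, so it cannot actually reject anything. The out-of-range check has to happen at decode time.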