I'm working on a program that should reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent that range of code points in a regular expression. I want to do something like this:

valid_character = re.compile(u'[\u0000-\u10FFFF]')

and then handle anything that doesn't match it appropriately. However, \u only seems to read the first four hex digits, namely 10FF. Is there another way to represent this code point range or to handle this situation?

This site recommends u"\U0010FFFF", but that doesn't seem to work either:

>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
asked Dec 27, 2014 at 16:36
  • What does your input look like? Python should, by definition, reject any Unicode "character" above U+10FFFF, since they do not exist. Commented Dec 27, 2014 at 16:50
  • It can't be specified with the \u or \U syntax, since characters above U+10FFFF are not valid Unicode. What is the encoding of your file? Provide a sample with the characters you need to filter. Commented Dec 27, 2014 at 16:56
  • The original UTF-8 design allows for 5- and 6-byte UTF-8 encodings so it is possible for someone to generate a file with illegal Unicode characters encoded that way. Commented Dec 27, 2014 at 17:05
  • If you decode a file with UTF-8 that violates the spec, Python will throw an error, so the answer to your question is "just open the file and decode it as UTF-8". Python will handle it if the characters are invalid. Commented Dec 27, 2014 at 17:10
  • There are no Unicode characters and no Unicode code points beyond U+10FFFF, according to the definitions of the Unicode standard. You should rewrite the question. Commented Dec 27, 2014 at 17:49

1 Answer

If you decode a file as UTF-8 and its contents violate the spec, Python raises a UnicodeDecodeError, so the answer to your question is "just open the file and decode it as UTF-8". Python rejects the invalid sequences for you.

Example:

>>> b'\xf4\x8f\xbf\xbf'.decode('utf8')
u'\U0010ffff'
# \xf4\x90\x80\x80 is what U+110000 would encode to in UTF-8, if it were allowed...
>>> len(b'\xf4\x90\x80\x80'.decode('utf8'))
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "D:\dev\Python27\lib\encodings\utf_8.py", line 16, in decode
 return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
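
A minimal sketch of how that check could be wrapped in a function, assuming Python 3 and raw bytes as input. The helper name `is_valid_utf8` is hypothetical and only here for illustration:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode('utf-8')  # 'strict' error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b'\xf4\x8f\xbf\xbf'))  # encodes U+10FFFF, the highest code point
print(is_valid_utf8(b'\xf4\x90\x80\x80'))  # would encode U+110000, out of range
```

The first call prints True and the second prints False. The old 5- and 6-byte forms are rejected the same way, because their lead bytes are not valid in current UTF-8.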
answered Dec 27, 2014 at 17:14

2 Comments

This solves the OP's X-Y problem, so good call on that ... but it left me wondering how to construct OP's regex.
@Jongware, maybe something like in this answer. It finds valid UTF-8 sequences.
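
As for the regex itself, here is one possible sketch, assuming Python 3.3 or later (or a wide Python 2 build), where a string is a sequence of whole code points. The `\u` escape takes exactly four hex digits and `\U` takes exactly eight, so the upper bound has to be written `\U0010FFFF`. The `ord()` error in the question is what a narrow Python 2 build gives, because it stores U+10FFFF as a surrogate pair of length 2. The name `valid_character` comes from the question; the rest is illustrative:

```python
import re

# \u takes four hex digits, \U takes eight.
valid_character = re.compile('[\u0000-\U0010FFFF]')

top = '\U0010FFFF'
print(len(top), hex(ord(top)))               # one code point on Python 3
print(bool(valid_character.fullmatch(top)))  # the upper bound is inside the class

# Python 3 refuses to build a character above U+10FFFF at all.
try:
    chr(0x110000)
except ValueError as exc:
    print('chr(0x110000) failed:', exc)
```

On Python 3 a str can never hold anything above U+10FFFF, so this class matches every single character, lone surrogates included. The filtering that matters happens when the bytes are decoded.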
