
Pygame and Pyglet are both crashing when I get UCS-4 characters:

exceptions.UnicodeError: A Unicode character above '\uFFFF' was found; not supported

How do I filter all of these characters with regex?

phuclv
asked Mar 25, 2016 at 19:23

2 Answers


Although your question asks for a regex, it is not the most appropriate tool. You can iterate over each character in your string and use ord(c) > 0xFFFF to detect problematic characters.
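A minimal sketch of that character-by-character approach, assuming Python 3 (where a str is a sequence of code points). The helper name strip_non_bmp is made up for illustration:

```python
def strip_non_bmp(text):
    """Return text with every code point above U+FFFF removed."""
    # ord(c) gives the code point; anything above 0xFFFF lies outside
    # the Basic Multilingual Plane and is dropped.
    return "".join(c for c in text if ord(c) <= 0xFFFF)


cleaned = strip_non_bmp("Text\u00A0\U0001FFFF")
print(repr(cleaned))
```

Whether to drop the characters or substitute a placeholder depends on what the rendering code should display.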

But if you require a regex, try this (Python 3):

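import re

# Character class covering every code point above U+FFFF
r1 = re.compile("[\U00010000-\U0010FFFF]")
m1 = r1.search("Text\u00A0\U0001FFFF")
print(m1.group())  # the matched character
print(m1.start())  # index where the match begins
print(m1.end())    # index just past the match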

For Python 2, just add "u" before the string literals (to make them unicode).
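Since the question is about filtering rather than locating the characters, the same pattern can be passed to re.sub. This is only a sketch for Python 3; the name filter_non_bmp and the optional replacement argument are illustrative assumptions, not part of the original answer:

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def filter_non_bmp(text, replacement=""):
    """Replace every code point above U+FFFF with `replacement`."""
    return NON_BMP.sub(replacement, text)


print(repr(filter_non_bmp("Text\u00A0\U0001FFFF")))
# Substitute U+FFFD (the replacement character) instead of deleting.
print(repr(filter_non_bmp("smile \U0001F600!", "\uFFFD")))
```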

answered Mar 25, 2016 at 19:32

3 Comments

Does this handle surrogate pairs properly?
Surrogate pairs exist only when the Unicode string is encoded (for example, in UTF-16). After Python has decoded them, they are represented internally as code points. In my example, \U0001FFFF would have to be encoded as a surrogate pair in UTF-16 (but not internally in Python). So the answer is yes. Since your question specifies UCS-4, in which every code point is represented by 32 bits, there should be no surrogate pairs.
@RyanHope: It doesn't handle surrogates: re.sub('[\U00010000-\U0010FFFF]', '', '\udbff\udfff') != ''. Though (normally) you shouldn't get Unicode strings that contain surrogates. See Python issue 18814: Add utilities to "clean" surrogate code points from strings
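If a string might still contain stray surrogate code points, one option is to widen the character class so it covers U+D800 to U+DFFF as well. This is a sketch for Python 3 only, and the names below are hypothetical:

```python
import re

# Lone surrogates (U+D800-U+DFFF) plus everything above U+FFFF.
SURROGATE_OR_NON_BMP = re.compile("[\ud800-\udfff\U00010000-\U0010FFFF]")


def strip_surrogates_and_non_bmp(text):
    """Remove surrogate code points and code points above U+FFFF."""
    return SURROGATE_OR_NON_BMP.sub("", text)


print(repr(strip_surrogates_and_non_bmp("a\udbff\udfffb\U0001F600c")))
```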

The font module might be the real issue here, so I'm not sure how much good filtering with a regex will do you. I would recommend taking a look at the pygame.freetype module, since it does not reject code points above \uFFFF.

To use the pygame.freetype based pygame.ftfont as pygame.font, define the environment variable PYGAME_FREETYPE before the first import of pygame. pygame.ftfont is a pygame.font compatible module that passes all but one of the font module unit tests: it does not have the UCS-2 limitation of the SDL_ttf based font module, so fails to raise an exception for a code point greater than '\uFFFF'. If pygame.freetype is unavailable then the SDL_ttf font module will be loaded instead.
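As a rough sketch of what that looks like in a script: the variable can be set from Python as long as this happens before pygame is first imported. The value "1" is an arbitrary assumption, since the quoted documentation only says the variable must be defined, and the try/except is there only so the snippet also runs where pygame is not installed:

```python
import os

# Must run before the first "import pygame". The value "1" is an
# arbitrary assumption; the docs only require the variable to be defined.
os.environ.setdefault("PYGAME_FREETYPE", "1")

try:
    import pygame
except ImportError:
    pygame = None  # pygame is not installed in this environment

if pygame is not None:
    # Shows which module ended up bound to pygame.font on this build.
    print(pygame.font)
```

On builds that lack pygame.freetype, the quoted text indicates pygame falls back to the SDL_ttf based module, so the UCS-2 limitation would remain.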

http://www.pygame.org/docs/ref/font.html

answered Mar 25, 2016 at 19:44

2 Comments

If you're looking to filter using a regex, then this answer might help: stackoverflow.com/a/3220210/499581
Using pygame.freetype does not appear to be a viable solution on OS X or Windows. None of the pygame binary builds that I have tried include it.
