Pygame and Pyglet are both crashing when I get UCS-4 characters:
exceptions.UnicodeError: A Unicode character above '\uFFFF' was found; not supported
How do I filter all of these characters with regex?
2 Answers
Although your question asks for a regex, it is not the most appropriate tool. You can iterate over each character in your string and use ord(c) > 0xFFFF to detect the problematic characters.
But if you do require a regex, try this (Python 3):
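A minimal sketch of the non-regex approach, assuming Python 3, where each element of a str is one full code point. The function name strip_non_bmp and the sample string are made up for illustration.

```python
def strip_non_bmp(text):
    """Return text with every character above U+FFFF dropped."""
    # ord(c) gives the code point; anything above 0xFFFF lies outside
    # the Basic Multilingual Plane and triggers the UnicodeError.
    return "".join(c for c in text if ord(c) <= 0xFFFF)


sample = "Text\u00A0\U0001F600!"
cleaned = strip_non_bmp(sample)
# ascii() keeps the output printable on any console encoding.
print(ascii(cleaned))
```

On a narrow Python 2 build a character above U+FFFF is stored as two surrogate code units, so this check would not behave the same way there.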
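import re
# Character class covering every code point above the Basic Multilingual Plane
r1 = re.compile("[\U00010000-\U0010FFFF]")
m1 = r1.search("Text\u00A0\U0001FFFF")
print(m1.group())  # the matched character, U+1FFFF
print(m1.start())  # 5
print(m1.end())    # 6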
For Python 2, just add a "u" prefix to the string literals (to make them unicode).
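Since the question is about filtering rather than finding, the same character class can be passed to sub instead of search. This is only a sketch for Python 3; the names NON_BMP and remove_non_bmp are illustrative, not part of any library.

```python
import re

# Same character class as above: every code point from U+10000 to U+10FFFF.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def remove_non_bmp(text, replacement=""):
    """Replace each character above U+FFFF with `replacement`."""
    return NON_BMP.sub(replacement, text)


stripped = remove_non_bmp("Text\u00A0\U0001FFFF")
# Substituting U+FFFD (REPLACEMENT CHARACTER) keeps the string length stable.
marked = remove_non_bmp("a\U0001F600b", "\uFFFD")
print(ascii(stripped))
print(ascii(marked))
```

Replacing with a placeholder instead of an empty string can be preferable when the rendered text has to keep its layout.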
3 Comments
\U0001FFFF would have to be encoded by a surrogate pair in UTF-16 (but not internally in Python). So the answer is yes. Since your question specifies UCS-4, in which all code points are represented by 32 bits, UCS-4 should not have surrogate pairs.
re.sub('[\U00010000-\U0010FFFF]', '', '\udbff\udfff') != ''. Though (normally) you shouldn't get Unicode strings that contain surrogates. See Python issue 18814: Add utilities to "clean" surrogate code points from strings.
The font might actually be the real issue here, so I'm not sure what good filtering with regex is going to do you. I would recommend taking a look at the pygame.freetype module, since it does not limit using code points above the range of \uFFFF.
To use the pygame.freetype based pygame.ftfont as pygame.font, define the environment variable PYGAME_FREETYPE before the first import of pygame. pygame.ftfont is a pygame.font compatible module that passes all but one of the font module unit tests: it does not have the UCS-2 limitation of the SDL_ttf based font module, so it fails to raise an exception for a code point greater than '\uFFFF'. If pygame.freetype is unavailable then the SDL_ttf font module will be loaded instead.
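One way to define the variable from inside the program itself is sketched below. The value "1" is an assumption: the passage above only says the variable has to be defined, not what it should contain. The try/except is there only so the snippet also runs on a machine without pygame.

```python
import os

# Must run before the first "import pygame" anywhere in the program.
# The value "1" is an assumption; the text above only requires that
# the variable is defined.
os.environ["PYGAME_FREETYPE"] = "1"

try:
    import pygame  # with the variable set, pygame.font should be the freetype-backed module
    pygame_available = True
except ImportError:
    # pygame is not installed here; nothing else to do in this sketch.
    pygame_available = False

print(pygame_available)
```

Setting the variable in the shell before launching the script should have the same effect and avoids any dependence on import order.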