I have a string (g) on which I run a simple regex to find a number. Somehow the regex doesn't work on that type of string (an encoding issue?), although it works on a "normal" string. What am I missing? The steps, as seen in the REPL, are below:
(The example is a summary of an online poker tournament.)
NOT WORKING:
>>> g
'F\x00u\x00l\x00l\x00 \x00T\x00i\x00l\x00t\x00 \x00P\x00o\x00k\x00e\x00r\x00 \x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00S\x00u\x00m\x00m\x00a\x00r\x00y\x00 \x00$\x002\x00.\x002\x005\x00 \x00H\x00e\x00a\x00d\x00s\x00-\x00U\x00p\x00 \x00S\x00i\x00t\x00 \x00&\x00 \x00G\x00o\x00 \x00(\x002\x005\x000\x005\x005\x005\x009\x001\x004\x00)\x00 \x002\x00-\x007\x00 \x00T\x00r\x00i\x00p\x00l\x00e\x00 \x00D\x00r\x00a\x00w\x00 \x00L\x00i\x00m\x00i\x00t\x00 \x00(\x00T\x00u\x00r\x00b\x00o\x00,\x00 \x00H\x00e\x00a\x00d\x00s\x00 \x00U\x00p\x00)\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00:\x00 \x00$\x002\x00.\x001\x002\x00 \x00+\x00 \x00$\x000\x00.\x001\x003\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00 \x00C\x00h\x00i\x00p\x00s\x00:\x00 \x001\x005\x000\x000\x00\n\x002\x00 \x00E\x00n\x00t\x00r\x00i\x00e\x00s\x00\n\x00T\x00o\x00t\x00a\x00l\x00 \x00P\x00r\x00i\x00z\x00e\x00 \x00P\x00o\x00o\x00l\x00:\x00 \x00$\x004\x00.\x002\x004\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00s\x00t\x00a\x00r\x00t\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x000\x000\x00:\x002\x007\x00 \x00E\x00T\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x001\x004\x00:\x003\x000\x00 \x00E\x00T\x00\n\x00\n\x001\x00:\x00 \x00A\x00n\x00d\x00r\x00e\x00y\x003\x003\x001\x000\x00,\x00 \x00$\x004\x00.\x002\x004\x00\n\x002\x00:\x00 \x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00\n\x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00 \x00i\x00n\x00 \x002\x00n\x00d\x00 \x00p\x00l\x00a\x00c\x00e'
>>> myre = re.compile(u"""\(([0-9]+)\)""",re.UNICODE)
>>> m = myre.search(g)
>>> m.groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
WORKING:
>>> g="Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)"
>>> m = myre.search(g)
>>> m.groups()
('250555914',)
1 Answer
You have UTF-16 encoded data, albeit without a BOM (Byte Order Mark). Decode to Unicode first before attempting to match the regex:
>>> g[:-1].decode('utf-16-le')
u'Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)\nBuy-In: $2.12 + $0.13\nBuy-In Chips: 1500\n2 Entries\nTotal Prize Pool: $4.24\nTournament started: 2013/03/08 06:00:27 ET\nTournament finished: 2013/03/08 06:14:30 ET\n\n1: Andrey3310, $4.24\n2: synthesiis\nsynthesiis finished in 2nd plac'
>>> myre.search(g[:-1].decode('utf-16-le')).groups()
(u'250555914',)
I had to remove the last byte to make that decode, though: the data has an odd length because the null byte that completes the final character is missing. If data is missing from the end, data is most likely also missing from the start, where the BOM would be located. The BOM tells the decoder which variant of UTF-16 was used to encode (little or big endian); without it, we need to tell Python explicitly to decode the data as little endian.
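To make the failure and the fix concrete, here is a minimal sketch. It assumes Python 3 (where the raw data would be a bytes object rather than a Python 2 str), and the sample text and the helper name decode_utf16le_lenient are made up for illustration; they are not part of the original data or of any library.

```python
import re

# Hypothetical sample: BOM-less UTF-16-LE bytes whose final null byte was cut off,
# mimicking the truncated data in the question.
text = "Heads-Up Sit & Go (250555914) 2-7 Triple Draw"
raw = text.encode("utf-16-le")[:-1]

def decode_utf16le_lenient(data):
    """Decode BOM-less UTF-16-LE, dropping a dangling trailing byte if present."""
    if len(data) % 2:          # odd length: the last code unit is incomplete
        data = data[:-1]
    return data.decode("utf-16-le")

# In the raw bytes every character is followed by a null byte, so "(" is never
# directly followed by a digit and the pattern cannot match.
raw_match = re.search(rb"\(([0-9]+)\)", raw)

# After decoding, the same pattern finds the tournament number.
decoded = decode_utf16le_lenient(raw)
decoded_match = re.search(r"\(([0-9]+)\)", decoded)
```

Here raw_match is None, while decoded_match.group(1) is the tournament number. Trimming the odd byte loses the last character, just as in the transcript above.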
If you decode the full data, including the BOM, you could just use .decode('utf-16') instead.
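As a rough illustration of the BOM case, the sketch below (Python 3 assumed, with made-up sample text) prepends the BOM constants from the codecs module to data in each byte order and decodes both with the generic codec.

```python
import codecs

# Hypothetical sample text, not taken from the original file.
text = "Buy-In Chips: 1500"

# Build complete UTF-16 data for each byte order, BOM included.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic 'utf-16' codec reads the BOM, picks the byte order from it,
# and leaves the BOM out of the decoded result.
from_le = le_with_bom.decode("utf-16")
from_be = be_with_bom.decode("utf-16")
```

Both from_le and from_be come back equal to the original text, so the caller does not need to know the byte order in advance.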
If you are reading this from a file, use codecs.open() instead and have Python decode it to Unicode for you:
import codecs
with codecs.open('filename.txt', 'r', encoding='utf-16') as f:
    for line in f:
        pass  # handle the decoded (Unicode) line here
Otherwise, methods like .readlines() split on newlines at the byte level, but in UTF-16 a newline is encoded as two bytes just like every other character, so the split lands in the middle of one.
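The difference can be seen with a small experiment. This sketch assumes Python 3, where the built-in open() accepts an encoding argument and plays the role of codecs.open(). The file content and the temporary path are invented for the demonstration.

```python
import codecs
import os
import tempfile

# Write a small little-endian UTF-16 file with a BOM (hypothetical content).
sample = "Buy-In Chips: 1500\n2 Entries\n"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

# Text mode with an explicit encoding: lines come back as decoded text.
# newline="" keeps the line endings exactly as stored.
with open(path, "r", encoding="utf-16", newline="") as f:
    text_lines = f.readlines()

# Binary mode: the split happens right after the b"\n" byte, so the null byte
# that belongs to the newline ends up at the start of the next chunk.
with open(path, "rb") as f:
    byte_lines = f.readlines()

os.remove(path)
```

The text-mode read yields the two expected lines, while the binary read yields three chunks, the second and third of which start with a stray null byte. Those misaligned chunks no longer decode cleanly one at a time.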
2 Comments
You are calling readlines() on a UTF-16-encoded file! Newlines are encoded as two bytes too, so you are splitting the file in the middle of them. Use codecs.open() and read Unicode data.