python regex not working on string encoding

Question

I have a string (g) on which I run a simple regex to find a number. Somehow the regex doesn't match on that string (an encoding issue?), although a "normal" string works. What am I missing? The steps below are as seen in the REPL:

(example is a summary of an online poker tournament)

NOT WORKING:

>>> g
'F\x00u\x00l\x00l\x00 \x00T\x00i\x00l\x00t\x00 \x00P\x00o\x00k\x00e\x00r\x00 \x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00S\x00u\x00m\x00m\x00a\x00r\x00y\x00 \x00$\x002\x00.\x002\x005\x00 \x00H\x00e\x00a\x00d\x00s\x00-\x00U\x00p\x00 \x00S\x00i\x00t\x00 \x00&\x00 \x00G\x00o\x00 \x00(\x002\x005\x000\x005\x005\x005\x009\x001\x004\x00)\x00 \x002\x00-\x007\x00 \x00T\x00r\x00i\x00p\x00l\x00e\x00 \x00D\x00r\x00a\x00w\x00 \x00L\x00i\x00m\x00i\x00t\x00 \x00(\x00T\x00u\x00r\x00b\x00o\x00,\x00 \x00H\x00e\x00a\x00d\x00s\x00 \x00U\x00p\x00)\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00:\x00 \x00$\x002\x00.\x001\x002\x00 \x00+\x00 \x00$\x000\x00.\x001\x003\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00 \x00C\x00h\x00i\x00p\x00s\x00:\x00 \x001\x005\x000\x000\x00\n\x002\x00 \x00E\x00n\x00t\x00r\x00i\x00e\x00s\x00\n\x00T\x00o\x00t\x00a\x00l\x00 \x00P\x00r\x00i\x00z\x00e\x00 \x00P\x00o\x00o\x00l\x00:\x00 \x00$\x004\x00.\x002\x004\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00s\x00t\x00a\x00r\x00t\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x000\x000\x00:\x002\x007\x00 \x00E\x00T\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x001\x004\x00:\x003\x000\x00 \x00E\x00T\x00\n\x00\n\x001\x00:\x00 \x00A\x00n\x00d\x00r\x00e\x00y\x003\x003\x001\x000\x00,\x00 \x00$\x004\x00.\x002\x004\x00\n\x002\x00:\x00 \x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00\n\x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00 \x00i\x00n\x00 \x002\x00n\x00d\x00 \x00p\x00l\x00a\x00c\x00e'
>>> myre = re.compile(u"""\(([0-9]+)\)""",re.UNICODE)
>>> m = myre.search(g)
>>> m.groups()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

WORKING

>>> g="Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)"
>>> m = myre.search(g)
>>> m.groups()
('250555914',)

Answer

You have UTF-16 encoded data, albeit without a BOM (Byte Order Mark). Decode to Unicode first before attempting to match the regex:

>>> g[:-1].decode('utf-16-le')
u'Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)\nBuy-In: $2.12 + $0.13\nBuy-In Chips: 1500\n2 Entries\nTotal Prize Pool: $4.24\nTournament started: 2013/03/08 06:00:27 ET\nTournament finished: 2013/03/08 06:14:30 ET\n\n1: Andrey3310, $4.24\n2: synthesiis\nsynthesiis finished in 2nd plac'
>>> myre.search(g[:-1].decode('utf-16-le')).groups()
(u'250555914',)

I had to remove the last byte to make that decode, though: the trailing null byte of the final character was missing, which left an odd number of bytes. If you are missing data from the end, you are most likely also missing data from the start, where the BOM would be located. The BOM tells the decoder which variant of UTF-16 was used to encode the data (little or big endian); without it, we need to tell Python explicitly to decode this as little endian.

If you decode the full data, including the BOM, you could just use .decode('utf-16') instead.
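The points above can be sketched on a short made-up sample rather than the full tournament summary. The variable names are illustrative, and the snippet is written with bytes literals so that it should run on both Python 2.7 and Python 3; treat it as a sketch under those assumptions.

```python
import codecs
import re

# Hypothetical sample: a fragment of the summary line, not the full data.
text = u"Heads-Up Sit & Go (250555914) 2-7 Triple Draw"
raw = text.encode("utf-16-le")          # UTF-16 little endian, no BOM

byte_pattern = re.compile(br"\(([0-9]+)\)")
text_pattern = re.compile(r"\(([0-9]+)\)")

# In the raw bytes every ASCII character is followed by a NUL byte, so "(" is
# never directly followed by a digit and the search comes back empty.
raw_match = byte_pattern.search(raw)

# After decoding, the digits are adjacent again and the pattern matches.
decoded_match = text_pattern.search(raw.decode("utf-16-le"))
print(decoded_match.group(1))

# With a BOM in front, the generic "utf-16" codec can work out the byte order.
with_bom = codecs.BOM_UTF16_LE + raw
from_bom = with_bom.decode("utf-16")

# Data that lost its final NUL byte (odd byte count), as in the question,
# fails to decode until the stray trailing byte is dropped.
damaged = raw[:-1]
try:
    damaged.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
repaired = damaged[:-1].decode("utf-16-le")   # loses the last character
```

Dropping the stray byte costs the final character, which is why the decoded output above ends in "plac".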

If you are reading this from a file, use codecs.open() instead and have Python decode it to Unicode for you:

import codecs

with codecs.open('filename.txt', 'r', encoding='utf-16') as f:
    for line in f:
        pass  # handle line; each line is already decoded to Unicode

Otherwise, methods like .readlines() split lines at the byte level, and in UTF-16 a newline is encoded as two bytes just like every other character, so each split lands in the middle of a character.
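To see what byte-level splitting does, here is a small sketch on a made-up two-line sample. It uses bytes.splitlines(True) as a stand-in for what .readlines() does to a file read as raw bytes; that stand-in, the sample text, and the variable names are assumptions for illustration.

```python
# Hypothetical two-line sample, encoded as UTF-16 little endian without a BOM.
text = u"Buy-In Chips: 1500\n2 Entries\n"
raw = text.encode("utf-16-le")

# Byte-level splitting cuts right after the b"\n" byte, so the NUL byte that
# completes the newline character lands at the start of the next chunk.
chunks = raw.splitlines(True)
first_is_odd = len(chunks[0]) % 2 == 1
second_starts_with_nul = chunks[1].startswith(b"\x00")

# Decoding first and splitting afterwards keeps every character intact.
lines = raw.decode("utf-16-le").splitlines()
print(lines)
```

The first chunk has an odd number of bytes and the second starts with a stray NUL, so neither decodes cleanly on its own.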

Comment

Thank you @Martijn Pieters. I still don't understand why I get this format when I use open(file).readlines() instead of the normal UTF-16 format. Would you be able to help with that point?

Comment

Don't use readlines() on a UTF-16-encoded file! Newlines are encoded as two bytes too, so you are now splitting the file in the middle of a character. Use codecs.open() and read Unicode data.
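A round-trip sketch of reading decoded lines from a file and matching the regex on each one. Note the substitution: it uses io.open (the same function as the built-in open in Python 3) with an encoding argument in place of codecs.open, which should behave equivalently here. The temporary file, sample text, and variable names are hypothetical.

```python
import io
import os
import re
import tempfile

# Hypothetical sample content, not the real tournament file.
sample = u"Heads-Up Sit & Go (250555914)\nBuy-In Chips: 1500\n"

# Write a throwaway UTF-16 file; the "utf-16" codec writes a BOM first.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-16") as f:
    f.write(sample)

pattern = re.compile(r"\(([0-9]+)\)")
found = []

# Opening with an encoding yields decoded text, split on real newlines.
with io.open(path, "r", encoding="utf-16") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            found.append(match.group(1))

os.remove(path)
print(found)
```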

Martijn Pieters · Accepted Answer · 2013-03-11 15:12:27Z

