I'm aware that Python 3 fixes a lot of UTF issues. However, I'm not able to use Python 3; I'm using 2.5.1.
I'm trying to run a regex over a document, but the document has UTF hyphens in it (– rather than -). Python can't match these, and if I put them in the regex it throws a wobbly.
How can I force Python to use a UTF string or in some way match a character such as that?
Thanks for your help
-
I deleted the duplicates. Not sure why my browser submitted 3 at once... – Teifion, Dec 16, 2008 at 17:50
-
I find that Python had very good (not perfect) support for strings in various encodings (e.g. UTF-8, which is an encoding) as well as Unicode strings (Unicode is not an encoding) long before Python 3, so don't blame the language; just ask a question if you don't know how to do ... – jfs, Dec 16, 2008 at 18:16
-
I wanted to pre-empt someone telling me about Python 3 or asking if I was using it. Python 2.5 is still a wonderful language and I prefer it over PHP. – Teifion, Dec 16, 2008 at 18:32
3 Answers
You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.
So, for example, this:
re.compile("–")
becomes this:
re.compile(u"\u2013")
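If the document arrives as raw bytes (for example, read from a file), it also needs decoding before the unicode pattern can match it. Here is a minimal sketch of that, assuming the input is UTF-8; the sample text and the names raw, text and dash_range are made up for illustration. It is written so the source stays pure ASCII and needs no encoding declaration.

```python
import re

# Hypothetical sample: the bytes a UTF-8 file would hold for this text.
raw = u"pages 10\u201320 and 30-40".encode("utf-8")

# Decode to a unicode string before matching (assumes the input is UTF-8).
text = raw.decode("utf-8")

# Match two numbers joined by either an en dash (U+2013) or an ASCII hyphen.
# The hyphen sits last in the character class so it is taken literally.
dash_range = re.compile(u"(\\d+)[\u2013-](\\d+)")

pairs = dash_range.findall(text)
print(pairs)
```

Both the pattern and the text are unicode strings here, so the en dash is compared as a single character rather than as three separate bytes.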
After a quick test and a visit to PEP 0263: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding a comment like this on the first or second line of the file.
# encoding: utf-8
Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6
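# encoding: utf-8
import re
# Both literals below are UTF-8 byte strings, so this matches byte for byte.
# Prefix them with u to match on unicode characters instead.
x = re.compile("–")
print x.search("xxx–x").start()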
Comments
Don't use UTF-8 in a regular expression. UTF-8 is a multibyte encoding where some unicode code points are encoded by 2 or more bytes. You may match parts of your string that you didn't plan to match. Instead use unicode strings as suggested.
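To illustrate the kind of accidental match described above, here is a small sketch (not from the original answer; the variable names are only for illustration). It builds the same character class twice, once as UTF-8 bytes and once as a unicode string, and searches an em dash (U+2014), which shares its first two UTF-8 bytes with the en dash.

```python
import re

en_dash = u"\u2013"   # UTF-8 bytes: E2 80 93
em_dash = u"\u2014"   # UTF-8 bytes: E2 80 94

# Character class built from UTF-8 bytes: it means "any one of the bytes
# E2, 80 or 93", not "the en dash character".
byte_class = re.compile((u"[" + en_dash + u"]").encode("utf-8"))
byte_hits = byte_class.findall(em_dash.encode("utf-8"))

# The same class as a unicode pattern only matches the en dash itself.
unicode_class = re.compile(u"[" + en_dash + u"]")
unicode_hits = unicode_class.findall(em_dash)

print(len(byte_hits))     # the byte pattern matched pieces of the em dash
print(len(unicode_hits))  # the unicode pattern matched nothing
```

The byte version finds the two bytes the em dash has in common with the en dash, even though the en dash never appears in the input.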