UTF in Python Regex

Asked 17 years ago

I'm aware that Python 3 fixes a lot of UTF issues. However, I'm not able to use Python 3; I'm using 2.5.1.

I'm trying to regex a document, but the document has UTF hyphens in it (– rather than -). Python can't match these, and if I put them in the regex it throws a wobbly.

How can I force Python to use a UTF string or in some way match a character such as that?

Thanks for your help

edited Dec 16, 2008 at 17:50

Greg

asked Dec 16, 2008 at 17:49

Teifion

I deleted the duplicates. Not sure why my browser submitted 3 at once...

– Teifion
Commented Dec 16, 2008 at 17:50
I find that Python had very good (not perfect) support for strings in various encodings (e.g. UTF-8, which is an encoding) as well as for Unicode strings (Unicode is not an encoding) long before Python 3. So don't blame the language; just ask a question if you don't know how to do something.

– jfs
Commented Dec 16, 2008 at 18:16
I wanted to pre-empt someone telling me about Python 3 or asking if I was using it. Python 2.5 is still a wonderful language, and I prefer it over PHP.

– Teifion
Commented Dec 16, 2008 at 18:32

3 Answers

You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.

So, for example, this:

re.compile("–")

becomes this:

re.compile(u"\u2013")
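As a rough sketch of how that looks inside a larger pattern: the sample text and the names `pattern`, `text` and `ranges` below are made up for illustration and are not from the question. Both the pattern and the searched string are Unicode strings, so the en dash (U+2013) is treated as one character and can sit in a character class next to the ASCII hyphen.

```python
import re

# Pattern and subject are both Unicode strings. \u2013 is EN DASH.
# The class [\u2013-] accepts either an en dash or an ASCII hyphen.
pattern = re.compile(u"(\\d+)\\s*[\u2013-]\\s*(\\d+)")

# Hypothetical sample text containing one en dash and one plain hyphen.
text = u"pages 10 \u2013 12 and 14-16"

# Each match yields the two numbers on either side of the dash.
ranges = pattern.findall(text)
```

The backslashes for `\d` and `\s` are doubled because this is not a raw string; the `\u2013` escape is left single so that Python turns it into the actual character.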

edited Dec 16, 2008 at 19:06

answered Dec 16, 2008 at 18:01

Patrick McElhaney

2 Comments

Teifion

I was putting an r before the string to make it a raw string.

Commented Dec 16, 2008 at 18:31

rlafuente

You can also add 'ur' before the string so that it's raw and Unicode.

Commented May 25, 2011 at 15:57

After a quick test and a visit to PEP 263 (Defining Python Source Code Encodings), I see you may need to tell Python that the whole file is UTF-8 encoded by adding a comment like this to the first line.

# encoding: utf-8

Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6

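# encoding: utf-8
import re
# Both literals are byte strings here, so matching is done on UTF-8 bytes.
x = re.compile("–")
print x.search("xxx–x").start()  # prints 3 (Python 2 print statement)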
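One thing to keep in mind: the test file above works on encoded byte strings, so `start()` counts bytes rather than characters. The sketch below is not part of that test; the sample text and the names `raw`, `text`, `byte_offset` and `char_offset` are assumptions chosen for illustration. It shows the two offsets diverging as soon as another non-ASCII character comes before the dash, and how decoding the data first gives character positions.

```python
import re

# Stand-in for bytes read from a UTF-8 encoded document (hypothetical data).
# \u00e9 is "e with acute accent", which takes two bytes in UTF-8.
raw = u"caf\u00e9 \u2013 x".encode("utf-8")

# Searching the raw bytes gives a position counted in bytes.
byte_offset = raw.find(u"\u2013".encode("utf-8"))

# Decoding first and matching with a Unicode pattern gives a position
# counted in characters.
text = raw.decode("utf-8")
char_offset = re.compile(u"\u2013").search(text).start()
```

Here `byte_offset` is 6 and `char_offset` is 5, because the accented letter occupies two bytes but is a single character.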

edited Dec 16, 2008 at 18:25

answered Dec 16, 2008 at 18:20

Patrick McElhaney

Don't use UTF-8 encoded byte strings in a regular expression. UTF-8 is a multibyte encoding in which some Unicode code points are encoded as 2 or more bytes, so you may match parts of your string that you didn't plan to match. Instead, use Unicode strings as suggested.
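To make the risk concrete, here is a small sketch (the variable names are mine, and the em dash is just a convenient second character to test against). A character class built from UTF-8 bytes holds the three bytes of the en dash individually, so it also matches the first byte of an em dash, which is a different character. The Unicode version of the same class does not.

```python
import re

en_dash = u"\u2013"  # EN DASH, encoded in UTF-8 as the bytes E2 80 93
em_dash = u"\u2014"  # EM DASH, encoded in UTF-8 as the bytes E2 80 94

# Byte-level pattern: the class contains three separate bytes,
# not one character.
byte_pattern = re.compile(u"[\u2013]".encode("utf-8"))
byte_hit = byte_pattern.search(em_dash.encode("utf-8"))

# Unicode pattern: the class contains exactly one character.
uni_pattern = re.compile(u"[\u2013]")
uni_hit = uni_pattern.search(em_dash)
```

`byte_hit` is a match object (a false positive on the shared leading byte), while `uni_hit` is `None`.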

answered Dec 16, 2008 at 18:27

unbeknown
