I have a simple function for tokenizing words.
import re

def tokenize(string):
    return re.split("(\W+)(?<!')", string, re.UNICODE)
In Python 2.7 it behaves like this:
In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']
In Python 3.5.0 I get this:
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
The problem is that 'é' should not be treated as a character to split on. I thought that re.UNICODE would be enough to make \W work the way I intend.
How can I get the same behaviour as Python 3.x in Python 2.x?
asked Nov 1, 2015 at 14:54
Angelo
1 Answer
You'll want to use Unicode strings, but also note that the third positional parameter of re.split is not flags but maxsplit:
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
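To see the consequence (a minimal sketch, not from the answer's code): re.UNICODE is just an integer constant, so passed positionally it fills the maxsplit slot and flags stays at its default of 0.

#!coding:utf8
from __future__ import print_function
import re

print(int(re.UNICODE))  # 32: the flag is only an integer constant
# Passed positionally, that 32 becomes maxsplit and flags stays 0,
# so on Python 2 the é is still treated as \W:
print(re.split(r"(\W+)", u"perché.", re.UNICODE))
# With the keyword, the flag takes effect and é stays in the word
# (on Python 3, str patterns are Unicode-aware either way):
print(re.split(r"(\W+)", u"perché.", flags=re.UNICODE))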
Example:
#!coding:utf8
from __future__ import print_function
import re

def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))
Output:
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']
C:\>py -3 test.py
['perché', '.', '']
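Another way to sidestep the positional trap entirely (a sketch, not part of the original answer): compile the pattern once with the flag attached, so split() never needs a flags argument at all.

#!coding:utf8
from __future__ import print_function
import re

# Sketch: binding re.UNICODE at compile time means it can never be
# mistaken for maxsplit later.
TOKEN_RE = re.compile(r"(\W+)(?<!')", re.UNICODE)

def tokenize(string):
    return TOKEN_RE.split(string)

print(tokenize(u'perché.'))  # [u'perch\xe9', u'.', u''] on 2.7; ['perché', '.', ''] on 3.x

Compiling also lets the lookbehind and flag travel with the pattern wherever tokenize is reused.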
answered Nov 1, 2015 at 18:05
Mark Tolonen
Comments
u'perché.' in 2.7?