I have a simple function for tokenizing words.
import re

def tokenize(string):
    return re.split("(\W+)(?<!')", string, re.UNICODE)
In Python 2.7 it behaves like this:
In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']
In Python 3.5.0 I get this:
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
The problem is that 'é' should not be treated as a character to split on. I thought that re.UNICODE would be enough to make \W work the way I intend.
How can I get the same behaviour as Python 3.x in Python 2.x?
asked Nov 1, 2015 at 14:54
Angelo
1 Answer
You'll want to use Unicode strings, but also note that the third positional parameter of re.split is not flags but maxsplit:
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
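To see the consequence (a minimal sketch, not from the answer's code): re.UNICODE is just an integer constant, so passed positionally it fills the maxsplit slot and flags stays at its default of 0.

#!coding:utf8
from __future__ import print_function
import re

print(int(re.UNICODE))  # 32: the flag is only an integer constant
# Passed positionally, that 32 becomes maxsplit and flags stays 0,
# so on Python 2 the é is still treated as \W:
print(re.split(r"(\W+)", u"perché.", re.UNICODE))
# With the keyword, the flag takes effect and é stays in the word
# (on Python 3, str patterns are Unicode-aware either way):
print(re.split(r"(\W+)", u"perché.", flags=re.UNICODE))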
Example:
#!coding:utf8
from __future__ import print_function
import re

def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))
Output:
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']
C:\>py -3 test.py
['perché', '.', '']
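Another way to sidestep the positional trap entirely (a sketch, not part of the original answer): compile the pattern once with the flag attached, so split() never needs a flags argument at all.

#!coding:utf8
from __future__ import print_function
import re

# Sketch: binding re.UNICODE at compile time means it can never be
# mistaken for maxsplit later.
TOKEN_RE = re.compile(r"(\W+)(?<!')", re.UNICODE)

def tokenize(string):
    return TOKEN_RE.split(string)

print(tokenize(u'perché.'))  # [u'perch\xe9', u'.', u''] on 2.7; ['perché', '.', ''] on 3.x

Compiling also lets the lookbehind and flag travel with the pattern wherever tokenize is reused.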
answered Nov 1, 2015 at 18:05
Mark Tolonen
Comments
u'perché.' in 2.7?