I have a simple function for tokenizing words.

import re
def tokenize(string):
 return re.split("(\W+)(?<!')",string,re.UNICODE)

In Python 2.7 it behaves like this:

In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']

In Python 3.5.0 I get this:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']

The problem is that 'é' should not be treated as a separator when tokenizing. I thought that re.UNICODE would be enough to make \W work the way I intend.

How can I get the Python 3.x behaviour in Python 2.x?

asked Nov 1, 2015 at 14:54
  • Can you try u'perché.' in 2.7? Commented Nov 1, 2015 at 17:27
  • tokenize(u'perché.') -> Out[14]: [u'perch', u'\xe9.']. Same thing as before. Commented Nov 1, 2015 at 18:03

1 Answer

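You'll want to use Unicode strings. Also note that the third positional parameter of re.split is maxsplit, not flags, so re.UNICODE has to be passed by keyword (flags=re.UNICODE). The signature shows this: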

>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
 Split the source string by the occurrences of the pattern,
 returning a list containing the resulting substrings. If
 capturing parentheses are used in pattern, then the text of all
 groups in the pattern are also returned as part of the resulting
 list. If maxsplit is nonzero, at most maxsplit splits occur,
 and the remainder of the string is returned as the final element
 of the list.
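To make the consequence concrete, here is a small sketch of what happens when the flag lands in the maxsplit slot. The sample text and variable names are made up for illustration. re.UNICODE has the integer value 32, so the positional call in the question is treated as "split at most 32 times" and the flag itself is never applied.

```python
import re

# re.UNICODE is an integer flag; its numeric value is 32.
flag_value = int(re.UNICODE)

# Illustrative input: 40 words separated by single spaces (39 separators).
text = " ".join("w%d" % i for i in range(40))

# Same effect as the positional call re.split(pattern, text, re.UNICODE):
# the value ends up in maxsplit, so splitting stops after 32 splits.
limited = re.split(r"\W+", text, maxsplit=flag_value)

# Passing the flag by keyword leaves maxsplit at its default of 0 (no limit).
full = re.split(r"\W+", text, flags=re.UNICODE)

print(len(limited))  # 33: 32 split-off words plus the unsplit remainder
print(len(full))     # 40
```

On short inputs such as 'perché.' the accidental limit of 32 is never reached, which is why the mistake is easy to miss.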

Example:

#!coding:utf8
from __future__ import print_function
import re
def tokenize(string):
 return re.split(r"(\W+)(?<!')",string,flags=re.UNICODE)
print(tokenize(u'perché.'))

Output:

C:\>py -2 test.py
[u'perch\xe9', u'.', u'']
C:\>py -3 test.py
['perché', '.', '']
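If the input may arrive as a UTF-8 byte string, as in the question's first Python 2.7 call, one possible approach is to decode it before splitting. The following is only a sketch: tokenize_text, TOKEN_PATTERN and the default encoding are assumptions introduced here, not part of the original code.

```python
import re

# Same pattern as above, compiled once with the UNICODE flag.
TOKEN_PATTERN = re.compile(r"(\W+)(?<!')", flags=re.UNICODE)


def tokenize_text(value, encoding="utf-8"):
    """Hypothetical helper: accept bytes or text, always split on text."""
    # In Python 2, bytes is an alias for str, so this catches byte strings
    # there as well; the UTF-8 default is an assumption about the input.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return TOKEN_PATTERN.split(value)


# u"perch\xe9." is 'perché.' written with an escape to keep the file ASCII.
print(tokenize_text(u"perch\xe9.".encode("utf-8")))
print(tokenize_text(u"perch\xe9."))
```

Both calls should give the same three-element list as the output shown above, because the byte string is decoded to text before \W is applied.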
answered Nov 1, 2015 at 18:05