python unicode regular expressions

Question 1

I read a file with the code below and then want to find the words in it using the re library. The file contains Turkish characters, so I decode it as UTF-8. The re library doesn't seem to recognize the Turkish characters, and the code below doesn't work.

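    import codecs, os, re, unicodedata

    # Python 2 (the ur"" literal prefix is not valid in Python 3)
    text = unicodedata.normalize("NFKD", codecs.open(os.path.abspath("texts/kopru1.txt"), "rb").read().decode("utf-8"))
    text = text.replace("\r\n", " ").lower()
    aa = re.findall(ur"[a-zçşıöü]+", text, re.UNICODE)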

Although "ayşe" is a single word, it comes back as two separate matches, "ays" and "e".

Question 2

Could you give some example data and tell us what you want to do?

Question 3

An example string is "ayşe kulin köprü". I want to find the words in this string.

Question 4

If you want to split by word why not use text.split(" ")?

kqr · Accepted Answer · Score: 5 · 2013-06-11 17:03:48Z

Use the escape sequence \w, which with the re.UNICODE flag matches any Unicode word character (letters of any alphabet, plus digits and the underscore). Taking an example sentence from Wikipedia:

>>> text = u'Türkî-i çin (güzel güneş) terkiplerinde de gördüğümüz'
>>> re.findall(r'\w+', text, re.UNICODE)
['Türkî', 'i', 'çin', 'güzel', 'güneş', 'terkiplerinde', 'de', 'gördüğümüz']
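
For readers on Python 3, the same idea can be sketched as follows. This is an assumption on my part rather than something stated in the thread, which uses Python 2: in Python 3, string patterns match Unicode by default, so the re.UNICODE flag and the u'' prefix are not needed.

```python
import re

# Assumption: Python 3, where str patterns are Unicode-aware by default.
# The sample sentence is the one used in the answer above.
text = 'Türkî-i çin (güzel güneş) terkiplerinde de gördüğümüz'

# \w+ matches runs of word characters: letters, digits and the underscore.
words = re.findall(r'\w+', text)
print(words)
```

Because \w also accepts digits and underscores, a stricter pattern would be needed if tokens such as "abc_123" should not count as words.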

edited Jun 12, 2013 at 8:11

answered Jun 11, 2013 at 17:03

7 Comments

hinzir · 2013-06-11 17:23:31Z

I had already done that, and I tried it again, but the code still isn't working.

kqr · 2013-06-11 17:25:53Z

@hinzir what does your text variable look like before you try to match on it?

kqr · 2013-06-11 19:25:03Z

@hinzir Oh, right. I've updated my reply with additional hints. See if it helps.

hinzir · 2013-06-12 06:48:10Z

@kqr You are wrong about the replace method. See docs.python.org/2/library/string.html#string.replace: "Return a copy of string s with all occurrences of substring old replaced by new."

kqr · 2013-06-12 08:10:57Z

@hinzir That's really, really weird. When I tested yesterday I could swear it wouldn't return a new string, but now that I try it does.
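
The comments leave the "ays" / "e" symptom unexplained. One likely cause, offered as a hedged sketch and not as part of kqr's answer, is the NFKD normalization in the question's code: NFKD decomposes "ş" into a plain "s" followed by a combining cedilla (U+0327). The combining mark is neither in the question's character class nor a \w character, so the word breaks at that point. The example below assumes Python 3 and uses the sample string from the question; find_words is a hypothetical helper introduced only for this illustration.

```python
import re
import unicodedata

def find_words(text, form):
    """Normalize text to the given Unicode form, lowercase it, and
    return the runs of word characters (hypothetical helper)."""
    normalized = unicodedata.normalize(form, text)
    return re.findall(r"\w+", normalized.lower())

# Assumption: Python 3; the sample string is the one quoted in the question.
sample = "ayşe kulin köprü"

# NFKD splits accented letters into a base letter plus a combining mark,
# and the combining mark does not count as a word character.
decomposed = find_words(sample, "NFKD")

# NFC keeps (or recomposes) the precomposed letters, so words stay whole.
composed = find_words(sample, "NFC")

print(decomposed)
print(composed)
```

With NFKD the words come back in pieces, while with NFC they stay whole. If normalization is wanted at all, NFC looks like the safer choice here. Separately, the character class in the question omits "ğ", so a hand-written class would also need that letter.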
