I read a file with the code below and then want to find the words in it using the re library. The file contains Turkish characters, so I decode it as UTF-8. The re library does not seem to recognize Turkish characters, and the code below isn't working.
text= unicodedata.normalize("NFKD",codecs.open(os.path.abspath("texts/kopru1.txt"),"rb").read().decode("utf-8"))
text=text.replace("\r\n"," ").lower()
aa= re.findall(ur"[a-zçşıöü]+", text,re.UNICODE)
Although "ayşe" is a single word, it is matched as two separate pieces, "ays" and "e".
Junuxx
1 Answer
Use the escape sequence \w, which with the re.UNICODE flag matches any Unicode word character (letters of any alphabet, digits and the underscore). Taking an example sentence from Wikipedia:
>>> text = u'Türkî-i çin (güzel güneş) terkiplerinde de gördüğümüz'
>>> re.findall(r'\w+', text, re.UNICODE)
['Türkî', 'i', 'çin', 'güzel', 'güneş', 'terkiplerinde', 'de', 'gördüğümüz']
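One likely reason the word still splits, judging only from the code in the question: the text is normalized with NFKD, which decomposes "ş" into a plain "s" followed by a combining cedilla (U+0327). A combining mark is not a letter, so neither an explicit character class nor \w matches it. The following is a minimal sketch in Python 3 syntax (the question uses Python 2); the sample word and variable names are chosen here for illustration only.

```python
import re
import unicodedata

word = "ay\u015fe"  # "ayşe", written with an escape so the source encoding does not matter

# NFKD splits U+015F into "s" + U+0327 (combining cedilla): 4 characters become 5.
decomposed = unicodedata.normalize("NFKD", word)
print([hex(ord(c)) for c in decomposed])

# The combining mark is not a word character, so the match breaks around it.
print(re.findall(r"\w+", decomposed))  # ['ays', 'e']

# NFC recomposes the pair into the single code point, and the word matches whole.
composed = unicodedata.normalize("NFC", decomposed)
print(re.findall(r"\w+", composed))  # ['ayşe']
```

If the decomposed form is not needed for anything else, dropping the normalize call or switching it to NFC should let \w+ match whole words.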
answered Jun 11, 2013 at 17:03
kqr
7 Comments
hinzir
I had tried that before, and I tried it again now, but the code still isn't working.
kqr
@hinzir what does your text variable look like before you try to match on it?
kqr
@hinzir Oh, right. I've updated my reply with additional hints. See if it helps.
hinzir
@kqr you are wrong about the replace method. docs.python.org/2/library/string.html#string.replace "Return a copy of string s with all occurrences of substring old replaced by new."
kqr
@hinzir That's really, really weird. When I tested yesterday I could swear it wouldn't return a new string, but now that I try it does.
text.split(" ")?