I have a text in Polish from which I want to filter out non-Polish letters, but the problem is that the Polish-specific letters disappear as well:
# coding: utf-8
import re
_NOT_LETTERS = re.compile('[^a-ząćęłóńśżź]+')
text = u'dzień dobry i wszystkiego najlepszego życzę'
data = _NOT_LETTERS.sub(' ', text)
print data
and the result is
dzie dobry i wszystkiego najlepszego ycz
instead of expected
dzień dobry i wszystkiego najlepszego życzę
How can I fix this? I receive the variable text from a third-party library.
1 Answer
Accented letters are not in the ASCII range and need several bytes when encoded in UTF-8; for example, the character:
U+0144 ń LATIN SMALL LETTER N WITH ACUTE
is encoded on two bytes: c5 84
When you write a string without marking it as a Unicode string, each single byte is treated as a separate character (you get the byte \xc5 and the byte \x84, but the character ń (U+0144) is never recognized).
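You can see those two bytes directly in a Python 2.7 shell (u'\u0144' is just an unambiguous way of writing ń):
>>> u'\u0144'.encode('utf-8')
'\xc5\x84'
>>> len(u'\u0144'), len(u'\u0144'.encode('utf-8'))
(1, 2)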
In Python 2.7 you need to declare your string as a unicode string (with the u prefix), otherwise every multibyte character is stored as several single bytes. You can test it yourself by writing:
>>> text = u'dzień'
>>> [c for c in text]
[u'd', u'z', u'i', u'e', u'\u0144']
>>> text = 'dzień'
>>> [c for c in text]
['d', 'z', 'i', 'e', '\xc5', '\x84']
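Incidentally, if the third-party library gives you the text as a byte string rather than a unicode string, you can decode it first so that you always work with unicode (this assumes the bytes really are UTF-8 encoded):
>>> 'dzie\xc5\x84'.decode('utf-8')
u'dzie\u0144'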
The Polish characters are not matched because your pattern isn't a unicode string like your subject string. You need to write:
re.compile(u'[^a-ząćęłóńśżź]+')
otherwise the multibyte characters in the pattern are seen as separate bytes (i.e. one byte, one character).
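Putting it together, a minimal corrected version of your script would look like this (a sketch, assuming text really arrives as a unicode string; if it arrives as UTF-8 bytes, decode it first as shown above):
# coding: utf-8
import re

# unicode pattern: the Polish letters are now single characters, not byte pairs
_NOT_LETTERS = re.compile(u'[^a-ząćęłóńśżź]+')

text = u'dzień dobry i wszystkiego najlepszego życzę'
data = _NOT_LETTERS.sub(u' ', text)
print data  # dzień dobry i wszystkiego najlepszego życzę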