I have a text in Polish from which I want to filter out non-Polish letters, but the problem is that the Polish-specific letters disappear as well:
# coding: utf-8
import re
_NOT_LETTERS = re.compile('[^a-ząćęłóńśżź]+')
text = u'dzień dobry i wszystkiego najlepszego życzę'
data = _NOT_LETTERS.sub(' ', text)
print data
and the result is
dzie dobry i wszystkiego najlepszego ycz
instead of expected
dzień dobry i wszystkiego najlepszego życzę
How can I fix this? I receive the variable text from a third-party library.
1 Answer
Accented letters are not in the ASCII range and need several bytes when encoded in UTF-8; for example, the character:
U+0144 ń LATIN SMALL LETTER N WITH ACUTE
is encoded on two bytes: c5 84
When you write a string without marking it as a Unicode string, each single byte is treated as a separate character (you get the byte \xc5 and the byte \x84, but the character ń (U+0144) is never recognized).
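You can see those two bytes directly in a Python 2.7 shell (u'\u0144' is just an unambiguous way of writing ń):
>>> u'\u0144'.encode('utf-8')
'\xc5\x84'
>>> len(u'\u0144'), len(u'\u0144'.encode('utf-8'))
(1, 2)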
In Python 2.7 you need to declare your string as a unicode string (with the u prefix), otherwise every multibyte character is stored as several single bytes. You can test it yourself by writing:
>>> text = u'dzień'
>>> [c for c in text]
[u'd', u'z', u'i', u'e', u'\u0144']
>>> text = 'dzień'
>>> [c for c in text]
['d', 'z', 'i', 'e', '\xc5', '\x84']
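Incidentally, if the third-party library gives you the text as a byte string rather than a unicode string, you can decode it first so that you always work with unicode (this assumes the bytes really are UTF-8 encoded):
>>> 'dzie\xc5\x84'.decode('utf-8')
u'dzie\u0144'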
The Polish characters are not matched because your pattern isn't a unicode string like your subject string. You need to write:
re.compile(u'[^a-ząćęłóńśżź]+')
otherwise the multibyte characters in the pattern are seen as separate bytes (i.e. one byte, one character).
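Putting it together, a minimal corrected version of your script would look like this (a sketch, assuming text really arrives as a unicode string; if it arrives as UTF-8 bytes, decode it first as shown above):
# coding: utf-8
import re

# unicode pattern: the Polish letters are now single characters, not byte pairs
_NOT_LETTERS = re.compile(u'[^a-ząćęłóńśżź]+')

text = u'dzień dobry i wszystkiego najlepszego życzę'
data = _NOT_LETTERS.sub(u' ', text)
print data  # dzień dobry i wszystkiego najlepszego życzę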