Python: Encoding issues?

Question 1

In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get "Schilde rung", and when I check its length with Python, I get 13 (instead of 12...).

What's the problem here, and how can I handle this?

Thanks a lot for any help!

EDIT: At the moment I use output.write(text.decode("utf-8")). This correctly handles all umlauts and other special characters, but the problem above is still present. print(repr(txt)) gives 'Schilde\xc2\xadrung'. How can I solve this? Thanks a lot!

Question 2

Show us the output of print(repr(the_word)).

Question 3

Is there an umlaut or some other special character in the string?

Question 4

Yes, there are umlauts and other special characters in the string. So I need to fix the "Schilde rung" problem (which the printable or encode solutions do), but I also need to keep the umlauts and other special characters that are represented correctly.

Question 5

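There is a U+00AD SOFT HYPHEN before the "r" in the string (Python 2 session; the invisible character is written out here as its UTF-8 bytes):

>>> "Schilde\xc2\xadrung".decode('utf-8')
u'Schilde\xadrung'

To remove all non-ASCII characters (this also drops umlauts):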

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
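
To see which invisible character is hiding in a word, it can help to list every non-ASCII character with its code point and Unicode name. The following is a minimal Python 3 sketch, not part of the original answer; the helper name describe_non_ascii is made up for illustration.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, name, category) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# The word from the question, with the soft hyphen written as an escape.
word = "Schilde\u00adrung"

print(len(word))  # 12: eleven letters plus the soft hyphen
for row in describe_non_ascii(word):
    print(row)    # (7, 'U+00AD', 'SOFT HYPHEN', 'Cf')
```

The category 'Cf' marks a format character, which is why nothing visible shows up in the rendered HTML.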

Question 6

MarkF6 (2013-09-06T10:18:50Z): Yeah, that's exactly the problem. Is there a way to check each word for this? When I apply this to all words, I get the error: "'ascii' codec can't decode byte 0xc3 in position 1162: ordinal not in range(128)"

Question 7

jfs (2013-09-06T10:28:38Z): @MarkF6: the error means that you are trying to encode bytes (which you should not do) instead of a Unicode string. If your input is a byte string that contains UTF-8 text, you could call .decode to get a Unicode string that has only ASCII characters: b"Schilde\xc2\xadrung".decode('ascii', 'ignore') -> u'Schilderung'
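
For comparison, here is a rough Python 3 sketch of the two decodes discussed in this thread. The sample text is invented for illustration. In Python 2, calling .encode on a byte string first decodes it implicitly as ASCII, which is where the "'ascii' codec can't decode" message comes from; in Python 3, bytes has no .encode method at all.

```python
# Python 3 sketch; the sample text is made up for illustration.
raw = "Schilde\u00adrung, Änderung".encode("utf-8")  # UTF-8 bytes, e.g. as read from a file

# bytes objects have no .encode() in Python 3, so the Python 2 mistake
# (encoding something that is already bytes) cannot happen silently.
assert not hasattr(raw, "encode")

# Decoding as ASCII and ignoring errors drops every non-ASCII byte:
# the soft hyphen disappears, but so does the umlaut.
ascii_only = raw.decode("ascii", "ignore")
print(repr(ascii_only))  # 'Schilderung, nderung'

# Decoding as UTF-8 keeps everything, including the soft hyphen.
unicode_text = raw.decode("utf-8")
print("\u00ad" in unicode_text)  # True
```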

Question 8

MarkF6 (2013-09-06T10:43:31Z): OK, I did this, but what about umlauts? If I do this, I lose all of them (ä ö ü).

Question 9

jfs (2013-09-06T10:52:43Z): @MarkF6: If you don't mind non-ASCII characters, then just convert your input bytes to Unicode, e.g. using .decode('utf-8') as shown in the very first line of the answer, and stop there. If that is not enough, update your question to specify filtering rules, i.e. which categories of characters to remove (blacklist), which to preserve (whitelist), etc. There are no universal rules; you need to choose what is appropriate in your case.
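
One possible filtering rule, sketched in Python 3: drop characters in the Unicode category "Cf" (format characters, which includes U+00AD) and keep everything else, so umlauts survive. The function name strip_format_chars and the choice of "Cf" as the blacklist are assumptions for illustration, not something the answer prescribes. "Cf" also covers characters such as zero-width joiners, which may matter for other scripts.

```python
import unicodedata

def strip_format_chars(text, blocked_categories=("Cf",)):
    """Remove characters whose Unicode general category is blacklisted."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in blocked_categories
    )

# Made-up sample: a soft hyphen inside the first word, umlauts in the second.
sample = "Schilde\u00adrung, Übergröße"
cleaned = strip_format_chars(sample)

print(cleaned)                         # Schilderung, Übergröße
print(len(sample) - len(cleaned))      # 1: only the soft hyphen was removed
```

If only the soft hyphen should go, text.replace("\u00ad", "") is a narrower alternative.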

Question 10

MarkF6 (2013-09-06T11:32:27Z): I added some specifications as a comment and as an EDIT to the initial post. Thanks a lot for the help!

Question 11

It seems there is a non-ASCII character just before the "r":

>>> u'Schilderung'  # as pasted; an invisible character sits before the "r"
u'Schilde\xadrung'

Accepted Answer (the answer above beginning "There is U+00AD SOFT HYPHEN") · jfs · 2013-09-06 10:01:29Z


