In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get the following: "Schilde rung", and when I check its length with Python, I get 13 (instead of 11).
What's the problem here, and how can I handle this?
Thanks a lot for any help!
EDIT:
At the moment, I use the following: output.write(text.decode("utf-8"))
This correctly handles all umlauts and other special characters, but the above problem is still present. print(repr(txt)) gives: 'Schilde\xc2\xadrung'
How can we solve this problem? Thanks a lot!
2 Answers
There is U+00AD SOFT HYPHEN before r in the string:
>>> "Schilde\xc2\xadrung".decode('utf-8')
u'Schilde\xadrung'
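The lengths reported in the question can be reproduced with a short sketch. It is written for Python 3 (the session above is Python 2), and the names `raw` and `text` are only illustrative.

```python
# The UTF-8 bytes exactly as repr() shows them in the question.
raw = b"Schilde\xc2\xadrung"

# Decoding turns the two bytes 0xC2 0xAD into one code point, U+00AD.
text = raw.decode("utf-8")

print(len(raw))                          # 13: counts bytes
print(len(text))                         # 12: counts code points, soft hyphen included
print(len(text.replace("\u00ad", "")))   # 11: visible letters only
```

So 13 is the byte length of the undecoded string, and the soft hyphen still counts as one character after decoding.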
To remove all non-ASCII characters:
>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
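Note that encoding to ASCII with 'ignore' also drops umlauts, which the question wants to keep. A narrower alternative, sketched here for Python 3, removes only the soft hyphen, or optionally every character in Unicode category Cf (format characters). The function names are hypothetical, and whether removing all Cf characters is appropriate depends on your text.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text):
    """Remove only U+00AD and leave every other character untouched."""
    return text.replace(SOFT_HYPHEN, "")

def strip_format_chars(text):
    """Remove all characters in category Cf (soft hyphen, zero-width space, ...)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

word = "Schilde\u00adrung"
print(strip_soft_hyphens(word), len(strip_soft_hyphens(word)))
print(strip_soft_hyphens("Gr\u00fc\u00dfe"))  # umlaut and sharp s are preserved
```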
5 Comments
.decode to get a Unicode string that has only ASCII characters: b"Schilde\xc2\xadrung".decode('ascii', 'ignore') -> u'Schilderung'
.decode('utf-8') as shown in the very first line of the answer, and stop at that. If that is not enough, update your question to specify filtering rules, i.e., which categories of characters to remove (blacklist), which to preserve (whitelist), etc. There are no universal rules; you need to choose what is appropriate in your case.
Seems like there is a non-ASCII character before the "r":
>>> u'Schilderung'  # pasted from the page; an invisible U+00AD sits before the "r"
u'Schilde\xadrung'
print(repr(the_word))
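repr() shows the escape sequence but not what the character is. As a rough Python 3 sketch, the standard unicodedata module can name each non-ASCII character; `describe_non_ascii` is a hypothetical helper, not something from the answers here.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

print(describe_non_ascii("Schilde\u00adrung"))
# [(7, 'U+00AD', 'SOFT HYPHEN')]
```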