I have the following String:
Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.
This string contains '\u0019t'. I cannot decode it, because it is already a string. If I encode it first and then decode, it still shows '\u0019t'. How do I get it to show a ' instead?
2 Answers
One option is to literal_eval it:
import ast
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
r = ast.literal_eval(f'"{s}"')
print(r)
Output:
Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil well, they usually end up facing extinction. ♡
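The output above looks as if the escapes simply vanished. A quick way to check what literal_eval actually produced is to print the repr of the result. This is only a small sketch on a shortened sample string (the sample and the variable names are mine, not from the question):

```python
import ast

# Shortened sample with the same escapes as in the question (assumed sample).
s = r"aren\u0019t \u0014 well \u2661"

# Wrap in double quotes so literal_eval parses it as a string literal.
# Note: this breaks if the text itself contains an unescaped double quote.
decoded = ast.literal_eval(f'"{s}"')

# repr() makes non-printable characters visible as \x.. escapes.
print(repr(decoded))

# Collect the code points below U+0020, i.e. the C0 control characters.
controls = [hex(ord(c)) for c in decoded if ord(c) < 0x20]
print(controls)
```

The escapes are decoded, but \u0019 and \u0014 are control characters, so a terminal usually shows nothing for them. That is why the plain print looks like the characters are missing.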
Comment (truncated): ... the code example does not really prove the solution, because all the codes are substituted already before calling literal_eval :)
Somehow, the Unicode escape string is 2000 hex off the mark. The Unicode dash and apostrophe are:
Unicode Character 'EM DASH' (U+2014)
and
Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
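To make the "2000 hex off" observation concrete, here is a small check that adds 0x2000 to the two code points from the question and looks up the resulting character names. The helper name shifted_name is just something I made up for illustration:

```python
import unicodedata

def shifted_name(code_point, offset=0x2000):
    """Return the Unicode name of code_point + offset (hypothetical helper)."""
    return unicodedata.name(chr(code_point + offset))

# The two broken escapes from the question text.
for broken in (0x0019, 0x0014):
    print(f"U+{broken:04X} -> U+{broken + 0x2000:04X} {shifted_name(broken)}")
```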
So let's fix it anyway, even if the error is at the source (THEM), not the destination:
import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'
# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = r''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the text
s += text[off:]
print(s)
prints out
Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.
Note, though, that I've happily ignored any possible presence of \\u00xx in the text (where the backslash itself is escaped); that's something I'll leave for you to solve. Any correct Unicode escapes in the text will, of course, be altered as well.
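One possible way to handle both caveats, as a rough sketch rather than a definitive fix: match an escaped backslash as its own token so it is skipped, and only shift code points that fall in the control range. The function name fix_escapes and the cutoff of 0x20 are my own assumptions; the cutoff presumes that only characters from U+2000 to U+201F were damaged.

```python
import re

# Either an escaped backslash (two backslashes) or a \uHHHH escape.
# Trying the escaped backslash first keeps \\u0019 from being read as \u0019.
TOKEN = re.compile(r'\\\\|\\u([0-9a-fA-F]{4})')

def fix_escapes(text, offset=0x2000, limit=0x20):
    """Decode \\uHHHH escapes, shifting code points below `limit` by `offset`.

    Hypothetical helper: `limit` is an assumed heuristic, not something
    the source data guarantees.
    """
    def repl(match):
        if match.group(1) is None:
            # Escaped backslash: leave it exactly as it was.
            return match.group(0)
        code = int(match.group(1), 16)
        if code < limit:
            # Looks like a control character, assume it is 0x2000 too low.
            code += offset
        return chr(code)
    return TOKEN.sub(repl, text)

print(fix_escapes(r'aren\u0019t \u0014 well \u2661'))
```

With this heuristic a correct escape such as \u2661 passes through unchanged, but a legitimate control character like \u000A would still be shifted (to U+200A), so it only fits data where real control characters are not expected.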