I have the following String:

Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.

This string contains '\u0019t'. I cannot decode it, because it is already a string. If I encode it first and then decode it, it still shows '\u0019t'. How do I get this to show a ' instead?

asked Jan 14, 2020 at 15:26
    U+0019 is not a printable character, so I presume a mistake was made during encoding. Whatever you try, it is unlikely to result in an apostrophe, unless you decode it yourself by explicitly creating a map from the value 0019 hex to the apostrophe. Commented Jan 14, 2020 at 15:39

2 Answers

One option is to literal_eval it:

import ast
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
r = ast.literal_eval(f'"{s}"')
print(r)

Output:

Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil well, they usually end up facing extinction. ♡
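
If the stray control characters are known to stand for specific punctuation, they can be mapped after decoding, as the comment under the question suggests. A minimal sketch, assuming U+0019 was meant to be U+2019 and U+0014 was meant to be U+2014; the `CONTROL_FIXES` table is a guess about the source's intent, not something the data confirms, and the sample string is shortened:

```python
import ast

# Raw string, so the \uXXXX sequences are still literal backslash escapes.
s = r"aren\u0019t sharp-eyed, or just plain evil \u0014 well"

# Step 1: turn the escapes into real characters via literal_eval.
decoded = ast.literal_eval(f'"{s}"')

# Step 2: map the unprintable control characters to the punctuation
# they presumably stood for (assumed mapping, adjust to your data).
CONTROL_FIXES = {0x19: "\u2019", 0x14: "\u2014"}
fixed = decoded.translate(CONTROL_FIXES)
print(fixed)
```

Wrapping the text in double quotes for `literal_eval` breaks if the text itself contains a double quote or a newline, so this only suits input shaped like the sample.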
answered Jan 14, 2020 at 15:31

8 Comments

@bereal Missed some quotes. Try now
Of course, Unicode 19 is not a printable character, so it is still missing in action.
Yeah. I added a heart to the end for that reason... @FelipeFaria The Unicode characters in the OP's string aren't printable.
@Error-SyntacticalRemorse without r the code example does not really prove the solution, because all the codes are substituted already before calling literal_eval :)

Somehow, the Unicode escapes in the string are 2000 hex off the mark. The Unicode dash and apostrophe are:

Unicode Character 'EM DASH' (U+2014)

and

Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
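
The offset can be checked quickly with the standard `unicodedata` module. This is only a small sketch; the loop and the `names` dict are there for illustration:

```python
import unicodedata

# Code points as they appear in the broken text.
names = {}
for wrong in (0x0019, 0x0014):
    shifted = wrong + 0x2000  # apply the suspected offset
    names[wrong] = unicodedata.name(chr(shifted))
    print(f"U+{wrong:04X} -> U+{shifted:04X} {names[wrong]}")
```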

So let's fix it anyway, even though the error is at the source (THEM), not the destination:

import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'
# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = r''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the line
s += text[off:]
print(s)

prints out

Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.

Note, though, that I've happily ignored any possible presence of \\u00xx in the text (where the backslash itself is escaped); that's something I'll leave for you to solve. Any correct Unicode escapes in the text will, of course, be shifted as well.
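
For what it's worth, one way to handle both caveats might look like the sketch below. It rests on two assumptions that the data does not guarantee: an escaped backslash is always written as `\\`, and only escapes below U+0020 (control characters) are broken. The helper name `fix_escapes` is made up for this example.

```python
import re

# Match an escaped backslash pair first, so that "\\u0019" is left alone;
# otherwise match a \uHHHH escape and capture its four hex digits.
PATTERN = re.compile(r'\\\\|\\u([0-9a-fA-F]{4})')

def fix_escapes(text):
    def repl(m):
        if m.group(1) is None:
            return m.group(0)  # escaped backslash pair: keep unchanged
        code = int(m.group(1), 16)
        if code < 0x20:        # assumed: only control-range escapes are off
            code += 0x2000
        return chr(code)
    return PATTERN.sub(repl, text)

print(fix_escapes(r'aren\u0019t evil \u0014 well \u2661 \\u0019'))
```

Escapes that are already valid, such as `\u2661`, are decoded without the shift, and the text after an escaped backslash is left as it was.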

answered Jan 14, 2020 at 16:26

1 Comment

"Conversely, companies that aren't sharp-eyed enough to see that their character encoding routines are lame, tired, or just plain evil - well, they usually end up facing extinction."
