How to decode a unicode character in string?

Question 1

I have the following String:

Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.

This string contains '\u0019t'. I cannot decode it, because it is already a string. If I encode it first and then decode it, it still shows '\u0019t'. How do I get this to show a ' ?

Comment on the question

U+0019 is not a printable character, so I presume a mistake was made during encoding. Whatever you try, it is unlikely to result in an apostrophe, unless you decode it yourself by explicitly creating a map from the value 0019 hex to the apostrophe.
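A minimal sketch of the explicit map mentioned here, assuming the string has already been decoded so that it contains the actual U+0019 control character rather than the six-character escape. The names `CONTROL_TO_PUNCTUATION` and `repair_decoded` are made up for illustration, and the choice of a plain ASCII apostrophe as the target is an assumption.

```python
# Hypothetical map from the stray control character to the intended
# punctuation. The target character is a guess at what the author meant.
CONTROL_TO_PUNCTUATION = {
    0x0019: "'",  # control character U+0019 -> ASCII apostrophe
}


def repair_decoded(text: str) -> str:
    """Replace known-bad control characters in an already decoded string."""
    # str.translate accepts a dict that maps code points to replacement strings.
    return text.translate(CONTROL_TO_PUNCTUATION)


print(repair_decoded("companies that aren\u0019t sharp-eyed"))
```

More entries can be added to the dict if other control characters turn up in the same data.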

Answer 1 (accepted)

One option is to pass it through ast.literal_eval:

import ast
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
r = ast.literal_eval(f'"{s}"')
print(r)

Output:

Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil well, they usually end up facing extinction. ♡
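To see why two of the characters appear to vanish from the output, it may help to list what the decoded string actually contains. This is only a sketch: the helper name `describe` and the shortened sample string are not from the answer. It also relies on the text containing no double quotes, since a quote would break the literal handed to `ast.literal_eval`.

```python
import ast
import unicodedata


def describe(text):
    """List code point and Unicode category for every char that is non-printable or non-ASCII."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in text
        if not ch.isprintable() or ord(ch) > 0x7F
    ]


# Shortened sample, same escapes as in the answer.
s = r"aren\u0019t evil \u0014 well \u2661"
decoded = ast.literal_eval(f'"{s}"')

for code_point, category in describe(decoded):
    print(code_point, category)
```

The two escapes from the original string come out in category Cc (control), which terminals typically do not render, while the heart is in category So (symbol) and prints normally.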

Comments on answer 1

bereal (2020-01-14 15:33:56Z): This doesn't seem to work.

Error - Syntactical Remorse (2020-01-14 15:34:16Z): @bereal Missed some quotes. Try now.

Maarten Bodewes (2020-01-14 15:36:47Z): Of course, Unicode 19 is not a printable character, so it is still missing in action.

Error - Syntactical Remorse (2020-01-14 15:37:02Z): Yeah. I added a heart to the end for that reason... @FelipeFaria The Unicode characters in the OP's string aren't printable.

bereal (2020-01-14 15:44:22Z): @Error-SyntacticalRemorse Without the r prefix the code example does not really prove the solution, because all the escapes are already substituted before literal_eval is called :)
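The point about the r prefix can be checked directly. In this sketch (variable names are illustrative), the raw literal keeps the backslash sequence as six separate characters, while the normal literal is resolved by the Python parser before any function sees it.

```python
import ast

raw = r"aren\u0019t"    # backslash, 'u' and four digits are kept: 11 characters
cooked = "aren\u0019t"  # escape resolved by the parser: 6 characters

print(len(raw), len(cooked))
print(raw == cooked)

# Only the raw version gives literal_eval anything to decode.
print(ast.literal_eval(f'"{raw}"') == cooked)
```

So a demonstration without the r prefix would hand literal_eval a string that no longer contains any escapes.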

Answer 2

Somehow, the Unicode escape string is 2000 hex off the mark. Unicode dash and apostrophe are:

Unicode Character 'EM DASH' (U+2014)

and

Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
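As a quick sanity check of the 2000 hex offset, the standard `unicodedata` module can name the characters that the shifted code points land on. The constant name `OFFSET` is just for illustration.

```python
import unicodedata

OFFSET = 0x2000  # the shift suggested above

for bad in (0x0019, 0x0014):
    fixed = chr(bad + OFFSET)
    print(f"U+{bad:04X} -> U+{ord(fixed):04X} {unicodedata.name(fixed)}")
```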

So let's fix it anyway, even though the error is at the source (them), not the destination:

import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'
# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = ''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the line
s += text[off:]
print(s)

prints out

Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.

Note, though, that I have happily ignored any possible presence of \\u00xx in the text (where the backslash itself is escaped); that is something I will leave for you to solve. Any correct Unicode escapes in the text will, of course, be altered as well.
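One possible way to handle both caveats, as a sketch rather than a definitive fix: skip escapes whose backslash is itself escaped, and shift only code points in the control range. The function name `fix_escapes`, its parameters, and the assumption that every escape below U+0020 is a mangled U+20xx character are all mine, not part of the original approach.

```python
import re

# Group 1 is an even-length run of backslashes (escaped backslashes) that
# is kept untouched. The single backslash after it starts the \uHHHH
# escape. The lookbehind stops a match from starting inside a run.
ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\u([0-9a-fA-F]{4})')


def fix_escapes(text, offset=0x2000, limit=0x20):
    """Decode \\uHHHH escapes, shifting code points below `limit` by `offset`."""

    def repl(match):
        code = int(match.group(2), 16)
        # Assumption: only control-range values are mangled, so escapes
        # that are already correct (for example \u2661) are decoded as-is.
        if code < limit:
            code += offset
        return match.group(1) + chr(code)

    return ESCAPE.sub(repl, text)


print(fix_escapes(r'aren\u0019t plain evil \u0014 well \u2661'))
```

An escape preceded by an escaped backslash, such as `\\u0019`, is left exactly as it was.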

"Conversely, companies that aren't sharp-eyed enough to see that their character encoding routines are lame, tired, or just plain evil - well, they usually end up facing extinction."

Error - Syntactical Remorse · Accepted Answer · 2020-01-14 15:31:51Z


edited Jan 14, 2020 at 15:45

answered Jan 14, 2020 at 15:31


