How to decode strings saved in utf-8 format

Question 1

I'm trying to decode the strings in the list below. They were all encoded as UTF-8.

_str = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

Expected output:

['.The vicar', ':--In the', 'cathedral']

My attempts

>>> for x in _str:
...     x.decode('string_escape')
...     print x
...
'."\n\nThe vicar\''
."
The vicar'
':--\n\nIn the'
:--
In the
'cathedral'
cathedral
>>> print [x.decode('string_escape') for x in _str]
['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

Both attempts failed. Any ideas?

Accepted Answer · Ammar · 2014-04-08 14:11:44Z

So you want to remove some characters from each string in your list. That can be done with a simple regex, as in the following:

import re
print [re.sub(r'[."\'\n]','',x) for x in _str]

This regex removes every occurrence of ., ", ' and \n, and the result will be:

['The vicar', ':--In the', 'cathedral']

Hope this helps.
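
For readers on Python 3 (the snippet above uses the Python 2 `print` statement), the same substitution can be sketched roughly as follows. The names `strs` and `strip_chars` are made up for illustration and are not from the original post; the data is the list from the question.

```python
import re

# Sample data from the question. The strings hold real newline and
# quote characters; the backslashes only appear in the source/repr.
strs = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

def strip_chars(items, pattern=r'[."\'\n]'):
    """Return a new list with every character matched by `pattern` removed."""
    return [re.sub(pattern, '', s) for s in items]

cleaned = strip_chars(strs)
print(cleaned)  # ['The vicar', ':--In the', 'cathedral']
```

Note that this character class also removes the leading `.`, so the result differs from the expected output in the question for the first item.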


3 Comments

Tiger1

I want to retain all the punctuation marks. Sorry, I did not state that in my question or expected output. There are too many punctuation marks, and I don't know of any method that decodes automatically, other than selectively removing unwanted characters with a regex.

2014-04-08T14:26:45.57Z

Ammar

Any characters you want to keep, leave out of the regex. So if you want the . to stay in the output, as in your last edit, make the regex ["\'\n]

2014-04-08T14:41:30.437Z
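
A quick way to check the narrower character class from the comment above, written as a Python 3 sketch (the variable names `strs` and `kept` are illustrative only):

```python
import re

# Sample data from the question.
strs = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

# Only the two quote characters and the newline are in the class,
# so '.', ':' and '-' are left untouched.
kept = [re.sub(r'["\'\n]', '', s) for s in strs]
print(kept)  # ['.The vicar', ':--In the', 'cathedral']
```

With this pattern the result matches the expected output given in the question.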

Tiger1

You are right about that, but my dataset is too large and the characters are numerous. If there's no standard method for decoding, I'll have to build a punctuation list and adopt your solution. Thanks a lot, bro.

2014-04-08T14:44:38.3Z
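
On the concern about building a punctuation list: since the pattern only names the characters to delete, the punctuation to keep never has to be enumerated. As an alternative to the regex (not something proposed in the thread), here is a minimal Python 3 sketch using `str.maketrans` and `str.translate`. The names `UNWANTED`, `DELETE_TABLE` and `clean` are hypothetical, and including `\r` is an assumption in case the data has Windows line endings. In Python 2, the byte-string `translate` method has a different signature.

```python
# Characters to delete; anything not listed here, including all
# punctuation, passes through unchanged.
UNWANTED = '"\'\n\r'

# Passing a third argument to str.maketrans maps each of those
# characters to None, which str.translate treats as "delete".
DELETE_TABLE = str.maketrans('', '', UNWANTED)

def clean(items):
    """Return a new list with the unwanted characters stripped from each string."""
    return [s.translate(DELETE_TABLE) for s in items]

strs = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']
result = clean(strs)
print(result)  # ['.The vicar', ':--In the', 'cathedral']
```

For a large dataset this avoids compiling or re-running a regex per string, though whether it is measurably faster would need to be checked on the actual data.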
