I'm trying to decode the strings in the list below. They were all encoded as UTF-8.
_str = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']
Expected output:
['.The vicar', ':--In the', 'cathedral']
My attempts:
>>> for x in _str:
...     x.decode('string_escape')
...     print x
...
'."\n\nThe vicar\''
."
The vicar'
':--\n\nIn the'
:--
In the
'cathedral'
cathedral
>>> print [x.decode('string_escape') for x in _str]
['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']
Both attempts failed. Any ideas?
asked Apr 8, 2014 at 13:55
Tiger1
1 Answer
So you want to remove some characters from the strings in your list. That can be done with a simple regex, as in the following:
import re
print [re.sub(r'[."\'\n]','',x) for x in _str]
This regex removes every occurrence of the characters ., ", ' and \n, so the result is:
['The vicar', ':--In the', 'cathedral']
Hope this helps.
answered Apr 8, 2014 at 14:11
Ammar
3 Comments
Tiger1
I want to retain all the punctuation marks. Sorry, I did not state that in my question or expected output. There are too many punctuation marks, and I don't know of any method that decodes automatically, other than selectively removing unwanted characters with a regex.
Ammar
Any characters you want to keep, just leave out of the regex. So if you want the . to be in the output, as in your last edit, make the regex ["\'\n]
Tiger1
You are right about that, but my dataset is too large and the characters are numerous. If there's no standard method for decoding, then I'll have to build a punctuation list and adopt your solution. Thanks a lot, bro.
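For reference, the narrower character class suggested in the comments can be sketched as follows. This assumes that only double quotes, single quotes and newlines are unwanted, as in the question's expected output. The pattern is compiled once, which is a common choice when the same substitution runs over a large list; the names pattern and cleaned are illustrative only.

```python
import re

_str = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

# Remove only double quotes, single quotes and newlines;
# '.', ':' and '-' are not in the class, so they are kept.
pattern = re.compile(r'["\'\n]')
cleaned = [pattern.sub('', x) for x in _str]
print(cleaned)
```

Any further unwanted characters would have to be added to the class by hand.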