I have three APIs that return JSON data into three dictionary variables. I take some of the values from the dictionaries and read them into the list valuelist for processing. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:
wordlist = [s.translate(None, string.punctuation)for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)
Is there a way around this? Either by encoding the unicode or a replacement for string.translate?
5 Answers
The translate method works differently on Unicode objects than on byte-string objects:
>>> help(unicode.translate)
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.
So your example would become:
import string
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]
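For readers on Python 3, where every str is already Unicode, the same idea can be sketched with str.maketrans, whose third argument lists the characters to delete. This is a minimal sketch, not part of the original answer, and the sample value_list is made up for illustration.

```python
import string

# The third argument of str.maketrans lists characters that map to None,
# i.e. characters that translate() will delete.
remove_punctuation_map = str.maketrans("", "", string.punctuation)

value_list = ["Hello, world!", "it's (only) a test."]  # hypothetical sample data
word_list = [s.translate(remove_punctuation_map) for s in value_list]
print(word_list)  # ['Hello world', 'its only a test']
```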
Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
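If non-ASCII punctuation matters for your data, one possible approach is to build the table from the Unicode character database instead of string.punctuation. The sketch below assumes Python 3 (in Python 2 you would use unichr instead of chr), and strip_unicode_punctuation is a hypothetical helper name. It treats every code point whose general category starts with "P" as punctuation, which may be broader or narrower than what your use case needs.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)
# to None so that translate() deletes it. dict.fromkeys defaults values to None.
unicode_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def strip_unicode_punctuation(text):
    """Hypothetical helper: remove all Unicode punctuation from text."""
    return text.translate(unicode_punct_map)

print(strip_unicode_punctuation("«Hello», world… ¿sí?"))  # Hello world sí
```

Building the table scans the whole code point range once, so it is worth constructing it a single time and reusing it.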
I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.
>>> import re
>>> s1="this.is a.string, with; (punctuation)."
>>> s1
'this.is a.string, with; (punctuation).'
>>> re.sub(r"[.\t,:;()]", "", s1)
'thisis astring with punctuation'
>>>
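Listing the punctuation characters by hand in the pattern is easy to get wrong. As a sketch (assuming Python 3, and that ASCII punctuation is all you need to remove), the character class can be generated from string.punctuation, with re.escape taking care of characters such as ] and \ that are special inside a class. The name punctuation_re is just an illustrative choice.

```python
import re
import string

# re.escape backslash-escapes regex metacharacters, so the whole
# punctuation set can be placed inside a character class safely.
punctuation_re = re.compile("[" + re.escape(string.punctuation) + "]")

s1 = "this.is a.string, with; (punctuation)."
print(punctuation_re.sub("", s1))  # thisis astring with punctuation
```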
2 Comments
string.translate is deprecated in favor of the str.translate method; the translate method (which the OP is using) is still usable.
In this version you can map one set of letters to another:
def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
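On Python 3 the same kind of table can be built in one call. This is a sketch under that assumption, not the answer's original code: str.maketrans with two equal-length strings maps each character of the first to the character at the same position in the second.

```python
def trans(to_translate):
    # Each letter of the first string is mapped to the letter at the
    # same index in the second; characters not listed are left untouched.
    translate_table = str.maketrans("привет", "тевирп")
    return to_translate.translate(translate_table)

print(trans("привет"))  # тевирп
```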
Python's re module allows a function to be used as the replacement argument; it should take a Match object and return a suitable replacement string. We can use this to build a custom character translation function:
import re
def mk_replacer(oldchars, newchars):
    """A function to build a replacement function"""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        """A replacement function to pass to re.sub()"""
        return mapping.get(match.group(0), "")
    return replacer
An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:
>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'
As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.
A Unicode example (in Python 2, where \W without the re.UNICODE flag also matches non-ASCII letters):
>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
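In Python 3, str patterns are Unicode-aware by default, so \W no longer matches Cyrillic letters and the call above would return the input unchanged. A sketch of an equivalent that works there, assuming an explicit character range is acceptable for your input, repeats the mk_replacer helper so that it runs on its own:

```python
import re

def mk_replacer(oldchars, newchars):
    """Build a replacement function for re.sub() from two parallel strings."""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        # Matched characters without a mapping are deleted.
        return mapping.get(match.group(0), "")
    return replacer

# Match lower-case Cyrillic letters explicitly instead of relying on \W.
result = re.sub("[а-я]", mk_replacer("еипртв", "EIPRTV"), "привет")
print(result)  # PRIVET
```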
As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:
from collections import defaultdict
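And then for the translation, say you'd like to remove '@' and '\r' characters. Note that the keys of the map must be Unicode ordinals, so each character is wrapped in ord():
remove_chars_map = defaultdict()
remove_chars_map[ord('@')] = None
remove_chars_map[ord('\r')] = None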
new_string = old_string.translate(remove_chars_map)
And an example:
old_string = u"word1@\r word2@\r word3@\r"
new_string = u"word1 word2 word3"
Both '@' and '\r' are removed.
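A self-contained variant of this answer, as a sketch assuming Python 3: translate() looks characters up by code point, so the table is keyed by ord() values, and dict.fromkeys gives every key the value None, which means "delete".

```python
# Keys are code points; every value defaults to None (delete the character).
remove_chars_map = dict.fromkeys(map(ord, "@\r"))

old_string = "word1@\r word2@\r word3@\r"
new_string = old_string.translate(remove_chars_map)
print(new_string)  # word1 word2 word3
```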
s.encode('utf-8').translate(None, string.punctuation) worked for me.
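Keep in mind that this approach yields a byte string rather than Unicode text. A rough Python 3 counterpart, sketched here with a hypothetical helper name, encodes, deletes the ASCII punctuation bytes, and decodes again:

```python
import string

def strip_ascii_punctuation(s):
    """Hypothetical helper: remove ASCII punctuation via a bytes round trip."""
    data = s.encode("utf-8")
    # bytes.translate(None, delete) removes every byte listed in `delete`.
    # ASCII punctuation bytes never occur inside a multi-byte UTF-8
    # sequence, so non-ASCII characters survive the round trip intact.
    cleaned = data.translate(None, string.punctuation.encode("ascii"))
    return cleaned.decode("utf-8")

print(strip_ascii_punctuation("héllo, wörld!"))  # héllo wörld
```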