string.translate() with unicode data in python

Question 1

I have three APIs that return JSON data into three dictionary variables. I take some of the values from the dictionaries and read them into the list valuelist for processing. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:

 wordlist = [s.translate(None, string.punctuation)for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)

Is there a way around this? Either by encoding the unicode or a replacement for string.translate?

Question 2

s.encode('utf-8').translate(None, string.punctuation) worked for me.
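A minimal sketch of that round trip, assuming Python 3 (where the text type is str and the byte type is bytes). The function name strip_ascii_punctuation_via_bytes is made up for illustration. Deleting ASCII bytes from UTF-8 data is safe because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```python
import string


def strip_ascii_punctuation_via_bytes(text):
    """Remove ASCII punctuation by translating the UTF-8 encoded bytes."""
    raw = text.encode("utf-8")
    # bytes.translate(None, delete) removes every byte listed in `delete`
    # and leaves all other bytes unchanged.
    cleaned = raw.translate(None, string.punctuation.encode("ascii"))
    # Decode again so the caller gets text back rather than bytes.
    return cleaned.decode("utf-8")


print(strip_ascii_punctuation_via_bytes("héllo, wörld!"))
```

Note that in Python 2 the one-liner in the comment above returns a byte string, so a final .decode('utf-8') would be needed to get a unicode object back.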

Question 3

@Suzana_K Thank you! This was the simplest solution for me.

Question 4

related: Remove punctuation from Unicode formatted strings

Question 5

The translate method works differently on Unicode objects than on byte-string objects:

>>> help(unicode.translate)
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.

So your example would become:

remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]

Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
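If the wider set of Unicode punctuation matters for your data, one possible approach is to build the table from Unicode general categories instead of string.punctuation. This is only a sketch, assuming Python 3; the names PUNCT_TABLE and strip_unicode_punctuation are invented for the example. It treats every code point whose category starts with "P" as punctuation, which is not the same set as string.punctuation: characters such as $ or + are classified as symbols ("S" categories) and are left alone.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)
# to None, which tells translate() to delete it. Built once at import time.
PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)


def strip_unicode_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(PUNCT_TABLE)


print(strip_unicode_punctuation("«Hello», world… ¿qué?"))
```

Building the table scans the whole code point range once, so it is worth keeping it as a module-level constant rather than rebuilding it per call.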

Question 6

dict.fromkeys(map(ord, string.punctuation))
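For comparison, here is a short sketch (assuming Python 3) showing that this one-liner builds the same table as the dict(...) expression in the answer, and that str.maketrans with three arguments produces it as well. The variable names are illustrative only.

```python
import string

# Explicit generator expression, as in the answer.
table_a = dict((ord(char), None) for char in string.punctuation)

# dict.fromkeys defaults every value to None.
table_b = dict.fromkeys(map(ord, string.punctuation))

# Python 3 only: the third argument lists characters to delete.
table_c = str.maketrans("", "", string.punctuation)

print(table_a == table_b == table_c)
print("a.b,c!".translate(table_b))
```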

Question 7

This is by far the best of all the answers here. Thanks.

Question 8

Thanks! Just to note, I think this implies import string.

Question 9

I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.

 >>> import re
 >>> s1="this.is a.string, with; (punctuation)."
 >>> s1
 'this.is a.string, with; (punctuation).'
 >>> re.sub(r"[.\t,:;()]", "", s1)
 'thisis astring with punctuation'
 >>>
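Rather than listing the characters by hand, the character class can be generated from string.punctuation. The following is a sketch assuming Python 3; PUNCT_RE and strip_punct_regex are hypothetical names. re.escape is used so that characters such as ], ^, - and the backslash are safe inside the brackets.

```python
import re
import string

# Character class that matches any single ASCII punctuation character.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punct_regex(text):
    """Remove ASCII punctuation using a precompiled regular expression."""
    return PUNCT_RE.sub("", text)


print(strip_punct_regex("this.is a.string, with; (punctuation)."))
```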

Question 10

The translate function works great in Python 2.7 and is computationally faster than a regex. I may have no other option though. Thanks.

Question 11

The module function string.translate is deprecated in favor of the method str.translate; the translate method (which the OP is using) is still usable.

Question 12

With this version you can map each letter to another letter:

def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    # translate() on unicode objects expects ordinals as keys
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
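In Python 3 the same character-to-character table can be built with str.maketrans, which does the ord() conversion itself. This is a sketch under that assumption; make_translator is a made-up helper name, and both arguments must have the same length.

```python
def make_translator(source_chars, target_chars):
    """Return a function mapping each source character to its target."""
    # With two arguments, str.maketrans pairs the characters up
    # position by position and returns a dict keyed by ordinals.
    table = str.maketrans(source_chars, target_chars)

    def translate(text):
        return text.translate(table)

    return translate


trans = make_translator("привет", "тевирп")
print(trans("привет"))
```

Characters that are not in the source set pass through unchanged, as with the dict-based version above.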

Question 13

Python's re module allows a function to be used as the replacement argument; it should take a match object and return a suitable replacement string. We can use this to build a custom character translation function:

import re

def mk_replacer(oldchars, newchars):
    """A function to build a replacement function"""
    mapping = dict(zip(oldchars, newchars))

    def replacer(match):
        """A replacement function to pass to re.sub()"""
        return mapping.get(match.group(0), "")

    return replacer

An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:

>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'

As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.

A Unicode example (Python 2, where \W without the re.UNICODE flag also matches non-ASCII letters):

>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
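In Python 3, patterns on str are Unicode-aware by default, so \W no longer matches Cyrillic letters and the pattern above would leave the string untouched. One way around this, sketched here with the hypothetical helper translate_chars, is to build the character class from the characters being replaced. It assumes oldchars is non-empty and the same length as newchars.

```python
import re


def translate_chars(text, oldchars, newchars):
    """Replace each character of oldchars with its partner in newchars."""
    mapping = dict(zip(oldchars, newchars))
    # Match exactly the characters that have a replacement.
    pattern = re.compile("[" + re.escape(oldchars) + "]")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


print(translate_chars("привет", "еипртв", "EIPRTV"))
```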

Question 14

As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:

from collections import defaultdict

And then for the translation, say you'd like to remove '@' and '\r' characters:

remove_chars_map = defaultdict()
# keys must be Unicode ordinals, not the characters themselves
remove_chars_map[ord('@')] = None
remove_chars_map[ord('\r')] = None
new_string = old_string.translate(remove_chars_map)

And an example:

old_string = "word1@\r word2@\r word3@\r"

new_string = "word1 word2 word3"

'@' and '\r' removed
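One detail worth checking when building such a map: translate() looks characters up by their ordinal, so a table keyed by the characters themselves is silently ignored. A small sketch, assuming Python 3, with illustrative variable names:

```python
old_string = "word1@\r word2@\r word3@\r"

# Keyed by characters: translate() never finds these keys.
by_char = {"@": None, "\r": None}

# Keyed by ordinals: this is what translate() looks up.
by_ordinal = {ord("@"): None, ord("\r"): None}

unchanged = old_string.translate(by_char)
new_string = old_string.translate(by_ordinal)

print(repr(unchanged))
print(repr(new_string))
```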

Accepted answer (quoted in full earlier in this thread): Simon Sapin · 2012-07-27 18:50:26Z
