Compare 2 strings without considering accents in Python [duplicate]

Question 1

I would like to compare 2 strings and get True if they are identical when accents are ignored.

Example: I would like the following code to print 'Bonjour':

if 'séquoia' in 'Mon sequoia est vert':
    print('Bonjour')

Question 2

Convert to fully decomposed normal form, remove accents, compare.
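
A minimal sketch of those three steps, assuming Python 3 and only the standard library (the helper name strip_accents is made up for illustration): NFD splits each accented letter into a base letter plus combining marks, and unicodedata.combining() identifies the marks to drop.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for base characters and non-zero for accents.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('séquoia') in 'Mon sequoia est vert')  # True
```

Unlike an ASCII filter, this keeps spaces, digits and non-Latin letters untouched.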

Question 3

Linked: stackoverflow.com/questions/517923/…

Question 4

You should use the unidecode function from the Unidecode package:

from unidecode import unidecode
if unidecode(u'séquoia') in 'Mon sequoia est vert':
    print('Bonjour')

Question 5

You should take a look at Unidecode. With the unicodedata module and the method below, you can get a string without accents and then make your comparison:

import string
import unicodedata
def remove_accents(data):
    # Decompose accented characters, keep only ASCII letters, then lowercase.
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
if remove_accents('séquoia') in 'Mon sequoia est vert':
    # Do something
    pass

Reference from stackoverflow

Question 6

This would not work if the word was "séQuoIa" since the remove_accents method makes all of the characters lowercase.
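
One way around the case issue, sketched here on the assumption that both strings may contain accents and mixed case, is to run the same normalization on both sides before comparing. The helper name fold is hypothetical, and str.casefold() needs Python 3.

```python
import unicodedata

def fold(text):
    """Strip accents (NFKD, then drop combining marks) and casefold."""
    decomposed = unicodedata.normalize('NFKD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

needle = 'séQuoIa'
haystack = 'Mon Sequoia est vert'
print(fold(needle) in fold(haystack))  # True
```

Because the helper is applied to the needle and the haystack alike, it does not matter which of the two carries the accents or the capitals.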

Question 7

(sorry, late to the party!!)

How about instead, doing this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'î ï í ī į ì').encode('ASCII', 'ignore').decode('ascii')
'i i i i i i'

No need to loop over anything. @Maxime Lorant's answer is very inefficient.

>>> import timeit
>>> code = """
import string, unicodedata
def remove_accents(data):
 return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
"""
>>> timeit.timeit("remove_accents('séquoia')", setup=code)
3.6028339862823486
>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup='import unicodedata')
0.7447490692138672

Hint: lower is better (timeit reports the total time in seconds for its default 1,000,000 runs)

Also, I'm sure the unidecode package that @Suor suggested has other advantages, but it is still very slow compared to the native option, which requires no third-party libraries.

>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup="import unicodedata")
0.7662729263305664
>>> timeit.timeit("unidecode.unidecode('séquoia')", setup="import unidecode")
7.489392042160034

Hint: lower is better

Putting it all together:

import unicodedata
clean_text = unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore').decode('ascii')
if clean_text in 'Mon sequoia est vert':
    ...
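
One caveat worth knowing about, shown as a small sketch (the helper name to_ascii is made up for illustration): encoding with 'ignore' silently drops any character that has no ASCII base letter after decomposition, so letters such as ø or ß vanish rather than being transliterated.

```python
import unicodedata

def to_ascii(text):
    """NFKD-normalize, then drop every character that is not ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii('séquoia'))  # sequoia
print(to_ascii('Søren'))    # Sren  (ø has no decomposition, so it is dropped)
print(to_ascii('Straße'))   # Strae (ß is dropped as well)
```

For French text like the example in the question this is usually not a problem, but it may matter for other languages.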

Question 8

I tried this with Python 3.12. If we encode to ASCII at all, we also need to decode the result back to a str to be able to use the in operator, because bytes can't be on the left side of in when the right side is a str. This works: unicodedata.normalize('NFKD', 'î ï í ī į ì Í').encode('ASCII', 'ignore').decode('ASCII') I think the answer above is for Python 2.x
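
A short sketch of that behaviour, assuming Python 3 (the variable names are only for illustration): without the final decode the membership test raises TypeError, with it the test succeeds.

```python
import unicodedata

as_bytes = unicodedata.normalize('NFKD', 'séquoia').encode('ascii', 'ignore')
as_text = as_bytes.decode('ascii')

try:
    as_bytes in 'Mon sequoia est vert'
    bytes_ok = True
except TypeError:
    # Python 3: the left operand of "in" must be str when the right one is str.
    bytes_ok = False

print(bytes_ok)                           # False
print(as_text in 'Mon sequoia est vert')  # True
```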

Question 9

@ArpadHorvath--СлаваУкраїні It is for py 2.x, this is over 4 years old :P notice the u in front of the strings

Suor · Accepted Answer · 2013-12-22 13:24:16Z


edited Dec 22, 2013 at 13:34 by vikingosegundo; answered Dec 22, 2013 at 13:24 by Suor
