14

I would like to compare two strings and get True if they are identical apart from their accents.

Example: I would like the following code to print 'Bonjour':

if 'séquoia' in 'Mon sequoia est vert':
    print('Bonjour')
Arnaud P
asked Dec 22, 2013 at 13:14
2
  • 1
    Convert to fully decomposed normal form, remove accents, compare. Commented Dec 22, 2013 at 13:17
  • Linked: stackoverflow.com/questions/517923/… Commented May 22, 2020 at 17:20
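
The approach described in the first comment (decompose, remove accents, compare) could be sketched roughly as follows. This is only an illustration for Python 3: `strip_accents` is a hypothetical helper name, not something from the question, and it assumes that "accents" means Unicode combining marks. Characters with no decomposition (for example 'ø') would pass through unchanged.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: NFD splits each accented character into a base
    # character followed by its combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep only the characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

if strip_accents('séquoia') in 'Mon sequoia est vert':
    print('Bonjour')
```

Only the left-hand string is cleaned here, as in the question; if the right-hand string may also contain accents, it would need the same treatment.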

3 Answers

15

You should use the unidecode function from the Unidecode package:

from unidecode import unidecode
if unidecode(u'séquoia') in 'Mon sequoia est vert':
    print('Bonjour')
vikingosegundo
answered Dec 22, 2013 at 13:24

6

You should take a look at Unidecode. With the module and this method, you can get a string without accents and then make your comparison:

import string
import unicodedata

def remove_accents(data):
    # Decompose accented characters, keep only ASCII letters, then lowercase.
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()

if remove_accents(u'séquoia') in 'Mon sequoia est vert':
    # Do something
    pass

Reference from stackoverflow

answered Dec 22, 2013 at 13:18

1 Comment

This would not work if the word was "séQuoIa" since the remove_accents method makes all of the characters lowercase.
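
One possible way around the case issue raised in this comment is to normalize both strings the same way before comparing. The sketch below assumes Python 3 and that a case-insensitive match is actually wanted; `fold` and `contains_folded` are hypothetical names introduced for illustration.

```python
import unicodedata

def fold(text):
    # Hypothetical helper: drop accents via NFKD + ASCII encoding,
    # then casefold so that 'Q' and 'q' compare equal.
    ascii_bytes = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
    return ascii_bytes.decode('ascii').casefold()

def contains_folded(needle, haystack):
    # Apply the same folding to both sides before the substring test.
    return fold(needle) in fold(haystack)

if contains_folded('séQuoIa', 'Mon Sequoia est vert'):
    print('Bonjour')
```

Note that the ASCII step silently discards any character that has no ASCII decomposition, so this is probably only suitable for Latin-script text.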
6

(sorry, late to the party!!)

How about instead, doing this:

>>> unicodedata.normalize('NFKD', 'î ï í ī į ì').encode('ASCII', 'ignore').decode('ascii')
'i i i i i i'

No need to loop over anything. @Maxime Lorant's answer is very inefficient.

>>> import timeit
>>> code = """
import string, unicodedata
def remove_accents(data):
 return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
"""
>>> timeit.timeit("remove_accents('séquoia')", setup=code)
3.6028339862823486
>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup='import unicodedata')
0.7447490692138672

Hint: less is better

Also, I'm sure the package unidecode @Seur suggested has other advantages, but it is still very slow compared to the native option that requires no 3rd party libraries.

>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup="import unicodedata")
0.7662729263305664
>>> timeit.timeit("unidecode.unidecode('séquoia')", setup="import unidecode")
7.489392042160034

Hint: less is better

Putting it all together:

import unicodedata

clean_text = unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore').decode('ascii')
if clean_text in 'Mon sequoia est vert':
    ...
answered Aug 14, 2018 at 11:51

2 Comments

I tried this with Python 3.12. If we encode to ASCII at all, we also need to decode the result again to be able to use the in operator, because bytes can't be on the left side of in when the right side is a str: unicodedata.normalize('NFKD', 'î ï í ī į ì Í').encode('ASCII', 'ignore').decode('ASCII'). I think the answer above is for Python 2.x.
@ArpadHorvath--СлаваУкраїні It is for py 2.x, this is over 4 years old :P notice the u in front of the strings
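
To make the point from these two comments concrete, here is a small sketch, assuming Python 3, of why the final decode step matters. The variable names are illustrative only.

```python
import unicodedata

normalized = unicodedata.normalize('NFKD', 'séquoia')
as_bytes = normalized.encode('ascii', 'ignore')  # bytes object in Python 3
as_text = as_bytes.decode('ascii')               # back to str

try:
    # A bytes object cannot be searched for inside a str.
    as_bytes in 'Mon sequoia est vert'
except TypeError as exc:
    print('bytes on the left side fails:', exc)

# After decoding, the substring test works as intended.
print(as_text in 'Mon sequoia est vert')
```

In Python 2 the encode step returns a plain str, which is why the decode was not needed there.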
