An anagram is a word, a phrase, or a sentence formed from another by rearranging its letters. An example of this is "angel", which is an anagram of "glean".
Write a function that receives an array of words, and returns the total number of distinct pairs of anagramic words inside it.
Some examples:
There are 2 anagrams in the array
["dell", "ledl", "abc", "cba"]
There are 7 anagrams in the array
["dell", "ledl", "abc", "cba", "bca", "bac"]
My code is usually not optimal, and I would like to improve my coding practices, and coding sense which is why I try to put up for review all nontrivial code I write.
My Code
from collections import Counter
import math
def choose(n, k): #Each distinct anagram pair is a selection from the set of all words that have the same count.
f = math.factorial
return (f(n)//(f(n-k)*f(k)))
def anagram_counter(words):
words = list(set(words)) #Suppress duplicates.
unique = set() #Set of unique words.
count = {} #Dictionary that stores the count for each word.
unique = set()
for word in words:
#The current word is not an anagram of any word already in the set.
wordID = Counter(word)
if not unique or all((wordID != count[distinct][1] for distinct in unique)):
unique.add(word)
count[word] = [1,wordID] #A tuple containing number of anagrams of a word and its `wordID`.
else: #If the current word is an anagram of a word already in the set.
for distinct in list(unique):
if count[distinct][1] == wordID: #If the word is an anagram of a preexisting word.
count[distinct][0] += 1 #Increment the counter.
break
return 0 if count == {} else sum((choose(itm[0], 2) for itm in count.values() if itm[0] > 1))
2 Answers 2
You are making your life too difficult, IMO. Whenever you iterate over all members of a set
to see if some element is in it or write list(unique)
, you are probably doing something wrong.
I would just transform each word into a canonical form (you could choose a frozenset
of the Counter
items or just a sorted string). Then just count how often each appears:
def anagram_counter(words):
count = Counter(frozenset(Counter(word).items()) for word in words)
return sum(choose(x, 2) for x in count.values() if x > 1)
def anagram_counter2(words):
count = Counter("".join(sorted(word)) for word in words)
return sum(choose(x, 2) for x in count.values() if x > 1)
You could optimize the last line by using Counter.most_common
and stopping as soon as you get to the elements that appeared only once:
from itertools import takewhile
def anagram_counter3(words):
count = Counter("".join(sorted(word)) for word in words)
return sum(choose(x[1], 2)
for x in takewhile(lambda t: t[1] > 1, count.most_common()))
Comparing the timings for some small input:
x = ["foo", "bar", "oof", "rab", "foobar"]
%timeit anagram_counter(x)
# 27.2 μs ± 1.4 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit anagram_counter2(x)
# 9.71 μs ± 656 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter3(x)
# 11.9 μs ± 492 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter_op(x)
# 25.6 μs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And for some larger inputs:
import random
import string
import numpy as np
# increasing number of words, always 5 letters
x1 = [["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(n)]
for n in np.logspace(1, 4, num=10, dtype=int)]
# increasing length of words, always 500 words
x2 = [["".join(random.choices(string.ascii_lowercase, k=n)) for _ in range(500)]
for n in np.logspace(1, 4, num=10, dtype=int)]
(Note that both axis are logarithmic on both plots.)
Graipher answer is nice, but there is one possible inefficiency not taken into account: choice.
If you have a lot of anagrams, it's better to replace the generic version with the explicit formula for pair:
def count_pairs(n):
return (n * (n-1)) // 2
here some timings, with a big list with only a few different canonical anagrams:
def random_anagram(w):
l = w[:]
random.shuffle(l)
return "".join(l)
base_anagrams = [random.choices(string.ascii_lowercase, k=30) for i in range(4)]
x4 = [random_anagram(random.choice(base_anagrams)) for _ in range(100000)]
def anagram_counter5(words):
count = Counter("".join(sorted(word)) for word in words)
return sum(count_pairs(x) for x in count.values() if x > 1)
gives on my machine
%timeit anagram_counter2(x)
353 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit anagram_counter5(x)
253 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Explore related questions
See similar questions with these tags.