Number of anagrams in an array of words (Codewars)

Question 1

An anagram is a word, a phrase, or a sentence formed from another by rearranging its letters. An example of this is "angel", which is an anagram of "glean".

Write a function that receives an array of words, and returns the total number of distinct pairs of anagramic words inside it.

Some examples:

There are 2 anagrams in the array ["dell", "ledl", "abc", "cba"]

There are 7 anagrams in the array ["dell", "ledl", "abc", "cba", "bca", "bac"]

My code is usually not optimal, and I would like to improve my coding practices, and coding sense which is why I try to put up for review all nontrivial code I write.

My Code

from collections import Counter
import math
def choose(n, k): #Each distinct anagram pair is a selection from the set of all words that have the same count.
 f = math.factorial
 return (f(n)//(f(n-k)*f(k)))
def anagram_counter(words):
 words = list(set(words)) #Suppress duplicates.
 unique = set() #Set of unique words.
 count = {} #Dictionary that stores the count for each word.
 unique = set()
 for word in words:
 #The current word is not an anagram of any word already in the set.
 wordID = Counter(word) 
 if not unique or all((wordID != count[distinct][1] for distinct in unique)):
 unique.add(word)
 count[word] = [1,wordID] #A tuple containing number of anagrams of a word and its `wordID`.
 else: #If the current word is an anagram of a word already in the set.
 for distinct in list(unique):
 if count[distinct][1] == wordID: #If the word is an anagram of a preexisting word.
 count[distinct][0] += 1 #Increment the counter.
 break
 return 0 if count == {} else sum((choose(itm[0], 2) for itm in count.values() if itm[0] > 1))

Question 2

You are making your life too difficult, IMO. Whenever you iterate over all members of a set to see if some element is in it or write list(unique), you are probably doing something wrong.

I would just transform each word into a canonical form (you could choose a frozenset of the Counter items or just a sorted string). Then just count how often each appears:

def anagram_counter(words):
 count = Counter(frozenset(Counter(word).items()) for word in words)
 return sum(choose(x, 2) for x in count.values() if x > 1)
def anagram_counter2(words):
 count = Counter("".join(sorted(word)) for word in words)
 return sum(choose(x, 2) for x in count.values() if x > 1)

You could optimize the last line by using Counter.most_common and stopping as soon as you get to the elements that appeared only once:

from itertools import takewhile
def anagram_counter3(words):
 count = Counter("".join(sorted(word)) for word in words)
 return sum(choose(x[1], 2)
 for x in takewhile(lambda t: t[1] > 1, count.most_common()))

Comparing the timings for some small input:

x = ["foo", "bar", "oof", "rab", "foobar"]
%timeit anagram_counter(x)
# 27.2 μs ± 1.4 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit anagram_counter2(x)
# 9.71 μs ± 656 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter3(x)
# 11.9 μs ± 492 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter_op(x)
# 25.6 μs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And for some larger inputs:

import random
import string
import numpy as np
# increasing number of words, always 5 letters
x1 = [["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(n)]
 for n in np.logspace(1, 4, num=10, dtype=int)]
# increasing length of words, always 500 words
x2 = [["".join(random.choices(string.ascii_lowercase, k=n)) for _ in range(500)]
 for n in np.logspace(1, 4, num=10, dtype=int)]

enter image description here

(Note that both axis are logarithmic on both plots.)

Question 3

Graipher answer is nice, but there is one possible inefficiency not taken into account: choice.

If you have a lot of anagrams, it's better to replace the generic version with the explicit formula for pair:

def count_pairs(n):
 return (n * (n-1)) // 2

here some timings, with a big list with only a few different canonical anagrams:

def random_anagram(w):
 l = w[:]
 random.shuffle(l)
 return "".join(l)
base_anagrams = [random.choices(string.ascii_lowercase, k=30) for i in range(4)]
x4 = [random_anagram(random.choice(base_anagrams)) for _ in range(100000)]
def anagram_counter5(words): 
 count = Counter("".join(sorted(word)) for word in words)
 return sum(count_pairs(x) for x in count.values() if x > 1)

gives on my machine

%timeit anagram_counter2(x)
353 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit anagram_counter5(x)
253 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Graipher Graipher 41.7k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2019-01-30 12:53:43Z

You are making your life too difficult, IMO. Whenever you iterate over all members of a set to see if some element is in it or write list(unique), you are probably doing something wrong.

I would just transform each word into a canonical form (you could choose a frozenset of the Counter items or just a sorted string). Then just count how often each appears:

def anagram_counter(words):
 count = Counter(frozenset(Counter(word).items()) for word in words)
 return sum(choose(x, 2) for x in count.values() if x > 1)
def anagram_counter2(words):
 count = Counter("".join(sorted(word)) for word in words)
 return sum(choose(x, 2) for x in count.values() if x > 1)

You could optimize the last line by using Counter.most_common and stopping as soon as you get to the elements that appeared only once:

from itertools import takewhile
def anagram_counter3(words):
 count = Counter("".join(sorted(word)) for word in words)
 return sum(choose(x[1], 2)
 for x in takewhile(lambda t: t[1] > 1, count.most_common()))

Comparing the timings for some small input:

x = ["foo", "bar", "oof", "rab", "foobar"]
%timeit anagram_counter(x)
# 27.2 μs ± 1.4 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit anagram_counter2(x)
# 9.71 μs ± 656 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter3(x)
# 11.9 μs ± 492 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit anagram_counter_op(x)
# 25.6 μs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And for some larger inputs:

import random
import string
import numpy as np
# increasing number of words, always 5 letters
x1 = [["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(n)]
 for n in np.logspace(1, 4, num=10, dtype=int)]
# increasing length of words, always 500 words
x2 = [["".join(random.choices(string.ascii_lowercase, k=n)) for _ in range(500)]
 for n in np.logspace(1, 4, num=10, dtype=int)]

enter image description here

(Note that both axis are logarithmic on both plots.)

Stack Exchange Network

Number of anagrams in an array of words (Codewars)

My Code

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Number of anagrams in an array of words (Codewars)

My Code

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions