Question
I feel like my code could be more elegant/more pythonic/faster, but I can't think of too much more to improve. So, I've come to the internet to see what more can be done with standard python.
What my code does
My code counts anagrams for each word in a list of words. For example:
post, spot stop, tops
are four of the words in my list and since all are anagrams of each other, each word should map to the number 4 in a dictionary. Specifically I'm counting the size of the anagram group each word would fit in. The output for a few words would look something like the following then:
{'1-2-3': 1,
'1980s': 1,
'1990s': 1,
...
...
'top': 1,
'topic': 2,
'topics': 1,
'topped': 1,
'tops': 4,
'tory': 2,
'total': 1,
'totals': 1,
'touch': 1,
'tough': 2,
...
...
'zone': 1,
'zones': 1,
'zurich': 1}
My code
from itertools import permutations
from collections import Counter
def get_variants(word):
return map(lambda t: "".join(t), set(permutations(word)))
def count_anagrams(words):
anagram_counts = {w: 1 for w in words}
word_counters = list(map(Counter, words))
for i, (word, counter) in enumerate(zip(words, word_counters)):
for other_word, other_counter in zip(words[i+1:], word_counters[i+1:]):
if counter == other_counter:
anagram_counts[word] += 1
anagram_counts[other_word] += 1
return anagram_counts
-
\$\begingroup\$ Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers . \$\endgroup\$Mast– Mast ♦2020年05月05日 14:04:17 +00:00Commented May 5, 2020 at 14:04
2 Answers 2
Performance (Language independent)
The permutations scale factorially with the word length and your loop code squares quadratically with the number of words. Both are really bad scaling factors. The nice thing to realize, that all equivalent anagrams map to the same multiset. There are python libraries for multisets, but already with the built-in tools you can go a long way. Two strings are equivalent, under "Anagram-relation", if they are the same after sorting. We will treat the sorted string as representation of our Anagram equivalence class. Since strings are immutable you can even use these represantions directly as dictionary keys.
Your code becomes then quite short
from collections import defaultdict
def count_anagrams(words):
result = defaultdict(list)
for word in words:
result[''.join(sorted(word))].append(word)
return dict(result)
This scales linearly with word number and (n * log(n)) with word length. (Under the assumption that dictionary insertion is O(1) and sorted
uses a reasonable sorting algorithm.).
The output of
count_anagrams(['stop', 'test', 'post'])
is then
{'opst': ['stop', 'post'], 'estt': ['test']}
You can change it to your needs by transforming it to your desired output with len
etc.
If you want to convert it to the exact same form as you have it, one example function would be:
def convert_output(words_in_anagram_class):
return {word: len(words)
for words in words_in_anagram_class.values()
for word in words}
Then you can call convert_output(count_anagrams(words))
. If you want to, you can combine these two functions into one.
(Although this is IMHO a far less useful representation of your data.)
Small stuff (Python nitpicks)
map
can and should be replaced with comprehensions. Especially if you cast the result into a list.
[f(x) for x in iterable]
is far nicer to read than list(map(f, iterable))
.
If you really want a generator, there are also generator expressions
(f(x) for x in iterable)
.
-
1\$\begingroup\$ Why use a sorted string rather than a counter? \$\endgroup\$Luapulu– Luapulu2020年05月05日 12:32:55 +00:00Commented May 5, 2020 at 12:32
-
2\$\begingroup\$ Conceptional problem: I don't get exactly what you want. Could you expand? Technical problem: You cannot use a counter as dict key. \$\endgroup\$mcocdawc– mcocdawc2020年05月05日 13:14:05 +00:00Commented May 5, 2020 at 13:14
-
\$\begingroup\$ So, my code has the original word, rather than the sorted version as the key and then the corresponding value would be the length (or the list of anagrams in your case). Using that key, you could either compare counters, or the sorted word. Why use one rather than the other? In general, what do I do if I want the actual word (unsorted) as the key? \$\endgroup\$Luapulu– Luapulu2020年05月05日 13:25:26 +00:00Commented May 5, 2020 at 13:25
-
3\$\begingroup\$ Nice solution. To get the format the OP wants, you just need 2 passes.. \$\endgroup\$Maarten Fabré– Maarten Fabré2020年05月05日 15:29:08 +00:00Commented May 5, 2020 at 15:29
-
\$\begingroup\$ @MaartenFabré Thank you very much. ;-) \$\endgroup\$mcocdawc– mcocdawc2020年05月05日 16:53:37 +00:00Commented May 5, 2020 at 16:53
Additionally to mcocdawc's answer, since what I mean is too much for a comment
You need an intermediate stage.
You used the list of Counters for this. But then to find anagrams in the list is expensive. A dict would be a better way, and a collections.Counter
is especially made for this purpose. Now you just need to find an acceptable representation of your word to discern anagrams. mcocdawc suggested the sorted string, since 2 anagrams result in the same response if you sort the letters. An alternative is the frozenset
of the items of a Counter. I suspect the sorted list will be faster, but you'll need to test that.
Based on mcocdawc, but without the intermediate lists:
def count_anagrams(words):
counter = Counter()
intermediate = {}
for word in words:
intermediate_key = ''.join(sorted(word))
# intermediate_key = tuple(sorted(word)) # alternatively
# intermediate_key = frozenset(Counter(word).items()) # alternatively
counter[intermediate_key] += 1
intermediate[word] = intermediate_key
return {
word: counter[intermediate_key]
for word, intermediate_key in intermediate.items()
}
I'm not saying this is better/faster than mcocdawc's answer, but I think the intermediate structure is closer to