9
\$\begingroup\$

Task:

Given a dictionary of words, and a set of characters, judge if all the characters can form the words from the dictionary, without any characters left. For example, given the dictionary {hello, world, is, my, first, program}, if the characters set is "iiifrssst", you should return 'true' because you can form {is, is, first} from the set; if the character set is "eiifrsst", you should return 'false' because you cannot use all the characters from the set.

P.S. there may be tens of thousands of words in the dictionary, and the chars set length could be up to hundreds, so I really need some efficient algorithm.

Source: http://www.careercup.com/question?id=5841660122497024

Here is my solution in Python:

from collections import Counter
from itertools import combinations_with_replacement
import copy
# Check if word can be made up with the characters in char_count
def valid_word(word):
 # build character count for word
 word_char_count = Counter(word)
 # check if character count for word
 for char in word_char_count.keys():
 if char not in char_count or word_char_count[char] > char_count[char]:
 return False
 return True
# Only return words that can be made up from characters with char_set
def prune_words(words):
 res = []
 for word in words:
 if valid_word(word):
 res.append(word)
 return res
def main(words, char_set):
 global char_count
 char_count = Counter(char_set) # count character occurrences
 # Prune words that cannot possible be made from the set of characters
 words = prune_words(words)
 words_len = [len(word) for word in words]
 # Maximum number of r with generating combinations
 max_r = len(char_set) / min(words_len)
 # Generate permutations with the valid words
 for r in range(1, max_r + 1):
 for comb in combinations_with_replacement(words, r):
 # If the total length of the combination matches length of char_set,
 # do further checks
 if sum(map(len, comb)) == len(char_set):
 temp_count = copy.deepcopy(char_count)
 # Subtract char count of each word in combination from char_set
 for word in comb:
 temp_count.subtract(word)
 # If the remaining count are all zero, then we have found an
 # answer
 if all([count == 0 for count in temp_count.values()]):
 print comb
 return True
 return False
if __name__ == '__main__':
 print main(['hello', 'world', 'is', 'my', 'first', 'program'],
 'iiifrssst')

What improvements can be made to my code (e.g. algorithmic speed, make it more 'Python', style, etc etc)? Happy to hear any constructive comments!

200_success
146k22 gold badges190 silver badges479 bronze badges
asked Mar 30, 2014 at 6:45
\$\endgroup\$
1
  • 1
    \$\begingroup\$ My initial impression is that this sounds like a Set Cover problem, which would be NP-complete. \$\endgroup\$ Commented Mar 30, 2014 at 7:06

3 Answers 3

4
\$\begingroup\$
  • Since "the chars set length could be up to hundreds", it can easily contain enough letters to form any word in an English dictionary. In such a case prune_words would not prune any words. This makes me think the pruning is an example of premature optimization.
  • Instead of using Counter to check a match, I believe it would be faster to compare sorted lists of characters, i.e. compare sorted("".join(comb)) to a pre-computed sorted(char_set).
  • Copying a Counter by Counter(char_count) is faster than by copy.deepcopy.
  • To check if all counts are zeros you don't need the list comprehension as you can simply use if not any(temp_count.itervalues())
  • You check lots of combinations that are the wrong length. Even though the length check is the first thing inside the loop, it would be better to generate the combinations in a way that allows you to back off when the combined length exceeds the length of char_set. I would use recursion to explore the combinations.
  • A more clever approach might be found from representing the dictionary as a prefix tree.
answered Mar 31, 2014 at 16:01
\$\endgroup\$
3
\$\begingroup\$

Replace prune_words with a filter.

  • Replace words = prune_words(words) with words = filter(valid_word, words).
  • You can now remove prune_words.

Avoid Deepcopy

There doesn't appear to be any reason to preserve char_sets, so why not subtract directly from it?

answered Mar 30, 2014 at 20:38
\$\endgroup\$
2
  • \$\begingroup\$ Thanks. Just a question regarding "filter" - why do I need to change valid_word to return None instead of False? I read the docs but it's not clear to me what they mean. \$\endgroup\$ Commented Mar 31, 2014 at 2:16
  • \$\begingroup\$ It is not necessary to change valid_word to return None. \$\endgroup\$ Commented Mar 31, 2014 at 8:35
2
\$\begingroup\$

Because you don't really care about the order of the letters in the word, you could perform some simple preprocessing in order to deal with fewer words :

>>> with open('/etc/dictionaries-common/words', 'r') as f:
... words = {w.strip().lower() for w in f}
... 
>>> words_anagram = { ''.join(sorted(w)) for w in words}
>>> len(words)
97723
>>> len(words_anagram)
90635

It might not help much but it's so simple that it cannot hurt much neither.

answered Mar 31, 2014 at 16:15
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.