
Project Euler Problem 22 asks the following:

Using names.txt (right click and 'Save Link/Target As...'), a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49714.

What is the total of all the name scores in the file?

My solution is as follows:

import re
import string

def create_names_list(file_name):
    # create a sorted name list by parsing the quoted names out of the file
    names = open(file_name).read()
    names = re.findall(r'"(.*?)"', names)
    names.sort()
    return names

def name_weights(file_name):
    # fetch the sorted name list
    names = create_names_list(file_name)
    # create a letters dictionary e.g. {'A': 1, 'B': 2, ..., 'Z': 26}
    letters = string.ascii_uppercase
    letters_map_reversed = dict(enumerate(letters))
    letters_map = {value: key + 1 for key, value in letters_map_reversed.iteritems()}
    # sum, for every character of every name, letter value * 1-based position
    return sum(letters_map[char]*(names.index(name)+1) for name in names for char in name)

%timeit name_weights('names.txt')

1 loops, best of 3: 1.18 s per loop

I noticed in the thread for the solutions that many Python solutions appeared to be in the 8-12 ms range, roughly 100 times faster than mine, so I was concerned about the performance of my solution.

To tweak my solution, I changed the letters_map[char] portion to just ord(char) - 64, which I noticed a few people were using. It seems like a clever, certainly shorter, way of getting the number value that I hadn't thought of. It didn't alter performance, though, which makes me think my problem is with the final expression, which uses nested for loops to multiply every character of every word by the letter value and the name's position. Other solutions made similar use of nested for loops, but I noticed that nobody had used index() to get the numeric position of the name in the list. I wonder if that is what is hurting performance, though I suspect it may also just be my misunderstanding of how to structure these nested for loops optimally (for instance, I could have used enumerate() on names, in addition to on letters, to get each name's index, which would forgo the need for index() entirely).
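To make that concrete, the variant I have in mind would look roughly like this (an untested sketch reusing my create_names_list() from above; the function name is just for illustration):

def name_weights_enumerate(file_name):
    # sketch: enumerate() supplies each name's 1-based position directly,
    # and ord(char) - 64 maps 'A'..'Z' to 1..26, so index() is never called
    names = create_names_list(file_name)
    return sum((ord(char) - 64) * idx
               for idx, name in enumerate(names, start=1)
               for char in name)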

Aside: I've intentionally used regular expressions here instead of .strip() and .replace() just to get a feel for what regular expressions are and how they work (and perhaps because there are non-zero gains to be had, since re is implemented in C, right?). The file parsing itself doesn't really affect performance as far as I've noticed, but any input on standard Python convention here, and on when to use strip() and the built-in string methods versus Python's re module, is very welcome.
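For comparison, the string-methods version would presumably look something like this (a sketch; it assumes the file is a single line of comma-separated, double-quoted names, which is what names.txt looks like):

def create_names_list_no_re(file_name):
    # sketch: same parsing without re -- strip the quotes, split on commas
    names = open(file_name).read().replace('"', '').split(',')
    names.sort()
    return names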

General code criticism, as well as comments on specific syntax, is very welcome.

asked Oct 2, 2015 at 19:13

1 Answer


Here's your problem:

return sum(letters_map[char]*(names.index(name)+1) for name in names for char in name)
                              ^^^^^^^^^^^^^^^^^

So as we walk down names, we calculate the value of the current name... and then start over and search the list from scratch to determine the index. That extra lookup turns our algorithm from \$O(n)\$ to \$O(n^2)\$!! And string search is expensive - for the 1000th name, we first have to check 999 other strings that are definitely going to be wrong. We don't need to search for name in names to find the index. Not only that, but we then re-run the search for every char in name - Python does not cache the result, so the generator expression calls names.index(name) once per character. So really this is more like \$O(n^2 \cdot k)\$, where \$k\$ is the average name length...

We can just keep track of the index as we go. That's what enumerate() is for:

return sum(letters_map[char] * idx
           for idx, name in enumerate(names, start=1)
           for char in name)

That one change takes your runtime over 10 iterations from 13.31s to 0.07s. Behold the power of not redoing a bunch of work.


Everything else is basically fine. Creating letters_map could be abridged to just:

letters_map = {letter: idx
               for idx, letter in enumerate(string.ascii_uppercase, start=1)}

That doesn't affect performance; it just makes it a little easier to see what letters_map actually is. It turns out that the dictionary lookup is faster than looking up and calling ord() inside the loop, so this is pretty speedy.
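If you want to verify that on your own machine, a micro-benchmark along these lines would do it (a sketch using the standard timeit module; the letter 'Q' is an arbitrary choice):

import timeit

setup = '''
import string
letters_map = {letter: idx
               for idx, letter in enumerate(string.ascii_uppercase, start=1)}
'''
# compare a plain dict lookup against recomputing the value via ord()
print(timeit.timeit('letters_map["Q"]', setup=setup))
print(timeit.timeit('ord("Q") - 64', setup=setup))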

It's also typical to do file operations with a context manager. That is:

with open(file_name) as f:
 return sorted(re.findall(r'"(.*?)"', f.read()))

That is marginally slower, but more Pythonic in my opinion.
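Putting all of the above together, the revised solution would look something like this (a sketch combining the suggestions; same behavior as the original, minus the redundant work):

import re
import string

def create_names_list(file_name):
    # parse the double-quoted names out of the file and sort them
    with open(file_name) as f:
        return sorted(re.findall(r'"(.*?)"', f.read()))

def name_weights(file_name):
    names = create_names_list(file_name)
    letters_map = {letter: idx
                   for idx, letter in enumerate(string.ascii_uppercase, start=1)}
    # enumerate() tracks each name's 1-based position, so no index() calls
    return sum(letters_map[char] * idx
               for idx, name in enumerate(names, start=1)
               for char in name)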

answered Oct 2, 2015 at 19:27
  • Awesome, thanks Barry! I suspected as much, but that's really helpful. So effectively, searching for an item in a list is expensive and should be avoided when we can enumerate over something sorted. If we needed the index of an item in a list that could not be sorted, what would be the best way of doing that? I imagine the answer is mostly "it depends", but I wonder if there's a place to start, like index() here. Commented Oct 2, 2015 at 19:32
  • @mburke05 Good rule of thumb for performance optimization: don't do things you don't have to do. Commented Oct 2, 2015 at 19:37
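As a starting point for the follow-up question above: if repeated position lookups really are needed in a list that cannot be re-sorted, a common approach is to build a position map once with enumerate() rather than calling index() in a loop. A sketch (positions is a hypothetical name):

# sketch: one O(n) pass builds the map, then each lookup is O(1),
# instead of an O(n) list.index() scan per lookup
names = create_names_list('names.txt')  # any list works here
positions = {name: idx for idx, name in enumerate(names)}
colin_idx = positions['COLIN']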
