Here is the Project Euler problem referenced; it says:
The \$n^{th}\$ term of the sequence of triangle numbers is given by \$t_n = \frac{1}{2}n(n+1)\$; so the first ten triangle numbers are:
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
By converting each letter in a word to a number corresponding to its alphabetical position and adding these values we form a word value. For example, the word value for SKY is \$19 + 11 + 25 = 55 = t_{10}\$. If the word value is a triangle number then we shall call the word a triangle word.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, how many are triangle words?
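As a quick sanity check on the example, here is a minimal sketch (independent of the word list) confirming the SKY value:
# verify the SKY example: word value 55 should equal the 10th triangle number
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
sky_value = sum(letters.index(char) + 1 for char in 'SKY')  # 19 + 11 + 25 = 55
tenth_triangle = 10 * (10 + 1) // 2                         # t_10 = 55
assert sky_value == tenth_triangle == 55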
My solution is as follows:
import re
import string

# generate a key to map the letters in our strings to their alphabetical positions
letters = string.ascii_uppercase
letters_map = {letter: idx for idx, letter in enumerate(string.ascii_uppercase, start=1)}

# open the file and parse on quoted values
with open('words.txt') as f:
    words = sorted(re.findall(r'"(.*?)"', f.read()))

# max word score can only be 26 * len(longest word), recall 'Z': 26 in letters_map
max_word_score = len(max(words, key=len))*26

# generate a list of triangle_numbers to search for, using max_word_score as upper bound
n = 1
triangle_numbers = []
while 1/2.0*n*(n+1) <= max_word_score:
    triangle_numbers.append(1/2.0*n*(n+1))
    n += 1

# finally use a nested for loop to count how many word values are triangle numbers
count = 0
for word in words:
    if sum(letters_map[char] for char in word) in triangle_numbers:
        count += 1
%%timeit
100 loops, best of 3: 3.67 ms per loop
I'm curious whether there are any ways I can rewrite this with similar or better performance so that it is more Pythonic. More specifically:
1) Is it possible to rewrite this better using itertools.ifilter and itertools.imap with letters_map, with the filter being the list of triangle_numbers, to greater effect than simply using a nested for loop and an if statement to filter?
2) Is my while construction extremely un-Pythonic? It felt awkward writing it the way I did, and I wasn't sure of a better way to count up to some arbitrary ceiling (in this case max_word_score).
3) Should I be getting in the habit of wrapping these items in mini-functions in order to make them more readable/replicable? Would that slow the runtime down? What sort of effect would that have in general?
4) Any other pythonizations I can use to better effect?
- N3buchadnezzar (Oct 14, 2015 at 13:06): I think your while loop can be improved like this: max_triangle = 0.5 * max_word_score * (max_word_score + 1); triangle_numbers = [0.5*k*(k + 1) for k in range(max_triangle)]
2 Answers
Unnecessary floating point
Triangular numbers are always integers. n(n+1) is always even because either n or n+1 is. Thus you don't need to do floating point math here:
n = 1
triangle_numbers = []
while 1/2.0*n*(n+1) <= max_word_score:
    triangle_numbers.append(1/2.0*n*(n+1))
    n += 1
You can simply write n*(n+1)/2. This drops your running time from 449ms (on 100 iterations) to 423ms.
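Concretely, the loop could look something like this (a sketch; // floor division is used so the result stays an integer on both Python 2 and 3):
n = 1
triangle_numbers = []
while n * (n + 1) // 2 <= max_word_score:
    triangle_numbers.append(n * (n + 1) // 2)
    n += 1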
Unnecessary sorting
You assign words = sorted(...). Nowhere in your code do you need it to be sorted, so that's an easy thing to drop. That gets us down to 405ms.
Checking for triangular numbers
You're building up a list of triangular numbers up to the maximum possible. Logically, this is valid. However, we can do better: a set has much better performance for membership checks than a list does. Just making that change (declaring triangle_numbers as a set() and using add()) gets us down to 321ms.
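For example, a sketch of that change, building on the integer version above:
n = 1
triangle_numbers = set()
while n * (n + 1) // 2 <= max_word_score:
    triangle_numbers.add(n * (n + 1) // 2)
    n += 1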
Generator Expressions
Generator expressions are actually slower than loops (fact), even if they're far easier to read (in my opinion). So when you really care about cycles, you'll want to rewrite your code as:
count = 0
for word in words:
    value = 0
    for char in word:
        value += letters_map[char]
    if value in triangle_numbers:
        count += 1
That gets us down to 222ms.
File Reading
Use csv.reader to read in the words instead of re:
import csv

words = []
with open('euler42.txt') as f:
    for line in csv.reader(f, delimiter=',', quotechar='"'):
        words.extend(line)
Just using the more specific tool for the job. 172ms.
- mburke05 (Oct 14, 2015 at 14:21): Great catches Barry, thanks! I always look forward to reading your responses.
- mburke05 (Oct 14, 2015 at 14:24): Also, and this is I guess a pretty in-depth question, but is there any literature/good reading on why loops are faster than generator expressions? And what does the difference between things like xrange() vs. range() yield (no pun intended, ha) in performance, and when would you use one over the other?
- Barry (Oct 14, 2015 at 14:53): @mburke05 In Python 2.7, range() gives you a list, e.g. range(10) returns [0, 1, 2, ..., 9]. xrange() is a generator that gives you each successive number on the fly, so it's far more efficient since it doesn't have to construct a list up front. In Python 3, they fixed range() to do the smart thing.
- Barry (Oct 14, 2015 at 14:54): @mburke05 Not sure of a good resource.
You should use csv to read the file.
import csv

with open('words.txt', 'rb') as csvfile:
    words = [
        word
        for line in csv.reader(csvfile, delimiter=',', quotechar='"')
        for word in line
    ]
It's much easier to understand what it is doing. And it's probably safer and faster.
You define letters, however you use string.ascii_uppercase instead.
You can put max_word_score in the same 'block' as the while loop below it. You can also rearrange 1/2.0*n*(n+1) <= max_word_score to get the max n.
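For reference, one way that rearrangement could go (a sketch; the code below instead uses a deliberately loose over-estimate): solving \$\frac{1}{2}n(n+1) \le M\$ for \$n\$ gives \$n \le \frac{\sqrt{8M + 1} - 1}{2}\$, so the bound could be computed as
max_n = int(((8 * max_word_score + 1) ** 0.5 - 1) / 2)
and xrange(1, max_n + 1) would then generate every needed n.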
Python also has a style guide, PEP 8, which says you should have a space on either side of operators: 1 + 1, not 1+1. The only exception to this is when you want to show precedence, e.g. 1 + 2*3.
You can change your for word in words loop to one large comprehension and use sum, for better readability.
max_word_score = len(max(words, key=len)) * 26
triangle_numbers = [
    0.5 * n * (n + 1)
    for n in xrange(int((2 * max_word_score + max_word_score)**0.5))
]

count = sum(
    1
    for word in words
    if sum(letters_map[char] for char in word) in triangle_numbers
)
If you also assume that you are not going to get the longest English word, you can change max_word_score = len(max(words, key=len)) * 26 to max_word_score = 1909 * 26, which will allow the second-longest English word. And as they are 'common English words', you could change max_word_score to 27 * 26, which will allow the 9th-longest English word. And if you do use max_word_score = 30 * 26, then you can reduce the memory usage.
You should also change letters_map, max_word_score and triangle_numbers to uppercase variants, e.g. LETTERS_MAP, since they are constants.
If I were to do all the above then I would use the following program:
import csv
import string

MAX_TRIANGLE = 30 * 26
LETTERS_MAP = {letter: idx for idx, letter in enumerate(string.ascii_uppercase, start=1)}
TRIANGLE_NUMBERS = [
    0.5 * n * (n + 1)
    for n in xrange(int((2 * MAX_TRIANGLE) ** 0.5))
]

with open('words.txt', 'rb') as csvfile:
    count = sum(
        1
        for line in csv.reader(csvfile, delimiter=',', quotechar='"')
        for word in line
        if sum(LETTERS_MAP[char] for char in word) in TRIANGLE_NUMBERS
    )
If you want it to be faster than your original, then you can change MAX_TRIANGLE = 30 * 26 to MAX_TRIANGLE = 15 * 26, but that's a bit dodgy IMO. Either way, this uses less memory (unless csv loads everything into memory) and is more readable.
To answer your questions:
1) I would only encourage itertools if you use generators, which you don't (well, you have one that is fed into sum...); see the sketch after this list.
2) It's not 'Pythonic' because the maths you use isn't. However, personally I think a greedy for is better (hence why my maths is an over-estimate).
3) If you wrapped them in a main, that would be good. Wrapping them into functions is always better than having no functions. But as this can be done with 3 constants and 1 generator, I would say one function is fine.
4) I would say following PEP 8 is your best bet for that.
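For illustration only, a rough sketch (Python 2; it assumes words, LETTERS_MAP and TRIANGLE_NUMBERS as defined earlier) of the itertools.imap/itertools.ifilter version the question asks about; it isn't meaningfully clearer or faster than the plain loop:
from itertools import ifilter, imap

# map each word to its value, then keep only the triangle-valued ones and count them
word_values = imap(lambda word: sum(LETTERS_MAP[char] for char in word), words)
count = sum(1 for _ in ifilter(TRIANGLE_NUMBERS.__contains__, word_values))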