This is a continuation of a previous question. I want to thank Joe Wallis for his help with increasing the readability of my code. Although the changes made by Joe Wallis did increase the speed of the code, the improvement isn't enough for my purposes.
I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases: each phrase is split into constituent corpus words in the way that maximizes the sum of their frequencies (its "frequency score").
The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):
A       56
AB      7342
ABC     3
BC      116
C       5
CD      10
BCD     502
ABCD    23
D       132
DD      6
The list of phrases, or "wordlist", looks like this (ignore the numbers):
AAB     1123
DCDD    83
The output of the script would be:
Original    Pois        Makeup                      Freq_Max_Delta
AAB         A AB        [AB, 7342][A, 56]           7398
DCDD        D C DD      [D, 132][DD, 6][C, 5]       143
In the previous question, there are some sample inputs which I am using. The biggest problem is the size of the data sets: the corpus and the wordlist each have 1M+ entries. It currently takes 1-2 seconds on average to process each word in the wordlist, which adds up to 250+ hours for the whole list.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, collections, operator, itertools
from argparse import ArgumentParser

sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)


def read_corpa(file_name):
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}


def read_words(file_name):
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]


def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None


def find_best(word, corpas):
    combos = {}
    for corpa, frequency in corpas.items():
        c = contains(corpa, word)
        if c:
            combos[word[c[0]:c[1]]] = frequency
    return combos


def create_combination(combos, word):
    if not combos:
        return None
    combo_keys = combos.keys()
    word = sorted(word)
    combinations_ = [
        j
        for i in range(len(combo_keys) + 1)
        for j in itertools.combinations(combo_keys, i)
        if sorted(''.join(j)) == word
    ]
    if not combinations_:
        return None
    result = None
    for combination in combinations_:
        sub = [(v, combos[v]) for v in combination]
        total = sum(map(operator.itemgetter(1), sub))
        if not result or total > result[2]:
            result = [combination, sub, total]
    return result


def display_combination(combination, word):
    if combination is None:
        print '\t\t'.join([word, 'Nothing Found'])
        return None
    part_final = ''.join(
        '[' + v[0] + ', ' + str(v[1]) + ']'
        for v in combination[1]
    )
    print '\t\t'.join([word, ' '.join(combination[0]), part_final, str(combination[2])])


def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        results = create_combination(combos, word)
        display_combination(results, word)


if __name__ == '__main__':
    main()
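For reference, the script is run like this (the script and data file names here are just examples):

python segment_words.py -c corpus.txt -w wordlist.txt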
- Just for clarification, it sounds like you are trying to partition the word so as to achieve the maximum sum of score/frequencies? – Zack, Mar 1, 2016 at 4:11
- Yes that's exactly what I'm doing. – thomascrha, Mar 1, 2016 at 4:25
- I think you might be hitting the limits of single-threaded Python; it might be time to move up to a "big data" solution like PySpark... – Max Flander, Mar 1, 2016 at 4:53
3 Answers
It will be much faster to use Python's in operator for your substring search rather than using your custom-built "contains" function.
In [15]: %timeit 'brown' in 'blah'*1000 + 'the quick brown fox'
100000 loops, best of 3: 2.79 μs per loop
In [16]: %timeit contains('brown','blah'*1000 + 'the quick brown fox')
1000 loops, best of 3: 870 μs per loop
Also I wonder if you could rewrite some of your custom functions as dictionary comprehensions, something like this:
for word in read_words(args.wordlist):
    combos = {k: v for k, v in corpus.items() if k in word}
    results = {'Original': word,
               'Pois': list(combos.keys()),
               'Makeup': combos.items(),
               'Freq_Max_Delta': sum(combos.values())}
    print(results)
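If you'd rather keep find_best's exact behaviour (returning only the matched corpus entries and their frequencies), a minimal version using in could look like this sketch (untested against the real data):

def find_best(word, corpas):
    # The slice word[c[0]:c[1]] that contains() returns is always the corpus
    # key itself, so the substring test alone gives the same result.
    return {corpa: frequency for corpa, frequency in corpas.items() if corpa in word}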
- Could you suggest how I could change the create_combinations method to allow the in operator? – thomascrha, Mar 1, 2016 at 23:03
- I thought this is what I did with "results" above, but I admit I don't really understand what create_combination does; I was just basing it off your example input and output. Can you give an example of the input and output of create_combination? – Max Flander, Mar 1, 2016 at 23:15
Sort your corpus
Before processing your wordlist, sort your corpus into a dictionary keyed by character: every corpus entry that contains a given character gets added to that character's dictionary entry.
[A]: A, AB, ABC, ABCD
[B]: ABC, BC, BCD, ABCD
[C]: BC, C, CD, BCD, ABCD
[D]: CD, BCD, ABCD, D, DD
etc
For each word you process, get the unique/distinct set of characters (for example, by throwing the word into a set).
When checking for matches, you will now be checking against a much smaller list of corpus entries, which should speed things up a little.
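A minimal sketch of this indexing idea (the helper names are just illustrative, and it is untested against the real data):

from collections import defaultdict

def build_index(corpas):
    # Map each character to the set of corpus entries containing it.
    index = defaultdict(set)
    for entry in corpas:
        for ch in set(entry):
            index[ch].add(entry)
    return index

def candidate_entries(word, index):
    # Only entries filed under at least one of the word's characters can match.
    found = set()
    for ch in set(word):
        found |= index.get(ch, set())
    return found

# find_best would then scan only the candidates instead of the whole corpus:
#   combos = {c: corpas[c] for c in candidate_entries(word, index) if c in word}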
This is a lot of nested for loops, and that is not Python's strong suit. You may be best off just going to C++. You can also try running your program with PyPy (it may just work, and be much faster).
As far as code modifications go:

- I don't think you're reading your file correctly, and you're doing a lot of extra work as a result. If the words are separated by tabs, read the whole file and do the split on the whole buffer instead of calling word.split('\t')[0] per line; that's a lot of overhead (and it removes the read_corpa AND read_words functions). A rough sketch follows this list.
- --strike this-- if you fix bullet 1: You may want to see if you can modify your read_corpa fn to use a regex instead of sequential string-manipulation calls. Each sequential call that modifies an immutable string stores a new string in memory; this is the same reason you use .join() when combining strings.
- !!! Do not print inside a for-loop EVER if speed counts. Build a string, or write to a file; dumping to the console is very IO intensive and slows things down tremendously. A small buffering sketch appears after the modified code below.
- Reduce function calls wherever possible; over 1M+ iterations of the loops, the cost of the calls adds up.
- Anything you loop over that lives in the global scope, such as itertools.combinations, should be brought into the function scope (once you dump all your code into one function); you need to reduce dictionary lookups. For example: ic = itertools.combinations, and so on.
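For the first bullet, a rough sketch of reading the corpus in one go (this assumes every line is strictly "word<TAB>frequency" and that words contain no whitespace):

import codecs

def read_corpa(file_name):
    # Read the whole file once and split the whole buffer,
    # instead of splitting line by line.
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        tokens = f.read().split()
    # tokens alternates word, frequency, word, frequency, ...
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}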
Some modified code:
def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    ic = itertools.combinations  # bring fn pointer into fn scope, reduces lookup
    oi = operator.itemgetter     # bring fn pointer into fn scope, reduces lookup
    so = sorted                  # bring fn pointer into fn scope, reduces lookup
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        result = None
        if combos:
            # remove a fn call, bring the create_combination code right into the loop
            combo_keys = combos.keys()
            sorted_word = so(word)
            for combination in [j for i in range(len(combo_keys) + 1)
                                for j in ic(combo_keys, i)
                                if so(''.join(j)) == sorted_word]:
                sub = [(v, combos[v]) for v in combination]
                total = sum(map(oi(1), sub))
                if not result or total > result[2]:
                    result = [combination, sub, total]
                # ------------- missing a break condition?????
        display_combination(result, word)
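To go with the print point above, here is a rough sketch of buffering the output and writing it once at the end instead of printing per word (the helper name and output file name are just examples):

import codecs

def write_results(lines, out_path):
    # lines is a list of already-formatted output strings; one write call at
    # the end replaces 1M+ print statements inside the loop.
    with codecs.open(out_path, 'w', 'UTF-8') as out:
        out.write('\n'.join(lines))
        out.write('\n')

# In main(), collect the formatted strings instead of calling display_combination,
# then after the loop:
#     write_results(lines, 'results.txt')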
- +1 because you are correct about print; if the OP changes that it will be a massive speed up. I was going to -1 because: in (1) just a .split('\t') doesn't work with the data and (2) will just consume more memory. Also, what function are you on about in (5)? – Mar 1, 2016 at 13:59
- For (5), I already reduced the function calls in my modified code sample, where it used to call create_combination; that removes the fn jump as well as the pushing of stack vars and such. With so many iterations, reducing the number of fn calls has a very tangible effect. – pyInTheSky, Mar 1, 2016 at 18:48
- Scoping the global fn calls into the fn scope is also really beneficial. I had a project that did hard looping over the PIL library's putpixel fn call, and scoping the fn into the fn scope cut seconds off my image adjustments. – pyInTheSky, Mar 1, 2016 at 18:51
- As for (2), I knew it would take more memory but thought it would execute faster; it seems this is not the case, as I've just run timeit to test. I'll adjust accordingly. – pyInTheSky, Mar 1, 2016 at 18:56
- That all makes sense; I mostly wanted to clarify (4/5). But you may wish to add the rationale for, now, (4) and (5), as they make sense now but not from just reading your answer, and comments are meant to be temporary. (If it came across that I would -1 for them, my apologies.) – Mar 1, 2016 at 19:26