This is a continuation of a previous question. I want to thank Joe Wallis for his help with increasing the readability of my code. Although the changes made by Joe Wallis did increase the speed of the code, the improvement isn't enough for my purposes.
I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases: each phrase is split into constituent corpus words in the way that maximizes the sum of their frequencies (its "frequency score").
The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):
A       56
AB      7342
ABC     3
BC      116
C       5
CD      10
BCD     502
ABCD    23
D       132
DD      6
The list of phrases, or "wordlist", looks like this (ignore the numbers):
AAB     1123
DCDD    83
The output of the script would be:
Original    Pois        Makeup                      Freq_Max_Delta
AAB         A AB        [AB, 7342][A, 56]           7398
DCDD        D C DD      [D, 132][DD, 6][C, 5]       143
In the previous question, there are some sample inputs which I am using. The biggest problem is the size of the data sets: the corpus and the wordlist each have 1M+ entries. It currently takes 1-2 seconds on average to process each word in the wordlist, which adds up to 250+ hours for the whole list.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, collections, operator, itertools
from argparse import ArgumentParser

sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)


def read_corpa(file_name):
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}


def read_words(file_name):
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]


def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None


def find_best(word, corpas):
    combos = {}
    for corpa, frequency in corpas.items():
        c = contains(corpa, word)
        if c:
            combos[word[c[0]:c[1]]] = frequency
    return combos


def create_combination(combos, word):
    if not combos:
        return None
    combo_keys = combos.keys()
    word = sorted(word)
    combinations_ = [
        j
        for i in range(len(combo_keys) + 1)
        for j in itertools.combinations(combo_keys, i)
        if sorted(''.join(j)) == word
    ]
    if not combinations_:
        return None
    result = None
    for combination in combinations_:
        sub = [(v, combos[v]) for v in combination]
        total = sum(map(operator.itemgetter(1), sub))
        if not result or total > result[2]:
            result = [combination, sub, total]
    return result


def display_combination(combination, word):
    if combination is None:
        print '\t\t'.join([word, 'Nothing Found'])
        return None
    part_final = ''.join(
        '[' + v[0] + ', ' + str(v[1]) + ']'
        for v in combination[1]
    )
    print '\t\t'.join([word, ' '.join(combination[0]), part_final, str(combination[2])])


def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        results = create_combination(combos, word)
        display_combination(results, word)


if __name__ == '__main__':
    main()
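For reference, the script is run like this (the script and data file names here are just examples):

python segment_words.py -c corpus.txt -w wordlist.txt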
- Just for clarification, it sounds like you are trying to partition the word so as to achieve the maximum sum of score/frequencies? – Zack, Mar 1, 2016 at 4:11
- Yes that's exactly what I'm doing. – thomascrha, Mar 1, 2016 at 4:25
- I think you might be hitting the limits of single-threaded Python; it might be time to move up to a "big data" solution like PySpark... – Max Flander, Mar 1, 2016 at 4:53
3 Answers
It will be much faster to use Python's in operator for your substring search rather than using your custom-built "contains" function.
In [15]: %timeit 'brown' in 'blah'*1000 + 'the quick brown fox'
100000 loops, best of 3: 2.79 μs per loop
In [16]: %timeit contains('brown','blah'*1000 + 'the quick brown fox')
1000 loops, best of 3: 870 μs per loop
Also I wonder if you could rewrite some of your custom functions as dictionary comprehensions, something like this:
for word in read_words(args.wordlist):
    combos = {k: v for k, v in corpus.items() if k in word}
    results = {'Original': word,
               'Pois': list(combos.keys()),
               'Makeup': combos.items(),
               'Freq_Max_Delta': sum(combos.values())}
    print(results)
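If you'd rather keep find_best's exact behaviour (returning only the matched corpus entries and their frequencies), a minimal version using in could look like this sketch (untested against the real data):

def find_best(word, corpas):
    # The slice word[c[0]:c[1]] that contains() returns is always the corpus
    # key itself, so the substring test alone gives the same result.
    return {corpa: frequency for corpa, frequency in corpas.items() if corpa in word}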
- Could you suggest how I could change the create_combinations method to allow the in operator? – thomascrha, Mar 1, 2016 at 23:03
- I thought this is what I did with "results" above, but I admit I don't really understand what create_combination does; I was just basing it off your example input and output. Can you give an example of the input and output of create_combination? – Max Flander, Mar 1, 2016 at 23:15
Sort your corpus
Before processing your wordlist, sort your corpus into a dictionary keyed by character: every corpus entry that contains a given character gets added to that character's dictionary entry.
[A]: A, AB, ABC, ABCD
[B]: ABC, BC, BCD, ABCD
[C]: BC, C, CD, BCD, ABCD
[D]: CD, BCD, ABCD, D, DD
etc
For each word you process, get the unique/distinct set of characters (for example, by throwing the word into a set).
When checking for matches, you will now be checking against a much smaller list of corpus entries, which should speed things up a little.
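A minimal sketch of this indexing idea (the helper names are just illustrative, and it is untested against the real data):

from collections import defaultdict

def build_index(corpas):
    # Map each character to the set of corpus entries containing it.
    index = defaultdict(set)
    for entry in corpas:
        for ch in set(entry):
            index[ch].add(entry)
    return index

def candidate_entries(word, index):
    # Only entries filed under at least one of the word's characters can match.
    found = set()
    for ch in set(word):
        found |= index.get(ch, set())
    return found

# find_best would then scan only the candidates instead of the whole corpus:
#   combos = {c: corpas[c] for c in candidate_entries(word, index) if c in word}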
This is a lot of nested for loops, and that is not Python's strong suit. You may be best off just going to C++. You can also try running your program with PyPy (it may just work, and be much faster).
As far as code modifications go:

- I don't think you're reading your file correctly, and you're doing a lot of extra work as a result. If the words are separated by tabs, read the whole file and do the split on the whole buffer instead of calling word.split('\t')[0] per line; that's a lot of overhead (and it removes the read_corpa AND read_words functions). A rough sketch follows this list.
- --strike this-- if you fix bullet 1: You may want to see if you can modify your read_corpa fn to use a regex instead of sequential string-manipulation calls. Each sequential call that modifies an immutable string stores a new string in memory; this is the same reason you use .join() when combining strings.
- !!! Do not print inside a for-loop EVER if speed counts. Build a string, or write to a file; dumping to the console is very IO intensive and slows things down tremendously. A small buffering sketch appears after the modified code below.
- Reduce function calls wherever possible; over 1M+ iterations of the loops, the cost of the calls adds up.
- Anything you loop over that lives in the global scope, such as itertools.combinations, should be brought into the function scope (once you dump all your code into one function); you need to reduce dictionary lookups. For example: ic = itertools.combinations, and so on.
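For the first bullet, a rough sketch of reading the corpus in one go (this assumes every line is strictly "word<TAB>frequency" and that words contain no whitespace):

import codecs

def read_corpa(file_name):
    # Read the whole file once and split the whole buffer,
    # instead of splitting line by line.
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        tokens = f.read().split()
    # tokens alternates word, frequency, word, frequency, ...
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}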
Some modified code:
def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    ic = itertools.combinations  # bring fn pointer into fn scope, reduces lookup
    oi = operator.itemgetter     # bring fn pointer into fn scope, reduces lookup
    so = sorted                  # bring fn pointer into fn scope, reduces lookup
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        result = None
        if combos:
            # remove a fn call, bring the create_combination code right into the loop
            combo_keys = combos.keys()
            sorted_word = so(word)
            for combination in [j for i in range(len(combo_keys) + 1)
                                for j in ic(combo_keys, i)
                                if so(''.join(j)) == sorted_word]:
                sub = [(v, combos[v]) for v in combination]
                total = sum(map(oi(1), sub))
                if not result or total > result[2]:
                    result = [combination, sub, total]
                # ------------- missing a break condition?????
        display_combination(result, word)
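To go with the print point above, here is a rough sketch of buffering the output and writing it once at the end instead of printing per word (the helper name and output file name are just examples):

import codecs

def write_results(lines, out_path):
    # lines is a list of already-formatted output strings; one write call at
    # the end replaces 1M+ print statements inside the loop.
    with codecs.open(out_path, 'w', 'UTF-8') as out:
        out.write('\n'.join(lines))
        out.write('\n')

# In main(), collect the formatted strings instead of calling display_combination,
# then after the loop:
#     write_results(lines, 'results.txt')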
- +1 because you are correct about print; if the OP changes that it will be a massive speed up. I was going to -1 because: in (1) just a .split('\t') doesn't work with the data and (2) will just consume more memory. Also, what function are you on about in (5)? – Mar 1, 2016 at 13:59
- For (5), I already reduced the function calls in my modified code sample, where it used to call create_combination; that removes the fn jump as well as the pushing of stack vars and such. With so many iterations, reducing the number of fn calls has a very tangible effect. – pyInTheSky, Mar 1, 2016 at 18:48
- Scoping the global fn calls into the fn scope is also really beneficial. I had a project that did hard looping over the PIL library's putpixel fn call, and scoping the fn into the fn scope cut seconds off my image adjustments. – pyInTheSky, Mar 1, 2016 at 18:51
- As for (2), I knew it would take more memory but thought it would execute faster; it seems this is not the case, as I've just run timeit to test. I'll adjust accordingly. – pyInTheSky, Mar 1, 2016 at 18:56
- That all makes sense; I mostly wanted to clarify (4/5). But you may wish to add the rationale for, now, (4) and (5), as they make sense now but not from just reading your answer, and comments are meant to be temporary. (If it came across that I would -1 for them, my apologies.) – Mar 1, 2016 at 19:26