Korean word segmentation using frequency heuristic
This is a continuation of a previous question. I want to thank Joe Wallis for his help with increasing the readability of my code. Although the changes made by Joe Wallis did increase the speed of the code, the speed improvements aren't enough for my purposes.
I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases, such that each phrase is split into constituent words in a way that maximizes its frequency score.
The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):
A 56
AB 7342
ABC 3
BC 116
C 5
CD 10
BCD 502
ABCD 23
D 132
DD 6
The list of phrases, or "wordlist", looks like this (ignore the numbers):
AAB 1123
DCDD 83
The output of the script would be:
Original Pois Makeup Freq_Max_Delta
AAB A AB [AB, 7342][A, 56] 7398
DCDD D C DD [D, 132][DD, 6][C, 5] 143
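To make the scoring concrete, here's a minimal standalone restatement of the selection logic on the toy data above (the names best_split and corpus are mine, just for illustration; the actual script follows below): every corpus entry that occurs as a substring of the phrase is a candidate piece, and the winning split is the subset of pieces whose combined letters rearrange into the phrase with the highest total frequency.

# Simplified restatement of the scoring on the toy corpus above,
# not the real script (which follows below).
import itertools

corpus = {'A': 56, 'AB': 7342, 'ABC': 3, 'BC': 116, 'C': 5,
          'CD': 10, 'BCD': 502, 'ABCD': 23, 'D': 132, 'DD': 6}

def best_split(phrase):
    # Candidate pieces: corpus entries occurring as substrings of the phrase.
    pieces = {w: f for w, f in corpus.items() if w in phrase}
    target = sorted(phrase)
    best = None
    for n in range(1, len(pieces) + 1):
        for combo in itertools.combinations(pieces, n):
            # Keep only combinations whose letters rearrange into the phrase.
            if sorted(''.join(combo)) == target:
                total = sum(pieces[w] for w in combo)
                if best is None or total > best[1]:
                    best = (combo, total)
    return best

print(best_split('AAB'))   # (('A', 'AB'), 7398)
print(best_split('DCDD'))  # (('C', 'D', 'DD'), 143), piece order may vary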
In the previous question, there are some sample inputs which I am using. If you want the full data set, please PM me and I can send it to you. The biggest problem is the size of the data sets. The corpus and wordlist have 1M+ entries in each file. It's currently taking on average 1-2 seconds to process each word in the wordlist, which in total will take 250+ hours to process (a million words at one second each is already about 278 hours). Below is the code.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, operator, itertools
from argparse import ArgumentParser

# Wrap stdout/stderr so the Korean (UTF-8) output prints correctly.
sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)


def read_corpa(file_name):
    # Read the tab-separated corpus into a dict of {word: frequency}.
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}


def read_words(file_name):
    # Yield each phrase from the wordlist, ignoring its trailing number.
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]


def contains(small, big):
    # Return the (start, end) slice of `small` inside `big`, or None.
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None


def find_best(word, corpas):
    # Collect every corpus entry that occurs as a substring of `word`.
    combos = {}
    for corpa, frequency in corpas.items():
        c = contains(corpa, word)
        if c:
            combos[word[c[0]:c[1]]] = frequency
    return combos


def create_combination(combos, word):
    # Among all combinations of matched substrings whose letters rearrange
    # into `word`, pick the one with the highest total frequency.
    if not combos:
        return None
    combo_keys = combos.keys()
    word = sorted(word)
    combinations_ = [
        j
        for i in range(len(combo_keys) + 1)
        for j in itertools.combinations(combo_keys, i)
        if sorted(''.join(j)) == word
    ]
    if not combinations_:
        return None
    result = None
    for combination in combinations_:
        sub = [(v, combos[v]) for v in combination]
        total = sum(map(operator.itemgetter(1), sub))
        if not result or total > result[2]:
            result = [combination, sub, total]
    return result


def display_combination(combination, word):
    # Print one result row: Original, Pois, Makeup, Freq_Max_Delta.
    if combination is None:
        print '\t\t'.join([word, 'Nothing Found'])
        return None
    part_final = ''.join(
        '[' + v[0] + ', ' + str(v[1]) + ']'
        for v in combination[1]
    )
    print '\t\t'.join([word, ' '.join(combination[0]), part_final, str(combination[2])])


def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        results = create_combination(combos, word)
        display_combination(results, word)


if __name__ == '__main__':
    main()
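For reference, both files are passed on the command line (-c takes the corpus file, -w the wordlist file). I suspect nearly all of the per-word time is spent in find_best, since it scans all 1M+ corpus entries for every single word; a rough way to confirm that is to time it in isolation (a sketch only: segment is a placeholder module name for the script above, and corpa.txt is a placeholder path).

import time
import segment  # placeholder name for the script above

# Load the 1M+ entry corpus once, then time find_best on a single word.
corpas = segment.read_corpa('corpa.txt')  # placeholder path
start = time.time()
segment.find_best(u'AAB', corpas)
print(time.time() - start)  # roughly the 1-2 s per word I'm seeing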