Korean word segmentation using frequency heuristic
This is a continuation of a previous question. I want to thank Joe Wallis for his help with increasing the readability of my code. Although the changes made by Joe Wallis did increase the speed of the code, the speed improvements aren't enough for my purposes.
I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases, such that each phrase is split into constituent words in a way that maximizes its frequency score.
The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):
A 56
AB 7342
ABC 3
BC 116
C 5
CD 10
BCD 502
ABCD 23
D 132
DD 6
The list of phrases, or "wordlist", looks like this (ignore the numbers):
AAB 1123
DCDD 83
The output of the script would be:
Original Pois Makeup Freq_Max_Delta
AAB A AB [AB, 7342][A, 56] 7398
DCDD D C DD [D, 132][DD, 6][C, 5] 143
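To make the scoring concrete, here's a minimal standalone restatement of the selection logic on the toy data above (the names best_split and corpus are mine, just for illustration; the actual script follows below): every corpus entry that occurs as a substring of the phrase is a candidate piece, and the winning split is the subset of pieces whose combined letters rearrange into the phrase with the highest total frequency.

# Simplified restatement of the scoring on the toy corpus above,
# not the real script (which follows below).
import itertools

corpus = {'A': 56, 'AB': 7342, 'ABC': 3, 'BC': 116, 'C': 5,
          'CD': 10, 'BCD': 502, 'ABCD': 23, 'D': 132, 'DD': 6}

def best_split(phrase):
    # Candidate pieces: corpus entries occurring as substrings of the phrase.
    pieces = {w: f for w, f in corpus.items() if w in phrase}
    target = sorted(phrase)
    best = None
    for n in range(1, len(pieces) + 1):
        for combo in itertools.combinations(pieces, n):
            # Keep only combinations whose letters rearrange into the phrase.
            if sorted(''.join(combo)) == target:
                total = sum(pieces[w] for w in combo)
                if best is None or total > best[1]:
                    best = (combo, total)
    return best

print(best_split('AAB'))   # (('A', 'AB'), 7398)
print(best_split('DCDD'))  # (('C', 'D', 'DD'), 143), piece order may vary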
In the previous question, there are some sample inputs which I am using. If you want the full data set, please PM me and I can send it to you. The biggest problem is the size of the data sets. The corpus and wordlist have 1M+ entries in each file. It's currently taking on average 1-2 seconds to process each word in the wordlist, which in total will take 250+ hours to process (a million words at one second each is already about 278 hours). Below is the code.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, operator, itertools
from argparse import ArgumentParser

# Wrap stdout/stderr so the Korean (UTF-8) output prints correctly.
sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)


def read_corpa(file_name):
    # Read the tab-separated corpus into a dict of {word: frequency}.
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}


def read_words(file_name):
    # Yield each phrase from the wordlist, ignoring its trailing number.
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]


def contains(small, big):
    # Return the (start, end) slice of `small` inside `big`, or None.
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None


def find_best(word, corpas):
    # Collect every corpus entry that occurs as a substring of `word`.
    combos = {}
    for corpa, frequency in corpas.items():
        c = contains(corpa, word)
        if c:
            combos[word[c[0]:c[1]]] = frequency
    return combos


def create_combination(combos, word):
    # Among all combinations of matched substrings whose letters rearrange
    # into `word`, pick the one with the highest total frequency.
    if not combos:
        return None
    combo_keys = combos.keys()
    word = sorted(word)
    combinations_ = [
        j
        for i in range(len(combo_keys) + 1)
        for j in itertools.combinations(combo_keys, i)
        if sorted(''.join(j)) == word
    ]
    if not combinations_:
        return None
    result = None
    for combination in combinations_:
        sub = [(v, combos[v]) for v in combination]
        total = sum(map(operator.itemgetter(1), sub))
        if not result or total > result[2]:
            result = [combination, sub, total]
    return result


def display_combination(combination, word):
    # Print one result row: Original, Pois, Makeup, Freq_Max_Delta.
    if combination is None:
        print '\t\t'.join([word, 'Nothing Found'])
        return None
    part_final = ''.join(
        '[' + v[0] + ', ' + str(v[1]) + ']'
        for v in combination[1]
    )
    print '\t\t'.join([word, ' '.join(combination[0]), part_final, str(combination[2])])


def main():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("-w", "--wordlist", help="", required=True)
    parser.add_argument("-c", "--corpa", help="", required=True)
    args = parser.parse_args()
    corpas = read_corpa(args.corpa)
    for word in read_words(args.wordlist):
        combos = find_best(word, corpas)
        results = create_combination(combos, word)
        display_combination(results, word)


if __name__ == '__main__':
    main()
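For reference, both files are passed on the command line (-c takes the corpus file, -w the wordlist file). I suspect nearly all of the per-word time is spent in find_best, since it scans all 1M+ corpus entries for every single word; a rough way to confirm that is to time it in isolation (a sketch only: segment is a placeholder module name for the script above, and corpa.txt is a placeholder path).

import time
import segment  # placeholder name for the script above

# Load the 1M+ entry corpus once, then time find_best on a single word.
corpas = segment.read_corpa('corpa.txt')  # placeholder path
start = time.time()
segment.find_best(u'AAB', corpas)
print(time.time() - start)  # roughly the 1-2 s per word I'm seeing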