I am writing a script that takes in many sorted .dat files. I have included sample data but the dataset is quite large. The desired outcome is to have one file with an alphabetically sorted list of words.
I want to know whether I am missing anything when dealing with large numbers of words: mainly memory usage, catching errors, logging, and tests, while still continuing to iterate over the files without interruption.
Multiplying the sample data by 100,000 lets me sort 11,000,000 or so lines with no problem. With the larger sets I want to sort them too, but without crashing.
Does Python's sort() work well for this operation, or should I be looking at anything else that could be faster or more efficient?
Is it worth employing the multiprocessing module to help handle these tasks? If so, what is the best implementation? While researching, I found this article, which, whilst dealing with large files, might describe a similar process to sorting the large list at the end of the function.
The 'cost' of these operations is very important for the task.
dat1 ='allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks',
'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks'
dat2 = 'abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside', 'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya', 'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow', 'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands', 'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam', 'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus', 'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce', 'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama', 'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo', 'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest', 'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros', 'unmistakably', 'unrepresentative', 'waviest'
dat3 = 'administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke', 'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers', 'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier', 'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying', 'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest', 'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel', 'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions', 'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic', 'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward', 'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder', 'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted', 'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving', 'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating', 'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining', 'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness'
# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()
import datetime
from functools import wraps
from time import time
datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()
# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3 # * 100000
results = []
log = () #TUPLE
# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper
@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)
                # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)
        # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)

merge_sort_lists(combined_dat)
# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len

com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert res_len == com_length
- Timsort is remarkably fast for merge operations. – aghast, Feb 14, 2018 at 3:43
- Is there likely redundancy within each single file? Would a unique-ify stage make sense? – aghast, Feb 14, 2018 at 3:44
- dat1, dat2, dat3 are typical file sizes.. my main issue will be processing the sort. Excuse me for not understanding.. but what is a unique-ify stage? – johnashu, Feb 14, 2018 at 3:50
- I was wondering if the individual files might contain repeats of the same word: dog, dog, dog, dog, elephant, elephant, elephant. – aghast, Feb 14, 2018 at 3:53
- @AustinHastings I just checked and timsort is the algorithm used in python sort() – johnashu, Feb 14, 2018 at 3:58
1 Answer
It occurs to me that your problem description doesn't match your code.
You state the problem as "takes in many sorted .dat files" and needing to "have one file with an alphabetically sorted list of words".
This is a merge, not even a merge sort. You appear to be trying to process all of the data in memory, which is also not necessary. However, Timsort is scary-fast, so it may be that loading everything into memory, sorting it, and unloading it is the fastest option. If the total data size isn't > 1 GB, or > available RAM, whichever is less, this is likely true.
Option 1: Timsort FTW!
def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # NB: in-place sort! Don't call sorted(words)!
    with open(outfile, 'w') as out:
        out.writelines(words)  # map(out.write, words) would be lazy in Python 3 and write nothing
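Usage is then just one call (the filenames here are placeholders for whatever your actual .dat files are named):

merge_files('combined_alphabetical.dat', 'dat1.dat', 'dat2.dat', 'dat3.dat')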
Option 2: ZOMG! How much data do you have??!!?!!
def merge(*iters):
    if len(iters) == 1:
        yield from iters[0]  # a plain 'return iters[0]' would end the generator without yielding
        return
    iterlist = []
    values = []
    # Initialize values[] and also convert iters tuple to list (for del)
    for i, it in enumerate(iters):
        try:
            values.append(next(it))
            iterlist.append(it)  # Order matters: next() might throw.
        except StopIteration:
            pass
    iters = iterlist
    while values:
        nextvalue = min(values)
        yield nextvalue
        try:
            i = values.index(nextvalue)
            values[i] = next(iters[i])
        except StopIteration:
            del values[i], iters[i]

def merge_files(outfile, *filenames):
    iters = [iter(open(fn)) for fn in filenames]
    with open(outfile, 'w') as out:
        out.writelines(merge(*iters))  # unpack the list; map(out.write, ...) is lazy in Python 3
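This keeps only one pending line per input file in memory, rather than the whole dataset. You can sanity-check merge() on its own with toy iterators (made-up data, not your files):

print(list(merge(iter(['a', 'c']), iter(['b', 'd']))))  # ['a', 'b', 'c', 'd']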
This looks like it might actually make a nice function for itertools. It just needs a key= option, I guess.
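For what it's worth, the standard library already ships this as heapq.merge, which lazily merges any number of sorted iterables and has accepted a key= argument since Python 3.5. A minimal sketch of merge_files built on it, assuming the same calling convention as above:

import heapq

def merge_files(outfile, *filenames):
    files = [open(fn) for fn in filenames]
    try:
        with open(outfile, 'w') as out:
            # heapq.merge yields lines in sorted order, one at a time
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()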
- Interesting, I did not know that the try block would leak variables defined inside. Not only to the except block (where you are able to access i for the del), but also outside of it... – Graipher, Feb 14, 2018 at 7:08
- For merging multiple sorted iterables, Python has heapq.merge. You might also want to look at this answer. – Gareth Rees, Feb 14, 2018 at 9:14
- @GarethRees Good to know. I had flagged to myself that min() was likely a time sink, and perhaps the values could be stored in a heapq. Nice that someone already wrote that code for me! – aghast, Feb 14, 2018 at 9:55