I am writing a script that takes in many sorted .dat files. I have included sample data but the dataset is quite large. The desired outcome is to have one file with an alphabetically sorted list of words.
I want to know whether I am missing anything when dealing with large numbers of words: mainly memory usage, catching errors, logging, and tests, while still continuing to iterate over the files without interruption.
Multiplying the sample data by 100,000 lets me sort 11,000,000 or so lines with no problem. With the larger sets I want to sort them too, but without crashing.
Does Python's sort() work well for this operation, or should I be looking at anything else that could be faster or more efficient?
Is it worth employing the multiprocessing module to help handle these tasks? If so, what is the best implementation? While researching, I found this article, which, whilst dealing with large files, might describe a similar process to sorting the large list at the end of the function.
The 'cost' of these operations is very important for the task.
dat1 ='allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks',
'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks'
dat2 = 'abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside', 'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya', 'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow', 'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands', 'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam', 'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus', 'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce', 'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama', 'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo', 'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest', 'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros', 'unmistakably', 'unrepresentative', 'waviest'
dat3 = 'administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke', 'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers', 'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier', 'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying', 'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest', 'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel', 'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions', 'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic', 'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward', 'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder', 'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted', 'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving', 'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating', 'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining', 'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness'
# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()
import datetime
from functools import wraps
from time import time
datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()
# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3 # * 100000
results = []
log = () #TUPLE
# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper
@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)
                # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)
        # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)

merge_sort_lists(combined_dat)
# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len

com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert res_len == com_length
- Timsort is remarkably fast for merge operations. – aghast, Feb 14, 2018 at 3:43
- Is there likely redundancy within each single file? Would a unique-ify stage make sense? – aghast, Feb 14, 2018 at 3:44
- dat1, dat2, dat3 are typical file sizes.. my main issue will be processing the sort. Excuse me for not understanding.. but what is a unique-ify stage? – johnashu, Feb 14, 2018 at 3:50
- I was wondering if the individual files might contain repeats of the same word: dog, dog, dog, dog, elephant, elephant, elephant. – aghast, Feb 14, 2018 at 3:53
- @AustinHastings I just checked and timsort is the algorithm used in python sort() – johnashu, Feb 14, 2018 at 3:58
1 Answer
It occurs to me that your problem description doesn't match your code.
You state the problem as "takes in many sorted .dat files" and needing to "have one file with an alphabetically sorted list of words".
This is a merge, not even a merge sort. You appear to be trying to process all of the data in memory, which is also not necessary. However, Timsort is scary-fast, so it may be that loading everything into memory, sorting it, and unloading it is the fastest option. If the total data size isn't > 1 GB, or > available RAM, whichever is less, this is likely true.
Option 1: Timsort FTW!
def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # NB: in-place sort! Don't call sorted(words)!
    with open(outfile, 'w') as out:
        out.writelines(words)  # map(out.write, words) would be lazy in Python 3 and write nothing
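Usage is then just one call (the filenames here are placeholders for whatever your actual .dat files are named):

merge_files('combined_alphabetical.dat', 'dat1.dat', 'dat2.dat', 'dat3.dat')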
Option 2: ZOMG! How much data do you have??!!?!!
def merge(*iters):
    if len(iters) == 1:
        yield from iters[0]  # a plain 'return iters[0]' would end the generator without yielding
        return
    iterlist = []
    values = []
    # Initialize values[] and also convert iters tuple to list (for del)
    for i, it in enumerate(iters):
        try:
            values.append(next(it))
            iterlist.append(it)  # Order matters: next() might throw.
        except StopIteration:
            pass
    iters = iterlist
    while values:
        nextvalue = min(values)
        yield nextvalue
        try:
            i = values.index(nextvalue)
            values[i] = next(iters[i])
        except StopIteration:
            del values[i], iters[i]

def merge_files(outfile, *filenames):
    iters = [iter(open(fn)) for fn in filenames]
    with open(outfile, 'w') as out:
        out.writelines(merge(*iters))  # unpack the list; map(out.write, ...) is lazy in Python 3
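This keeps only one pending line per input file in memory, rather than the whole dataset. You can sanity-check merge() on its own with toy iterators (made-up data, not your files):

print(list(merge(iter(['a', 'c']), iter(['b', 'd']))))  # ['a', 'b', 'c', 'd']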
This looks like it might actually make a nice function for itertools. It just needs a key= option, I guess.
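For what it's worth, the standard library already ships this as heapq.merge, which lazily merges any number of sorted iterables and has accepted a key= argument since Python 3.5. A minimal sketch of merge_files built on it, assuming the same calling convention as above:

import heapq

def merge_files(outfile, *filenames):
    files = [open(fn) for fn in filenames]
    try:
        with open(outfile, 'w') as out:
            # heapq.merge yields lines in sorted order, one at a time
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()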
- Interesting, I did not know that the try block would leak variables defined inside. Not only to the except block (where you are able to access i for the del), but also outside of it... – Graipher, Feb 14, 2018 at 7:08
- For merging multiple sorted iterables, Python has heapq.merge. You might also want to look at this answer. – Gareth Rees, Feb 14, 2018 at 9:14
- @GarethRees Good to know. I had flagged to myself that min() was likely a time sink, and perhaps the values could be stored in a heapq. Nice that someone already wrote that code for me! – aghast, Feb 14, 2018 at 9:55