Natural merge: mergesort that uses already sorted subarrays

Question 1

The code bellow is my implementation for the natural merge exercise in Robert Sedgwick's Algorithms book:

Write a version of bottom-up mergesort that takes advantage of order in the array by proceeding as follows each time it needs to find two arrays to merge: find a sorted subarray (by incrementing a pointer until finding an entry that is smaller than its predecessor in the array), then find the next, then merge them.

def merge(a, lo, mi, hi):
 aux_lo = deque(a[lo:mi])
 aux_hi = deque(a[mi:hi])
 for i in range(lo, hi):
 if len(aux_lo) and len(aux_hi):
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif len(aux_lo) or len(aux_hi):
 a[i] = aux_lo.popleft() if aux_lo else aux_hi.popleft()
def find_next_stop(a, start):
 if start >= len(a)-1:
 return start
 stop = start + 1
 if a[start] < a[stop]:
 while(stop<len(a)-1 and a[stop] <= a[stop+1]):
 stop += 1
 else:
 while(stop<len(a)-1 and a[stop] >= a[stop+1]):
 stop += 1
 _stop = stop
 while(start<_stop):
 a[_stop], a[start] = a[start], a[_stop]
 start += 1
 _stop -= 1
 return stop
def natural_merge(a):
 lo = hi = 0
 while(True):
 lo = hi
 mi = find_next_stop(a, lo)
 if lo == 0 and mi == len(a) - 1:
 return
 hi = find_next_stop(a, mi)
 if mi == hi == len(a)-1:
 lo = hi = 0
 continue
 merge(a, lo, mi, hi)

I referenced this answer.

Question 2

Add docstrings and typed arguments. Otherwise it's not bad

Question 3

A pity the code presented is uncommented but for a cryptic this takes more space than allowed - extra constraint(s)? (There's more than one error in I take this anwser as a referenced..)

Question 4

@greybeard Thanks for helping me point out the errors.

Question 5

I've reverted your change. The code in the question should not be updated to ensure that answers stay relevant.

Question 6

@Josay But if you could update your answer according to my change reverted by you, I'd appreciate it very much.

Question 7

Tests and bugs

Your code looks mostly good. Before trying to go and change the code to see what can be improved. I usually try to write a few simple tests. This is even easier when you have a reference implementation that can be compared to your function. In your case, I wrote:

TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
]
for t in TESTS:
 print(t)
 ref_lst, tst_lst = list(t), list(t)
 ref_lst.sort()
 natural_merge(tst_lst)
 print(ref_lst, tst_lst)
 assert ref_lst == tst_lst

which leads to a first comment: the empty list is not handled properly and the function never returns.

Improving merge

The case elif len(aux_lo) or len(aux_hi) seems complicated as we check if aux_lo just after. Things would be clearer if we were to split in two different cases:

 if len(aux_lo) and len(aux_hi):
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif len(aux_lo):
 a[i] = aux_lo.popleft()
 elif len(aux_hi):
 a[i] = aux_hi.popleft()

Also, we could reuse the fact that in a boolean context, list are considered to be true if and only if they are not empty to write:

for i in range(lo, hi):
 if aux_lo and aux_hi:
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif aux_lo:
 a[i] = aux_lo.popleft()
 elif aux_hi:
 a[i] = aux_hi.popleft()

Improving find_next_stop

You don't need so many parenthesis.

You could store len(a) - 1 in a variable in order not to re-compute it every time.

The name _stop is pretty ugly. I do not have any great suggestion for an alternative but end seems okay-ish.

More tests... and more bugs

I wanted to add a few tests to verify an assumption... and I stumbled upon another issue.

Here is the corresponding test suite:

TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [0, 1, 2, 2, 1, 0],
 [0, 1, 2, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
 [5, 2, 5, 3, 1, 3, 0, 4, 5, 3, 1, 0, 1, 5, 2, 5, 3, 1, 3, 0, 4, 5],
]

Taking into account my comments and the fix you tried to add in the question, we now have:

from collections import deque
def merge(a, lo, mi, hi):
 aux_lo = deque(a[lo:mi])
 aux_hi = deque(a[mi:hi])
 for i in range(lo, hi):
 if aux_lo and aux_hi:
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif aux_lo:
 a[i] = aux_lo.popleft()
 elif aux_hi:
 a[i] = aux_hi.popleft()
def find_next_stop(a, start):
 upper = len(a) - 1
 if start >= upper:
 return start
 stop = start + 1
 if a[start] <= a[stop]:
 while stop < upper and a[stop] <= a[stop+1]:
 stop += 1
 else:
 while stop < upper and a[stop] >= a[stop+1]:
 stop += 1
 end = stop
 while start < end:
 a[end], a[start] = a[start], a[end]
 start += 1
 end -= 1
 return stop
def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = hi = 0
 while True:
 lo = hi
 mi = find_next_stop(a, lo)
 if lo == 0 and mi == upper:
 return
 hi = find_next_stop(a, mi)
 if mi == hi == upper:
 lo = hi = 0
 else:
 merge(a, lo, mi, hi)
TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [0, 1, 2, 2, 1, 0],
 [0, 1, 2, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
 [5, 2, 5, 3, 1, 3, 0, 4, 5, 3, 1, 0, 1, 5, 2, 5, 3, 1, 3, 0, 4, 5],
]
for t in TESTS:
 print(t)
 ref_lst, tst_lst = list(t), list(t)
 ref_lst.sort()
 natural_merge(tst_lst)
 print(ref_lst, tst_lst)
 assert ref_lst == tst_lst

More improvements in find_next_stop

We have a while loop but we can compute the number of iterations we'll need: it corresponds to have the distance between start and stop. We could use a for _ in range loop to perform this. It has pros and cons but one of the key aspect is that we do not need to change start and stop, thus we don't need to copy the value in a variable.

 for k in range((1 + stop - start) // 2):
 i, j = start + k, stop - k
 a[i], a[j] = a[j], a[i]

More improvements in natural_merge

A few steps can be used to re-organise the function:

see that we can move the assignment lo = hi from the beginning of the loop to the end of the loop with no impact
realise that it is already done in the first branch of the test already so move it to the else block exclusively
see that the initialisation of hi is not required anymore
notice that the condition mi == upper is checked in 2 places (with the same value of mi and upper and that if lo != 0, we see that mi == upper directly leads to find_next(a, mi) returning upper and thus ending with mi == hi == upper and thus to the assignment lo = hi = 0.

At this stage, we have:

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = 0
 while True:
 mi = find_next_stop(a, lo)
 if mi == upper:
 if lo == 0:
 return
 lo = hi = 0
 else:
 hi = find_next_stop(a, mi)
 merge(a, lo, mi, hi)
 lo = hi

We can go further:

the assignment hi = 0 has no effect
we can reorganise conditions

We'd get

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = 0
 while True:
 mi = find_next_stop(a, lo)
 if mi != upper:
 hi = find_next_stop(a, mi)
 merge(a, lo, mi, hi)
 lo = hi
 elif lo == 0:
 return
 else:
 lo = 0

Interestingly, removing lo = hi leads a much more efficient code on my benchmark: the function returns much more quickly (because we always have lo == 0, we get out of the loop as soon as mi == upper) and the list is still fully sorted.

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 while True:
 mi = find_next_stop(a, 0)
 if mi == upper:
 return
 hi = find_next_stop(a, mi)
 merge(a, 0, mi, hi)

This looked surprising at first but thinking about it, it looks like this may be the way this algorithm is supposed to be.

Question 8

I've realized the meaning of the last snippet in your answer, awesome!

SylvainD SylvainD 29.7k1 gold badge49 silver badges93 bronze badges · Accepted Answer · 2018-12-25 23:25:02Z

Tests and bugs

Your code looks mostly good. Before trying to go and change the code to see what can be improved. I usually try to write a few simple tests. This is even easier when you have a reference implementation that can be compared to your function. In your case, I wrote:

TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
]
for t in TESTS:
 print(t)
 ref_lst, tst_lst = list(t), list(t)
 ref_lst.sort()
 natural_merge(tst_lst)
 print(ref_lst, tst_lst)
 assert ref_lst == tst_lst

which leads to a first comment: the empty list is not handled properly and the function never returns.

Improving merge

The case elif len(aux_lo) or len(aux_hi) seems complicated as we check if aux_lo just after. Things would be clearer if we were to split in two different cases:

 if len(aux_lo) and len(aux_hi):
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif len(aux_lo):
 a[i] = aux_lo.popleft()
 elif len(aux_hi):
 a[i] = aux_hi.popleft()

Also, we could reuse the fact that in a boolean context, list are considered to be true if and only if they are not empty to write:

for i in range(lo, hi):
 if aux_lo and aux_hi:
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif aux_lo:
 a[i] = aux_lo.popleft()
 elif aux_hi:
 a[i] = aux_hi.popleft()

Improving find_next_stop

You don't need so many parenthesis.

You could store len(a) - 1 in a variable in order not to re-compute it every time.

The name _stop is pretty ugly. I do not have any great suggestion for an alternative but end seems okay-ish.

More tests... and more bugs

I wanted to add a few tests to verify an assumption... and I stumbled upon another issue.

Here is the corresponding test suite:

TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [0, 1, 2, 2, 1, 0],
 [0, 1, 2, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
 [5, 2, 5, 3, 1, 3, 0, 4, 5, 3, 1, 0, 1, 5, 2, 5, 3, 1, 3, 0, 4, 5],
]

Taking into account my comments and the fix you tried to add in the question, we now have:

from collections import deque
def merge(a, lo, mi, hi):
 aux_lo = deque(a[lo:mi])
 aux_hi = deque(a[mi:hi])
 for i in range(lo, hi):
 if aux_lo and aux_hi:
 a[i] = aux_lo.popleft() if aux_lo[0] < aux_hi[0] else aux_hi.popleft()
 elif aux_lo:
 a[i] = aux_lo.popleft()
 elif aux_hi:
 a[i] = aux_hi.popleft()
def find_next_stop(a, start):
 upper = len(a) - 1
 if start >= upper:
 return start
 stop = start + 1
 if a[start] <= a[stop]:
 while stop < upper and a[stop] <= a[stop+1]:
 stop += 1
 else:
 while stop < upper and a[stop] >= a[stop+1]:
 stop += 1
 end = stop
 while start < end:
 a[end], a[start] = a[start], a[end]
 start += 1
 end -= 1
 return stop
def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = hi = 0
 while True:
 lo = hi
 mi = find_next_stop(a, lo)
 if lo == 0 and mi == upper:
 return
 hi = find_next_stop(a, mi)
 if mi == hi == upper:
 lo = hi = 0
 else:
 merge(a, lo, mi, hi)
TESTS = [
 [],
 [0],
 [0, 0, 0],
 [0, 1, 2],
 [0, 1, 2, 3, 4, 5],
 [5, 4, 3, 2, 1, 0],
 [0, 1, 2, 2, 1, 0],
 [0, 1, 2, 3, 2, 1, 0],
 [5, 2, 3, 1, 0, 4],
 [5, 2, 5, 3, 1, 3, 0, 4, 5],
 [5, 2, 5, 3, 1, 3, 0, 4, 5, 3, 1, 0, 1, 5, 2, 5, 3, 1, 3, 0, 4, 5],
]
for t in TESTS:
 print(t)
 ref_lst, tst_lst = list(t), list(t)
 ref_lst.sort()
 natural_merge(tst_lst)
 print(ref_lst, tst_lst)
 assert ref_lst == tst_lst

More improvements in find_next_stop

We have a while loop but we can compute the number of iterations we'll need: it corresponds to have the distance between start and stop. We could use a for _ in range loop to perform this. It has pros and cons but one of the key aspect is that we do not need to change start and stop, thus we don't need to copy the value in a variable.

 for k in range((1 + stop - start) // 2):
 i, j = start + k, stop - k
 a[i], a[j] = a[j], a[i]

More improvements in natural_merge

A few steps can be used to re-organise the function:

see that we can move the assignment lo = hi from the beginning of the loop to the end of the loop with no impact
realise that it is already done in the first branch of the test already so move it to the else block exclusively
see that the initialisation of hi is not required anymore
notice that the condition mi == upper is checked in 2 places (with the same value of mi and upper and that if lo != 0, we see that mi == upper directly leads to find_next(a, mi) returning upper and thus ending with mi == hi == upper and thus to the assignment lo = hi = 0.

At this stage, we have:

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = 0
 while True:
 mi = find_next_stop(a, lo)
 if mi == upper:
 if lo == 0:
 return
 lo = hi = 0
 else:
 hi = find_next_stop(a, mi)
 merge(a, lo, mi, hi)
 lo = hi

We can go further:

the assignment hi = 0 has no effect
we can reorganise conditions

We'd get

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 lo = 0
 while True:
 mi = find_next_stop(a, lo)
 if mi != upper:
 hi = find_next_stop(a, mi)
 merge(a, lo, mi, hi)
 lo = hi
 elif lo == 0:
 return
 else:
 lo = 0

Interestingly, removing lo = hi leads a much more efficient code on my benchmark: the function returns much more quickly (because we always have lo == 0, we get out of the loop as soon as mi == upper) and the list is still fully sorted.

def natural_merge(a):
 upper = len(a) - 1
 if upper <= 0:
 return
 while True:
 mi = find_next_stop(a, 0)
 if mi == upper:
 return
 hi = find_next_stop(a, mi)
 merge(a, 0, mi, hi)

This looked surprising at first but thinking about it, it looks like this may be the way this algorithm is supposed to be.

I've realized the meaning of the last snippet in your answer, awesome!

Stack Exchange Network

Natural merge: mergesort that uses already sorted subarrays

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Natural merge: mergesort that uses already sorted subarrays

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions