Inversion count via divide and conquer

Question 1

I'm happy to hear thoughts and ideas on structure/performance/testing/whatever and multi-threading, which I haven't gotten into yet with Python.

Full code here. Assignment file and a test file included in the folder.

Excerpt:

import unittest
import sys
def Merge_and_CountSplitInv(A,start_index,len_1st,len_2nd):
 """
 Merges two similar length pre-sorted sub-slices of array A while counting
 inversions.
 Input: 
 array A, start_index (location of first sub-slice)
 len_1st,len_2nd (subslice_lengths).
 Output: inversions (values where i<j, but A[i]>A[j])
 Side Effect: two consecutive A subsections are sorted.
 """
 inversions=0
 temp_array=[]
 index_1st=start_index
 index_2nd=start_index+len_1st
 #while both indices are in range
 while ((index_1st < start_index+len_1st) and
 (index_2nd < start_index+len_1st+len_2nd)):
 #place smaller value in temp_array, increase inversions 
 if A[index_1st]<=A[index_2nd]:
 temp_array.append(A[index_1st])
 index_1st += 1
 else:
 temp_array.append(A[index_2nd])
 index_2nd += 1
 inversions+=(start_index+len_1st-index_1st)
 #one index is out of range:
 if index_2nd == start_index+len_1st+len_2nd: # 2nd index complete
 #add leftover 1st half values to temp_array
 #before destroying them by the write process.
 temp_array.extend(A[index_1st:start_index+len_1st])
 else: # 1st index complete
 pass #no need to write over leftovers in 2nd sub-array
 #write temp_array over A[start_index:start_index+len(temp_array)]
 write_index=start_index
 for value in temp_array:
 A[write_index]=value
 write_index +=1
 return inversions
def Sort_and_Count(A, n, start_index=0):
 """
 Input: array A, length n, start_index(default=0)
 Output: inversions (values where i<j, but A[i]>A[j])
 Side Effect: A is sorted.
 """
 if n==1:
 return 0 # base case, array of length 1
 else:
 # Input: 1st and 2nd halves of current sub-array
 len_1st=n//2
 len_2nd=n//2+n%2
 x=Sort_and_Count(A, len_1st, start_index)
 y=Sort_and_Count(A, len_2nd, start_index+len_1st)
 # merge the (newly sorted) half-sized sub-arrays
 z=Merge_and_CountSplitInv(A,start_index,len_1st,len_2nd)
 return x+y+z
class TestSort_and_Count(unittest.TestCase):
 """
 Basic test class
 """
 def test_Sort_and_Count(self):
 A=[1]
 res1 = Sort_and_Count(A,len(A)) # single element
 self.assertEqual(res1, 0)
 B=[1,3,5,2,4,6]
 res2 = Sort_and_Count(B,len(B)) # even length
 self.assertEqual(res2, 3)
 C=[1,3,5,2,4,6,3]
 res3 = Sort_and_Count(C,len(C)) # odd length, duplicate value
 self.assertEqual(res3, 6)
def main(file_name):
 #TODO: take values from file and run sort_and_count
 with open(file_name) as fh:
 if file_name[:4] == 'test':
 print(fh.readline()) # get rid of first answer line from debug file
 A = list(map(int, [line.strip() for line in fh]))
 print(Sort_and_Count(A,len(A)))
if __name__ == '__main__':
 #TODO: working with argv to accept file input
 if len(sys.argv) > 2:
 sys.exit("Usage: inv_count <file_name> (leave empty for testing)")
 if len(sys.argv) == 1:
 print("No filename input, testing...")
 unittest.main()
 # else: argv == 2
 main(sys.argv[1])

Question 2

1. Bug

If you pass the empty list to Sort_and_Count, it recurses until it runs out of stack:

>>> Sort_and_Count([], 0)
Traceback (most recent call last):
 File "cr107928.py", line 67, in test_Sort_and_Count
 Sort_and_Count([],0)
 ... many lines omitted ...
 File "cr107928.py", line 49, in Sort_and_Count
 if n==1:
RecursionError: maximum recursion depth exceeded in comparison

2. Review

The code omits spaces around operators. This makes it hard to read. It is better to follow the Python style guide (PEP8):

Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not). [...] If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies).
The names do not follow the Python style guide. This recommends:

Class names should normally use the CapWords convention.

so TestSortAndCount would be better than TestSort_and_Count.

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

so merge_and_count_inversions would be better than Merge_and_CountSplitInv.

It's not compulsory to follow the Python style guide, but it makes it easier for other Python programmers to read your code.
Python's documentation uses the term "sequence" to refer to indexable collections, not "array". There's less risk of confusion if you stick to the standard term.
The docstring says, "Merges two similar length pre-sorted sub-slices", but in fact the function works fine even if the subsequences are not of similar length. I know that when called from Sort_and_Count the subsequences will be equal in length or differ by one, but it's better to document the actual behaviour of the function.
Instead of index_1st and index_2nd, I would use i and j. These variables are used many times, so using short names reduces the amount of code and make it easier to read.
The arguments len_1st and len_2nd are only used within the expressions start_index+len_1st and start_index+len_1st+len_2nd. This suggests that the arguments to the function should be start, middle and end instead.
Instead of updating A one element at a time:
```
write_index=start_index
for value in temp_array:
 A[write_index]=value
 write_index +=1
```
update all elements at once using a slice assignment:
```
A[start_index:start_index + len(temp_array)] = temp_array
```
This is efficient (in CPython, anyway) because the replacement is the same length as the slice being assigned.
In Sort_and_Count, the caller is expected to pass in the length of the sequence. It would be simpler if Sort_and_Count called len instead. This could easily be done using a helper function; see updated code below.
The test case doesn't try the empty list (see "Bug" above).
The test cases only test that the inversion count is correct, but don't actually test that the sequence is sorted! You need something like:
```
self.assertEqual(A, sorted(A))
```
The test cases have hard-coded counts of inversions. But it is easy to count the inversions in a sequence, using itertools.combinations:
```
sum(j < i for i, j in combinations(seq, 2))
```
This is inefficient (it's \$ Θ(n^2) \$), but clearly correct, and so a good comparison for the efficient, but not clearly correct, counts produced by the function under test.

The test cases are very repetitive. This could be avoided with a helper method, something like this:

def check(self, seq):
 expected = sum(j < i for i, j in combinations(seq, 2))
 found = sort_and_count_inversions(seq)
 self.assertEqual(expected, found)
 self.assertEqual(seq, sorted(seq))

The test cases are not very thorough. You could easily test many more cases using random test generation:
```
for _ in range(100):
 self.check([randrange(100) for _ in range(randrange(100))])
```

3. Revised code

from itertools import combinations
from random import randrange
from unittest import TestCase
def merge_and_count_inversions(seq, start, middle, end):
 """Merge two adjacent sorted sub-sequences of seq and count
 inversions.
 Arguments: 
 seq -- sequence to sort
 start -- index of beginning of first sorted subsequence
 middle -- end of first, and beginning of second, sorted subsequence
 end -- end of second sorted subsequence
 Result: number of inversions (cases where i < j, but seq[i] > seq[j]).
 Side effect: seq is sorted between start and end.
 """
 assert 0 <= start < middle < end <= len(seq)
 inversions = 0
 temp = []
 i = start
 j = middle
 while i < middle and j < end:
 if seq[i] <= seq[j]:
 temp.append(seq[i])
 i += 1
 else:
 temp.append(seq[j])
 j += 1
 inversions += middle - i
 if j == end:
 # Second subsequence is complete: process remainder of first.
 temp.extend(seq[i:middle])
 else:
 # First subsequence is complete: no need to process
 # remaininder of second, since it does not move.
 pass
 # Insert sorted results into original sequence.
 seq[start:start+len(temp)] = temp
 return inversions
def sort_and_count_inversions(seq):
 """Sort a sequence and count inversions.
 Argument: seq -- a sequence
 Result: number of inversions (cases where i < j, but seq[i] > seq[j]).
 Side effect: seq is sorted.
 """
 def sort_and_count(seq, start, end):
 if end - start < 2:
 return 0
 middle = (start + end) // 2
 return (sort_and_count(seq, start, middle)
 + sort_and_count(seq, middle, end)
 + merge_and_count_inversions(seq, start, middle, end))
 return sort_and_count(seq, 0, len(seq))
class TestSortAndCount(TestCase):
 def check(self, seq):
 expected = sum(j < i for i, j in combinations(seq, 2))
 found = sort_and_count_inversions(seq)
 self.assertEqual(expected, found)
 self.assertEqual(seq, sorted(seq))
 def runTest(self):
 self.check([]) # empty sequence
 self.check([1]) # single element
 self.check([1, 3, 5, 2, 4, 6]) # even length
 self.check([1, 3, 5, 2, 4, 6, 3]) # odd length, duplicate value
 for _ in range(100): # random test cases
 self.check([randrange(100) for _ in range(randrange(100))])

Question 3

I riffed over all the notes and will go over them again later - esp. the testing notes. Nice find on the bug +1 just for that. On a quick note, I wanted to use slice replacement, but was worried it takes O(n), which breaks the algorithm.

Question 4

The implementation of slice assignment in CPython checks whether the replacement is the same size as the slice being replaced, and if so it does the efficient thing. See list_ass_slice in listobject.c.

Question 5

3. the term "sequence" - will try to keep in mind.

Question 6

8. It would be simpler if Sort_and_Count called len instead. This could easily be done using a helper function; -- very nice workaround.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2015-10-20 17:39:59Z

1. Bug

If you pass the empty list to Sort_and_Count, it recurses until it runs out of stack:

>>> Sort_and_Count([], 0)
Traceback (most recent call last):
 File "cr107928.py", line 67, in test_Sort_and_Count
 Sort_and_Count([],0)
 ... many lines omitted ...
 File "cr107928.py", line 49, in Sort_and_Count
 if n==1:
RecursionError: maximum recursion depth exceeded in comparison

2. Review

The code omits spaces around operators. This makes it hard to read. It is better to follow the Python style guide (PEP8):

Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not). [...] If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies).
The names do not follow the Python style guide. This recommends:

Class names should normally use the CapWords convention.

so TestSortAndCount would be better than TestSort_and_Count.

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

so merge_and_count_inversions would be better than Merge_and_CountSplitInv.

It's not compulsory to follow the Python style guide, but it makes it easier for other Python programmers to read your code.
Python's documentation uses the term "sequence" to refer to indexable collections, not "array". There's less risk of confusion if you stick to the standard term.
The docstring says, "Merges two similar length pre-sorted sub-slices", but in fact the function works fine even if the subsequences are not of similar length. I know that when called from Sort_and_Count the subsequences will be equal in length or differ by one, but it's better to document the actual behaviour of the function.
Instead of index_1st and index_2nd, I would use i and j. These variables are used many times, so using short names reduces the amount of code and make it easier to read.
The arguments len_1st and len_2nd are only used within the expressions start_index+len_1st and start_index+len_1st+len_2nd. This suggests that the arguments to the function should be start, middle and end instead.
Instead of updating A one element at a time:
```
write_index=start_index
for value in temp_array:
 A[write_index]=value
 write_index +=1
```
update all elements at once using a slice assignment:
```
A[start_index:start_index + len(temp_array)] = temp_array
```
This is efficient (in CPython, anyway) because the replacement is the same length as the slice being assigned.
In Sort_and_Count, the caller is expected to pass in the length of the sequence. It would be simpler if Sort_and_Count called len instead. This could easily be done using a helper function; see updated code below.
The test case doesn't try the empty list (see "Bug" above).
The test cases only test that the inversion count is correct, but don't actually test that the sequence is sorted! You need something like:
```
self.assertEqual(A, sorted(A))
```
The test cases have hard-coded counts of inversions. But it is easy to count the inversions in a sequence, using itertools.combinations:
```
sum(j < i for i, j in combinations(seq, 2))
```
This is inefficient (it's \$ Θ(n^2) \$), but clearly correct, and so a good comparison for the efficient, but not clearly correct, counts produced by the function under test.

The test cases are very repetitive. This could be avoided with a helper method, something like this:

def check(self, seq):
 expected = sum(j < i for i, j in combinations(seq, 2))
 found = sort_and_count_inversions(seq)
 self.assertEqual(expected, found)
 self.assertEqual(seq, sorted(seq))

The test cases are not very thorough. You could easily test many more cases using random test generation:
```
for _ in range(100):
 self.check([randrange(100) for _ in range(randrange(100))])
```

3. Revised code

from itertools import combinations
from random import randrange
from unittest import TestCase
def merge_and_count_inversions(seq, start, middle, end):
 """Merge two adjacent sorted sub-sequences of seq and count
 inversions.
 Arguments: 
 seq -- sequence to sort
 start -- index of beginning of first sorted subsequence
 middle -- end of first, and beginning of second, sorted subsequence
 end -- end of second sorted subsequence
 Result: number of inversions (cases where i < j, but seq[i] > seq[j]).
 Side effect: seq is sorted between start and end.
 """
 assert 0 <= start < middle < end <= len(seq)
 inversions = 0
 temp = []
 i = start
 j = middle
 while i < middle and j < end:
 if seq[i] <= seq[j]:
 temp.append(seq[i])
 i += 1
 else:
 temp.append(seq[j])
 j += 1
 inversions += middle - i
 if j == end:
 # Second subsequence is complete: process remainder of first.
 temp.extend(seq[i:middle])
 else:
 # First subsequence is complete: no need to process
 # remaininder of second, since it does not move.
 pass
 # Insert sorted results into original sequence.
 seq[start:start+len(temp)] = temp
 return inversions
def sort_and_count_inversions(seq):
 """Sort a sequence and count inversions.
 Argument: seq -- a sequence
 Result: number of inversions (cases where i < j, but seq[i] > seq[j]).
 Side effect: seq is sorted.
 """
 def sort_and_count(seq, start, end):
 if end - start < 2:
 return 0
 middle = (start + end) // 2
 return (sort_and_count(seq, start, middle)
 + sort_and_count(seq, middle, end)
 + merge_and_count_inversions(seq, start, middle, end))
 return sort_and_count(seq, 0, len(seq))
class TestSortAndCount(TestCase):
 def check(self, seq):
 expected = sum(j < i for i, j in combinations(seq, 2))
 found = sort_and_count_inversions(seq)
 self.assertEqual(expected, found)
 self.assertEqual(seq, sorted(seq))
 def runTest(self):
 self.check([]) # empty sequence
 self.check([1]) # single element
 self.check([1, 3, 5, 2, 4, 6]) # even length
 self.check([1, 3, 5, 2, 4, 6, 3]) # odd length, duplicate value
 for _ in range(100): # random test cases
 self.check([randrange(100) for _ in range(randrange(100))])

I riffed over all the notes and will go over them again later - esp. the testing notes. Nice find on the bug +1 just for that. On a quick note, I wanted to use slice replacement, but was worried it takes O(n), which breaks the algorithm.
The implementation of slice assignment in CPython checks whether the replacement is the same size as the slice being replaced, and if so it does the efficient thing. See list_ass_slice in listobject.c.
8. It would be simpler if Sort_and_Count called len instead. This could easily be done using a helper function; -- very nice workaround.

Stack Exchange Network

Inversion count via divide and conquer

1 Answer 1

1. Bug

2. Review

3. Revised code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Inversion count via divide and conquer

1 Answer 1

1. Bug

2. Review

3. Revised code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions