Counting sort in Python

Question 1

I wrote the following function to implement counting sort:

def CountSort( Arr, Max ):
 counter, retVal = [0] * (Max + 1), []
 for _ in Arr:
 counter[_] += 1
 for i in xrange( Max + 1 ):
 retVal += [i] * counter[i]
 return retVal

NOTE

The function takes the array of integer/list of integer as its first argument, and the maximum value it stores as the second. The only restriction is that the data in array should be distributed for \0ドル \le x \le Max\$.

I'm not happy with it since I have to maintain two lists in memory now, namely Arr and retVal. I could, obviously, use the following:

 # indentation preserved
 pos = 0
 for i in xrange( Max + 1 ):
 while counter[i] > 0:
 Arr[pos] = i
 pos += 1
 counter[i] -= 1

but it won't be as pythonic as the first one. An alternate approach is to delete the input Arr, and recreate it:

 # indentation preserved
 del Arr
 Arr = []
 for i in xrange( Max + 1 ):
 Arr += [i] * counter[i]

but this is not dissimilar to the second approach.

I'm open to suggestions on improvement of the first (or any of the latter).

Question 2

I'm confused about what this does, is it a sorting algorithm for integers or is it counting the values? Or is it somehow both?

Question 3

Is there a reason we're not using collections.Counter?

Question 4

@SuperBiasedMan, its a non-comparision sorting algorithm for Integers.

Question 5

I'm going to review your code for clarity, as I had trouble reading it at all initially.

First, you're using PascalCase for naming where you're supposed to use snake_case. Python's style guide (PEP0008) says that functions and variables should use snake_case. PascalCase is reserved for classes. Particularly Arr looks like it's a class of yours, rather than a simple list. As for the actual name, use numbers. It gives a clue to the type of list and doesn't confuse people about whether or not it's a plain list.

def count_sort(numbers, max_value):

Just because Python allows multiple line assignment doesn't mean it should be used all the time. It's best used when the two values being assigned have some relationship, like initialising co-ordinates x, y = 0, 0 or swapping values a, b = b, a. In this case, it's just confusing. Again about names, I'd change retVal to results.

 results = []
 counter = [0] * (max_value + 1)

Using for _ in is the common style for when you don't actually need the value you're calling, but you use it to assign to counter. So instead you should use i which is Python style for a temporary loop integer (Python wont care that you reuse it later).

 for i in numbers:
 counter[i] += 1

Instead of looping to get i and then access counter with that, you can use Python's enumerate function. It will allow you to simultaneously get the index and value from each element of a list, like this:

 for i, count in enumerate(counter):
 results += [i] * count

Now that I've gone line by line, I can actually see how it works but that was a lot of work because of unfamiliar style and no comments. A docstring in particular will make it easier to understand the point of the parameters and what is returned.

Even now, I'm confused why max is necessary since it adds confusion and potentially generates uncaught errors. You could call max(numbers) and get the highest number from the list that way. But maybe you wanted to avoid using too many builtin functions? Of course as @Barry notes this adds execution time, but the current method doesn't handle errors or explain how to pass max_value or why it's necessary. If you're leaving it out to optimise then you should catch index errors and provide a more relevant error (like ValueError) and note the point of the value in your script.

Question 6

That's the counting sorts' contract. For ease we can compute max while calling function itself. res = count_sort(ary, max(ary))

Question 7

"You could call max(numbers)..." No you shouldn't! The point of counting sort is to sort large arrays of integers that have very small ranges very efficiently. Having to make an extra pass through the array is bad.

Question 8

@Barry That does make sense, but is entirely undocumented and in its current state can raise confusing errors. I'll amend a note about it.

Question 9

Well, you kind of have to have two arrays. You need your counts array and you need your output array. If you want to sort the input array in place, you can do that, that's just a completely different semantic for your algorithm. So you'll just need to figure that out up front: do you want to sort the input list (a la list.sort()) or do you want to return a new list that is sorted (a la sorted()).

That said, here's some comments on your initial solution. To start with:

counter, retVal = [0] * (Max + 1), []

There is no reason to do such a thing. That is unnecessary terse and you're just making it hard to see what's going on. Furthermore, you don't even need retVal for now. So let's just start with the counter by itself:

counter = [0] * (Max + 1)
for _ in Arr:
 counter[_] += 1

Next, we have this loop:

for i in xrange( Max + 1 ):
 retVal += [i] * counter[i]

You can actually do a little better in terms of code appearance. We have our counter array, which has all of those indices... what we want to do is iterate over the counter array while also having access to the indices. That's what enumerate() is for:

retVal = []
for i, count in enumerate(counter):
 retVal += [i] * count
return retVal

Lastly, if you want to follow PEP-8, your function name should really be counting_sort(). And _ should probably be replaced with some single-letter name (x, a, elem, what have you).

Question 10

It would be helpful if you could provide more details about your use case. For example, why is the presence of two lists so objectionable? And are you intentionally avoiding built in functions and packages such as itertools?

Style

I would change _ to i in for _ in Arr:. The _ is used for a throwaway term, but you actually do use _. Also this i is related to the i in your second for loop so I think this would enhance understanding.

Efficiency

The rule of thumb is to avoid for loops when possible. It seems to me you could replace the first for loop with a dictionary comprehension and the second for loop with 1 or 2 list comprehrensions.
```
d = {x:arr.count(x) for x in a}
l = [[x]*d[x] for x in d.keys()]
return [item for sublist in l for item in sublist]
```

We can take this a bit further with generator expressions to keep the number of lists down by using a generator expression for l rather than a list comprehension. We could even fit it into the second list comprehension, to give you the following 2-line function:

d = {x:arr.count(x) for x in a}
return [item for sublist in ([x]*d[x] for x in d.keys()) for item in sublist]

You will notice I incorporated @SuperBiasedMan's comment about snake_case, which I agree with. Also I assume that the dictionary keys are stored in order, and this seems to be empirically true when I test the code, but you'd want to confirm that this is guaranteed in the docs.

Question 11

""Also I assume that the dictionary keys are stored in order - That's actually not true, dictionaries have arbitrary order and can't be relied on. You can use an OrderedDict if you want order (you have to import it from collections).

Question 12

@JoeWallis I actually didn't know that! Interesting. @.sunny Though this apparently works it still isn't reliable to trust as the order will vary in most dictionary uses.

Question 13

@SuperBiasedMan I specifically said in my post "you'd want to confirm" and did not guarantee it worked, only that it seemed to work.

Question 14

I know, that's why I was explaining it for you and the OP. I still upvoted your answer but wanted to add the info you weren't sure about.

Question 15

First example

You should not use _ in counter[_] += 1. _ is normally used to void output. And the behaviour changes in the command line utility.

>>> _ = 1
>>> 2
2
>>> _
2

Some of your style are un-Pythonic, CountSort( Arr, Max ) should be count_sort(arr, max_). Camels are classy, in Python. Where, almost, everything else is snake-case.

max_ isn't really Pythonic. Python implements a function max, and so to not overwrite this function it is better to use max_ than max. But in PEP8, it's explicitly stated that using a synonym is better. domain, max_number or length may be better.
So in short overwriting builtins is bad, unsurprisingly, and so use synonyms.

You can use a generator if using two lists bugs you a lot.

for number, amount in enumerate(counter):
 for _ in xrange(amount):
 yield number

The way you use it is like you do xrange, and so is only good if you only increment through the list.

Overall it's pretty solid.

Second example

You can change the second example to use two for loops, rather than one. The first that iterates through counter, and the second that iterates through an xrange of the amount returned. This portrays what you want to achieve better in Python.

def count_sort(arr, max_):
 counter = [0] * (max_ + 1)
 for i in arr:
 counter[i] += 1
 pos = 0
 for amount in counter:
 for _ in xrange(amount):
 arr[pos] = amount
 pos += 1
 
my_list = [3, 2, 1]
my_new_list = count_sort(my_list, 3)
print my_list
# [1, 2, 3]
print my_new_list
# None

So, if you didn't know about shallow copy's this would be strange. This happens as arr is a shallow copy, and so any changes to it change the original.

Third example

It's very uncommon to use del in Python. And IIRC there are some quirks with it.

Lets go through what happens if you use del in this instance. I also use enumerate as explained before.

def count_sort(arr, max_):
 counter = [0] * (max_ + 1)
 for i in arr:
 counter[i] += 1
 del arr
 arr = []
 for number, amount in enumerate(counter):
 arr += [number] * amount
 return arr
my_list = [3, 2, 1]
my_new_list = count_sort(my_list, 3)
print my_list
# [3, 2, 1]
print my_new_list
# [1, 2, 3]

In short it's a long-winded way of doing what you were doing before. But it also scares the reader that my_list would be gone, if they are unfamiliar with it.

Question 16

You could probably explain why you used max_ instead of just max, OP might not know about shadowing builtins.

Question 17

Where does i get its value from in the 2nd example? In the statement: arr[pos] = i

Question 18

Liked the third solution. Good to declare variable where its used instead of declaring all the variables at one place. But deleting array seems unnecessary, overall a good design.

score 4 · Accepted Answer · 2015-09-23 15:50:08Z

I'm going to review your code for clarity, as I had trouble reading it at all initially.

First, you're using PascalCase for naming where you're supposed to use snake_case. Python's style guide (PEP0008) says that functions and variables should use snake_case. PascalCase is reserved for classes. Particularly Arr looks like it's a class of yours, rather than a simple list. As for the actual name, use numbers. It gives a clue to the type of list and doesn't confuse people about whether or not it's a plain list.

def count_sort(numbers, max_value):

Just because Python allows multiple line assignment doesn't mean it should be used all the time. It's best used when the two values being assigned have some relationship, like initialising co-ordinates x, y = 0, 0 or swapping values a, b = b, a. In this case, it's just confusing. Again about names, I'd change retVal to results.

 results = []
 counter = [0] * (max_value + 1)

Using for _ in is the common style for when you don't actually need the value you're calling, but you use it to assign to counter. So instead you should use i which is Python style for a temporary loop integer (Python wont care that you reuse it later).

 for i in numbers:
 counter[i] += 1

Instead of looping to get i and then access counter with that, you can use Python's enumerate function. It will allow you to simultaneously get the index and value from each element of a list, like this:

 for i, count in enumerate(counter):
 results += [i] * count

Now that I've gone line by line, I can actually see how it works but that was a lot of work because of unfamiliar style and no comments. A docstring in particular will make it easier to understand the point of the parameters and what is returned.

Even now, I'm confused why max is necessary since it adds confusion and potentially generates uncaught errors. You could call max(numbers) and get the highest number from the list that way. But maybe you wanted to avoid using too many builtin functions? Of course as @Barry notes this adds execution time, but the current method doesn't handle errors or explain how to pass max_value or why it's necessary. If you're leaving it out to optimise then you should catch index errors and provide a more relevant error (like ValueError) and note the point of the value in your script.

That's the counting sorts' contract. For ease we can compute max while calling function itself. res = count_sort(ary, max(ary))
"You could call max(numbers)..." No you shouldn't! The point of counting sort is to sort large arrays of integers that have very small ranges very efficiently. Having to make an extra pass through the array is bad.
@Barry That does make sense, but is entirely undocumented and in its current state can raise confusing errors. I'll amend a note about it.

Stack Exchange Network

Counting sort in Python

NOTE

4 Answers 4

First example

Second example

Third example

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Counting sort in Python

NOTE

4 Answers 4

First example

Second example

Third example

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions