Sorts and data structures: a bucket sort

Question 1

I am currently going through and implementing basic algorithms and data structures as I feel I didn't learn enough about this during my Data Structures and Algorithms unit. So far I've done the following algorithms, yet to get to data structures.

Sorting

Bucket

import random
#toSort = [round(random.uniform(0.10, 0.99), 2) for _ in range(20)]
toSort = [round(random.uniform(0.10, 0.99), 2) for i in range(10, 0, -1)]
print(toSort)
def bucket_sort(array):
 buckets = [[] for i in range(len(array))]
 sorted_buckets = []
 for index in range(len(array)):
 bucket_num = len(array) * array[index] 
 print(bucket_num)
 buckets[int(bucket_num)].append(array[index])
 for bucket in buckets:
 insertion_sort(bucket)
 
 for bucket in buckets:
 if len(bucket) == 0:
 continue
 elif len(bucket) > 1:
 for num in bucket:
 sorted_buckets.append(num)
 else:
 sorted_buckets.append(bucket[0])
 return sorted_buckets
def insertion_sort(array):
 for unsorted_val in range(1, len(array)):
 val = array[unsorted_val]
 val_index = unsorted_val
 while val_index > 0 and array[val_index - 1] > val:
 array[val_index] = array[val_index - 1]
 val_index -= 1
 array[val_index] = val
toSort = bucket_sort(toSort)

Insertion
Merge
Quick
Selection

Searching

Binary
Linear

Merging

ArrayMerge

I was planning to include the code for the rest of the algorithms I've implemented but I have found code reviews for those already so I do not want to double up on the questions.

However, I did want to ask this on the meta page but I didn't have enough reputation to post. If I were to go through and update my code with the suggestions in the other code reviews, would it be okay for me to post here with my implementation and ask for further review?

Question 2

Welcome to CodeReview@SE - especially to post your implementation of algorithms featured time and again (one per question, if not for comparison). For parts of run time environments, there even is the reinventing-the-wheel tag for this. Heed How do I ask a Good Question? (First thing to improve would be the title of this question.)

Question 3

@greybeard Sorry I understood the rest of your question, but mind re-stating this part? "especially to post your implementation of algorithms featured time and again"

Question 4

It is quite alright to ask for a review of your implementation, without regard that there are umpteen reviews for different ones of the same algorithm. In my book, (non-trivial) differences in code layout or code comments are enough. (rest of [my] question ? I didn't ask anything.)

Question 5

Cool thank you for that, and yeah haha meant to write "comment/answer" instead. So to clarify, etiquette is to ask for a review of one program per question?

Question 6

You have implemented an unusual bucket sort. First, the logic to compute the bucket number makes assumptions about the values themselves and will fail on many types of numbers (for example, positive integers). And second, if N is the size of the input list, you are creating N buckets. Typically, bucket sort uses a number of buckets that is smaller than N. A common approach is to make an initial pass over the values to find their min and max. Then each bucket will have a span of (MAX - MIN) / K, where K is the number of buckets (which might be set either by the caller or by the code based on N). For any x value, I think its bucket index would be min(K - 1, int((x - MIN) / SPAN)) (you should double check that).

My other comments relate to code readability and simplicity.

Use convenience variables to eliminate repeated calculations, such as len(array). If you need it multiple times, create a variable and lighten the visual weight of your code.

Organize your code into commented "paragraphs" -- one paragraph per small step in the logic of your algorithm (shown below).

If you need to iterate over values in a collection, do it directly, not indirectly via indexes. Use for x in xs not for i in range(len(xs). If the algorithm requires both values an indexes, use enumerate(). Only iterate over indexes if you don't actually need the values or if the algorithm's readability is simpler that way (for example, in your insertion_sort() function).

Your code to reassemble the sorted buckets it overly complicated -- specifically, the size of the buckets is not important. The work can be done either with a list comprehension (as shown) or the equivalent use of 2 for loops.

Consider using a naming convention that I learned from functional programming: xs for a collection of things and x for one thing. Its extendable (ys and y, zs and z, etc) and it works quite nicely in generic situations like this where we know nothing about the substantive meaning of the values. This also lightens up the code weight -- enhancing readability without any loss of understandability.

The variable naming in insertion_sort() is backwards. You iterate over the indexes but call each index an unsorted_val. If it's an index, just call it index or, even better, i (a convention everyone understands). Then if you also need the value, get it with xs[i]. Again, notice how these short variable naming conventions can often enhance readability -- especially if the scope is small and well defined.

Finally, it is unusual to modify an index value during an iteration over indexes, as you do in insertion_sort(). It forces your reader to puzzle things over. I've seen more intuitive insertion sort implementations. For comparison, see the this pseudo-code. Note how the use of "swap" in that alternative implementation really helps the reader understand what's going on. Either adjust your code or add some guidance to your reader.

Here's an edit focused on the readability and simplicity issues only:

def bucket_sort(xs):
 # Convenience variables.
 N = len(xs)
 # Put values into buckets.
 buckets = [[] for _ in range(N)]
 for x in xs:
 i = int(N * x)
 buckets[i].append(x)
 # Sort each bucket.
 #
 # To keep hammering the point, `b` is a better variable
 # name than `bucket` within this tiny, well-defined context.
 for b in buckets:
 insertion_sort(b)
 
 # Return the sorted values.
 return [
 x
 for b in buckets
 for x in b
 ]
def insertion_sort(xs):
 # Only stylistic edits here.
 for i in range(1, len(xs)):
 x = xs[i]
 while i > 0 and xs[i - 1] > x:
 xs[i] = xs[i - 1]
 i -= 1
 xs[i] = x

Question 7

Hi thank you for the detailed reply, really appreciate it! I like the idea of using x in xs notation for sets of data. I also like organising the code into paragraphs as it gives me a better understanding of the algorithm itself. Would you be able to elaborate a little on how I can calculate K from N? and would you recommend having K just given to the function instead as it allows more flexibility to the caller?

Question 8

@VehicularIT If you're trying to making this function truly general purpose and robust, you probably should do some internet searching to see what the computer-science experts advise regarding bucket sizing. However, if you're just wanting a practical approach and you're mostly doing this for basic algorithm study, you could take a simpler approach. For example, make k an optional function argument, and then set it with a very basic rule based on N. Here's one that might not be very good, but it conveys the spirit: k = k or min(N // 3, 10).

FMc FMc 13.1k2 gold badges29 silver badges40 bronze badges · Accepted Answer · 2020-09-08 21:08:07Z