I have two functions in Python that do the same thing: they partition a set of items of different sizes into a given number of subsets ("bins"), using an algorithm called greedy number partitioning. The algorithm works as follows: it loops over the items from large to small, and puts the next item into a bin that currently contains the smallest total size. The sizes are always positive integers.
Currently, I have two variants of this function. One variant just gets a list of the sizes, and returns a partition of the sizes:
def partition_list(num_of_bins: int, items: list[int]) -> list[list[int]]:
    """
    Partition the given items using the greedy number partitioning algorithm.

    >>> partition_list(2, [1,2,3,3,5,9,9])
    [[9, 5, 2], [9, 3, 3, 1]]
    >>> partition_list(3, [1,2,3,3,5,9,9])
    [[9, 2], [9, 1], [5, 3, 3]]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    for item in sorted(items, reverse=True):
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += item
    return bins
The other variant gets a dict mapping an item name to its size, and returns a partition of the item names:
def partition_dict(num_of_bins: int, items: dict[str, int]) -> list[list[str]]:
    """
    Partition the given items using the greedy number partitioning algorithm.

    >>> partition_dict(2, {"a":1, "b":2, "c":3, "d":3, "e":5, "f":9, "g":9})
    [['f', 'e', 'b'], ['g', 'c', 'd', 'a']]
    >>> partition_dict(3, {"a":1, "b":2, "c":3, "d":3, "e":5, "f":9, "g":9})
    [['f', 'b'], ['g', 'a'], ['e', 'c', 'd']]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    pairs_sorted_by_value = sorted(items.items(), key=lambda pair: -pair[1])
    for (item, value) in pairs_sorted_by_value:
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += value
    return bins
The algorithm is the same in both cases, so there is a lot of duplicate code. My main goal is to avoid code duplication. How can I write the algorithm once, but still be able to use it in both ways?
One solution I thought of was to write a function that accepts a list of pairs:
def partition_pairs(num_of_bins: int, pairs: list[tuple[str, int]]) -> list[list[str]]:
    """
    Partition the given items into bins using the greedy number partitioning algorithm.

    >>> partition_pairs(2, [("a",1), ("b",2), ("c",3), ("d",3), ("e",5), ("f",9), ("g",9)])
    [['f', 'e', 'b'], ['g', 'c', 'd', 'a']]
    >>> partition_pairs(3, [("a",1), ("b",2), ("c",3), ("d",3), ("e",5), ("f",9), ("g",9)])
    [['f', 'b'], ['g', 'a'], ['e', 'c', 'd']]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    pairs_sorted_by_value = sorted(pairs, key=lambda pair: -pair[1])
    for (item, value) in pairs_sorted_by_value:
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += value
    return bins
And then use it like this:
def partition_list(num_of_bins: int, items: list[int]) -> list[list[int]]:
    return partition_pairs(num_of_bins, [(item, item) for item in items])

def partition_dict(num_of_bins: int, items: dict[str, int]) -> list[list[str]]:
    return partition_pairs(num_of_bins, items.items())
This solves the code duplication, but it creates a new problem: data duplication. When calling partition_list, the data is duplicated, so that if the input list has 1,000,000 integers, a temporary list with 2,000,000 integers is created.

Is there a way to avoid both code duplication and data duplication?
- Will you actually be processing 1,000,000 integers? If so, numpy or another language will be called for. – Reinderien, Feb 10, 2022 at 22:18
- Where do the input dicts and lists come from? What are their typical sizes? Can you show examples? How will the output be used? What is the actual domain problem being solved? – Reinderien, Feb 10, 2022 at 22:19
- Will your items always be positive? – Reinderien, Feb 10, 2022 at 22:46
- Recommended reading: stackoverflow.com/questions/14120729/… – Reinderien, Feb 10, 2022 at 22:53
- Rather than recomputing the smallest bin each time, put the bins in a priority queue or heap so you can always extract the minimum. But probably one of the algorithms on the Multiway wikipedia page is better. – Snowbody, Feb 11, 2022 at 1:48
1 Answer
The two functions differ mainly in their for-loop lines. Let's start by making a few cosmetic edits to increase the parallelism: (1) drop the pairs_sorted_by_value variable; (2) unpack the dict tuple explicitly; and (3) create a similar unpacking line for the list function.
# partition_list()
for item in sorted(items, reverse=True):
    item, value = (item, item)

# partition_dict()
for item in sorted(items.items(), key=lambda pair: -pair[1]):
    item, value = item
After those changes, we can clearly see the behavioral differences between the two functions: they need different sorting logic and different unpacking logic. Both are easily expressed as lambdas that can be passed to a utility function that does the actual work. Here's an illustration (minus the doc-strings and type declarations, purely for brevity here):
def partition_list(num_of_bins, items):
    sorter = lambda items: sorted(items, reverse=True)
    unpacker = lambda item: (item, item)
    return do_partition(num_of_bins, items, sorter, unpacker)

def partition_dict(num_of_bins, items):
    sorter = lambda items: sorted(items.items(), key=lambda pair: -pair[1])
    unpacker = lambda item: item
    return do_partition(num_of_bins, items, sorter, unpacker)

def do_partition(num_of_bins, items, sorter, unpacker):
    rng = range(num_of_bins)
    bins = [[] for i in rng]
    sums = [0 for i in rng]
    for item in sorter(items):
        item, value = unpacker(item)
        # Index of least full bin.
        i = min(rng, key=lambda j: sums[j])
        bins[i].append(item)
        sums[i] += value
    return bins
Speaking of brevity, do yourself a favor and drop the onerously long index_of_least_full_bin variable name. This is a context where a super-short name (optionally plus a comment) is better, at no loss of readability. In the same spirit, a convenience variable for range(num_of_bins) can further lighten up the code and enhance readability.
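As an aside on performance rather than deduplication: one of the question's comments suggests keeping the bins in a heap, so the least-full bin can be popped in O(log k) instead of rescanned in O(k) per item. A minimal sketch of that idea for the plain-list variant (the name partition_heap is hypothetical; tuple ordering on (sum, index) preserves the same lowest-index tie-breaking as min):

```python
import heapq

def partition_heap(num_of_bins: int, items: list[int]) -> list[list[int]]:
    # Heap of (current_sum, bin_index); the least-full bin is always at the root.
    # Ties on sum fall back to comparing bin_index, matching min()'s behavior.
    heap = [(0, i) for i in range(num_of_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_of_bins)]
    for item in sorted(items, reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append(item)
        heapq.heappush(heap, (total + item, i))
    return bins
```

With the question's doctest data this produces the same partitions as partition_list, e.g. partition_heap(2, [1,2,3,3,5,9,9]) gives [[9, 5, 2], [9, 3, 3, 1]].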
- Great ideas, thanks! Instead of using two parameters, sorter and unpacker, I thought of using a single parameter, evaluator, that maps an item to its value. So the loop would be for item in sorted(items, key=evaluator, reverse=True): ... bins[i].append(item); sums[i] += evaluator(item). Then partition_list would be do_partition(num_of_bins, items, lambda item: item), and partition_dict would be do_partition(num_of_bins, items, lambda item: items[item]). What do you think of this alternative? – Erel Segal-Halevi, Feb 11, 2022 at 8:59
- @ErelSegal-Halevi Yes, that's an even better idea than the code I drafted. – FMc, Feb 11, 2022 at 23:48
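For completeness, the single-evaluator variant sketched in the comment above can be assembled as follows (a sketch based on the commenter's fragments, not code posted in full by either author; it reproduces the question's doctest results because Python's sorted is stable, so equal-valued items keep their original relative order):

```python
def do_partition(num_of_bins, items, evaluator):
    rng = range(num_of_bins)
    bins = [[] for _ in rng]
    sums = [0 for _ in rng]
    # evaluator maps an item to its size, so sorting and summing
    # both go through the same single callable.
    for item in sorted(items, key=evaluator, reverse=True):
        i = min(rng, key=lambda j: sums[j])  # index of least full bin
        bins[i].append(item)
        sums[i] += evaluator(item)
    return bins

def partition_list(num_of_bins, items):
    # An item is its own size.
    return do_partition(num_of_bins, items, lambda item: item)

def partition_dict(num_of_bins, items):
    # Iterating a dict yields its keys; the evaluator looks up each size.
    return do_partition(num_of_bins, items, lambda item: items[item])
```

Note that no intermediate list of pairs is built: partition_dict iterates the dict's keys directly, so both the code duplication and the data duplication from the question are avoided.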