I have two functions in Python that do the same thing: they partition a set of items of different sizes into a given number of subsets ("bins"), using an algorithm called greedy number partitioning. The algorithm works as follows: it loops over the items from large to small, and puts the next item into a bin that currently contains the smallest total size. The sizes are always positive integers.
Currently, I have two variants of this function. One variant just gets a list of the sizes, and returns a partition of the sizes:
def partition_list(num_of_bins: int, items: list[int]) -> list[list[int]]:
    """
    Partition the given items using the greedy number partitioning algorithm.

    >>> partition_list(2, [1,2,3,3,5,9,9])
    [[9, 5, 2], [9, 3, 3, 1]]
    >>> partition_list(3, [1,2,3,3,5,9,9])
    [[9, 2], [9, 1], [5, 3, 3]]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    for item in sorted(items, reverse=True):
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += item
    return bins
The other variant gets a dict mapping an item name to its size, and returns a partition of the item names:
def partition_dict(num_of_bins: int, items: dict[str, int]) -> list[list[str]]:
    """
    Partition the given items using the greedy number partitioning algorithm.

    >>> partition_dict(2, {"a":1, "b":2, "c":3, "d":3, "e":5, "f":9, "g":9})
    [['f', 'e', 'b'], ['g', 'c', 'd', 'a']]
    >>> partition_dict(3, {"a":1, "b":2, "c":3, "d":3, "e":5, "f":9, "g":9})
    [['f', 'b'], ['g', 'a'], ['e', 'c', 'd']]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    pairs_sorted_by_value = sorted(items.items(), key=lambda pair: -pair[1])
    for (item, value) in pairs_sorted_by_value:
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += value
    return bins
The algorithm is the same in both cases, so there is a lot of duplicate code. My main goal is to avoid code duplication. How can I write the algorithm once, but still be able to use it in both ways?
One solution I thought of was to write a function that accepts a list of pairs:
def partition_pairs(num_of_bins: int, pairs: list[tuple[str, int]]) -> list[list[str]]:
    """
    Partition the given items into bins using the greedy number partitioning algorithm.

    >>> partition_pairs(2, [("a",1), ("b",2), ("c",3), ("d",3), ("e",5), ("f",9), ("g",9)])
    [['f', 'e', 'b'], ['g', 'c', 'd', 'a']]
    >>> partition_pairs(3, [("a",1), ("b",2), ("c",3), ("d",3), ("e",5), ("f",9), ("g",9)])
    [['f', 'b'], ['g', 'a'], ['e', 'c', 'd']]
    """
    bins = [[] for i in range(num_of_bins)]
    sums = [0 for i in range(num_of_bins)]
    pairs_sorted_by_value = sorted(pairs, key=lambda pair: -pair[1])
    for (item, value) in pairs_sorted_by_value:
        index_of_least_full_bin = min(
            range(num_of_bins), key=lambda i: sums[i]
        )
        bins[index_of_least_full_bin].append(item)
        sums[index_of_least_full_bin] += value
    return bins
And then use it like this:
def partition_list(num_of_bins: int, items: list[int]) -> list[list[int]]:
    return partition_pairs(num_of_bins, [(item, item) for item in items])

def partition_dict(num_of_bins: int, items: dict[str, int]) -> list[list[str]]:
    return partition_pairs(num_of_bins, items.items())
This solves the code duplication, but it creates a new problem: data duplication. When calling partition_list, the data is duplicated, so that if the input list has 1,000,000 integers, a temporary list with 2,000,000 integers is created.

Is there a way to avoid both code duplication and data duplication?
- Will you actually be processing 1,000,000 integers? If so, numpy or another language will be called for. – Reinderien, Feb 10, 2022 at 22:18
- Where do the input dicts and lists come from? What are their typical sizes? Can you show examples? How will the output be used? What is the actual domain problem being solved? – Reinderien, Feb 10, 2022 at 22:19
- Will your items always be positive? – Reinderien, Feb 10, 2022 at 22:46
- Recommended reading: stackoverflow.com/questions/14120729/… – Reinderien, Feb 10, 2022 at 22:53
- Rather than recomputing the smallest bin each time, put the bins in a priority queue or heap so you can always extract the minimum. But probably one of the algorithms on the Multiway wikipedia page is better. – Snowbody, Feb 11, 2022 at 1:48
1 Answer
The two functions differ mainly in their for-loop lines. Let's start by making a few cosmetic edits to increase the parallelism: (1) drop the pairs_sorted_by_value variable; (2) unpack the dict tuple explicitly; and (3) create a similar unpacking line for the list function.
# partition_list()
for item in sorted(items, reverse=True):
    item, value = (item, item)

# partition_dict()
for item in sorted(items.items(), key=lambda pair: -pair[1]):
    item, value = item
After those changes, we can clearly see the behavioral differences between the two functions: they need different sorting logic and different unpacking logic. Both are easily expressed as lambdas that can be passed to a utility function that does the actual work. Here's an illustration (minus the doc-strings and type declarations, purely for brevity here):
def partition_list(num_of_bins, items):
    sorter = lambda items: sorted(items, reverse=True)
    unpacker = lambda item: (item, item)
    return do_partition(num_of_bins, items, sorter, unpacker)

def partition_dict(num_of_bins, items):
    sorter = lambda items: sorted(items.items(), key=lambda pair: -pair[1])
    unpacker = lambda item: item
    return do_partition(num_of_bins, items, sorter, unpacker)

def do_partition(num_of_bins, items, sorter, unpacker):
    rng = range(num_of_bins)
    bins = [[] for i in rng]
    sums = [0 for i in rng]
    for item in sorter(items):
        item, value = unpacker(item)
        # Index of least full bin.
        i = min(rng, key=lambda j: sums[j])
        bins[i].append(item)
        sums[i] += value
    return bins
Speaking of brevity, do yourself a favor and drop the onerously long index_of_least_full_bin variable name. This is a context where a super-short name (optionally plus a comment) is better, at no loss of readability. In the same spirit, a convenience variable for range(num_of_bins) can further lighten up the code and enhance readability.
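As an aside on performance rather than deduplication: one of the question's comments suggests keeping the bins in a heap, so the least-full bin can be popped in O(log k) instead of rescanned in O(k) per item. A minimal sketch of that idea for the plain-list variant (the name partition_heap is hypothetical; tuple ordering on (sum, index) preserves the same lowest-index tie-breaking as min):

```python
import heapq

def partition_heap(num_of_bins: int, items: list[int]) -> list[list[int]]:
    # Heap of (current_sum, bin_index); the least-full bin is always at the root.
    # Ties on sum fall back to comparing bin_index, matching min()'s behavior.
    heap = [(0, i) for i in range(num_of_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_of_bins)]
    for item in sorted(items, reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append(item)
        heapq.heappush(heap, (total + item, i))
    return bins
```

With the question's doctest data this produces the same partitions as partition_list, e.g. partition_heap(2, [1,2,3,3,5,9,9]) gives [[9, 5, 2], [9, 3, 3, 1]].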
- Great ideas, thanks! Instead of using two parameters, sorter and unpacker, I thought of using a single parameter, evaluator, that maps an item to its value. So the loop would be for item in sorted(items, key=evaluator, reverse=True): ... bins[i].append(item); sums[i] += evaluator(item). Then partition_list would be do_partition(num_of_bins, items, lambda item: item), and partition_dict would be do_partition(num_of_bins, items, lambda item: items[item]). What do you think of this alternative? – Erel Segal-Halevi, Feb 11, 2022 at 8:59
- @ErelSegal-Halevi Yes, that's an even better idea than the code I drafted. – FMc, Feb 11, 2022 at 23:48
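For completeness, the single-evaluator variant sketched in the comment above can be assembled as follows (a sketch based on the commenter's fragments, not code posted in full by either author; it reproduces the question's doctest results because Python's sorted is stable, so equal-valued items keep their original relative order):

```python
def do_partition(num_of_bins, items, evaluator):
    rng = range(num_of_bins)
    bins = [[] for _ in rng]
    sums = [0 for _ in rng]
    # evaluator maps an item to its size, so sorting and summing
    # both go through the same single callable.
    for item in sorted(items, key=evaluator, reverse=True):
        i = min(rng, key=lambda j: sums[j])  # index of least full bin
        bins[i].append(item)
        sums[i] += evaluator(item)
    return bins

def partition_list(num_of_bins, items):
    # An item is its own size.
    return do_partition(num_of_bins, items, lambda item: item)

def partition_dict(num_of_bins, items):
    # Iterating a dict yields its keys; the evaluator looks up each size.
    return do_partition(num_of_bins, items, lambda item: items[item])
```

Note that no intermediate list of pairs is built: partition_dict iterates the dict's keys directly, so both the code duplication and the data duplication from the question are avoided.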