With huge help from Stack Overflow, I have created code which takes possible swap matches with given weights as an input and outputs a set of matches which gives the highest weight with the highest number of swaps.
A swap is a duty being transferred from one person to another. Every person needs to have exactly one duty.
The difference with a weighted graph is the fact that the highest number of swaps has higher priority.
To do this I multiply every swap with a high number and add the weight of that swap for the LpMaximize
.
The code selects a set of graph cycles which gives the highest number of total swaps. The cycle length is limited by max_cycle_order
.
The selected cycles are added to the result
array.
Questions
- The method of introducing the weights and prioritizing feels cumbersome. How could this be improved?
- It feels like formatting the output takes too many steps. How could this be improved?
- With a high max_cycle_order and a high number of matches (7+ and 700+) the code takes a long time to run. In my use case at the moment not a problem because I'm expecting to use max_cycle_order between 2 and 5 but I'm interested in any ideas to limit the runtime for a high
max_cycle_order
. - Any other tips?
import networkx
import pulp
from datetime import datetime
def optimize(matches, max_steps=2):
start_time = datetime.now()
# find sum of all weights to be used to prioritize max swaps
max_weight = sum(weight for (match_id, a, b, weight) in matches) + 1
graph = networkx.DiGraph()
for (match_id, a, b, weight) in matches:
graph.add_edge(a, b)
# default is one-on-one swaps (max_cycle_order = 2)
max_cycle_order = 2
if isinstance(max_steps, int):
max_cycle_order = max_steps
if max_steps < 2:
max_cycle_order = 2
if max_steps > 6:
max_cycle_order = 6
cycles = tuple(networkx.simple_cycles(graph, length_bound=max_cycle_order))
cycle_names = ['_'.join(c) for c in cycles]
# Create weight for each cycle to be used by LP solver
cycle_orders = []
for c in cycles:
sum_weights = 0
for vertex in c:
for (match_id, v1, v2, weight) in matches:
if vertex == v2:
sum_weights += weight
break
cycle_orders.append(max_weight * len(c) + sum_weights)
selectors = pulp.LpVariable.matrix(
name='cycle', indices=cycle_names, cat=pulp.LpBinary,
)
prob = pulp.LpProblem(name='swaps', sense=pulp.LpMaximize)
prob.setObjective(pulp.lpDot(selectors, cycle_orders))
for vertex in graph.nodes:
# create list of all cycles where vertex (request) is used
group = [
selector
for selector, cycle in zip(selectors, cycles)
if vertex in cycle
]
# maximize total of these cycles to 1 or smaller to not double use a request
prob.addConstraint(
name=f'excl_{vertex}',
constraint=pulp.lpSum(group) <= 1,
)
prob.solve()
assert prob.status == pulp.LpStatusOptimal
# process selected cycles for output
print('Selected cycles:')
swap_count = 0
result = []
for cycle, selector in zip(cycles, selectors):
if selector.value() > 0.5:
cycle_result = []
swap_count += len(cycle)
print(', '.join(cycle))
for c in range(0, len(cycle)):
d = c + 1
if d > len(cycle) - 1:
d = 0
for (match_id, r1, r2, weight) in matches:
if cycle[c] == r1 and cycle[d] == r2:
cycle_result.append({'id': match_id, 'from': r1, 'to': r2})
break
result.append(cycle_result)
end_time = datetime.now()
runtime = (end_time - start_time).total_seconds()
print('Duration: {}'.format(end_time - start_time))
output = {
"status": 1,
"result": result,
"swapCount": swap_count,
"maxSteps": max_cycle_order,
"runtime": runtime
}
return output
# test function call
print(optimize([
('id1', 'Amy', 'Blake', 2), ('id2', 'Blake', 'Claire', 5), ('id3', 'Claire', 'Drew', 5), ('id4', 'Drew', 'Emma', 7),
('id5', 'Emma', 'Flynn', 3), ('id6', 'Flynn', 'Gabi', 4), ('id7', 'Gabi', 'Hana', 5), ('id8', 'Hana', 'Izzy', 3),
('id16', 'Izzy', 'Jill', 6), ('id17', 'Jill', 'Amy', 3),
# one on one
('id9', 'Blake', 'Amy', 3), ('id10', 'Claire', 'Blake', 2), ('id11', 'Emma', 'Drew', 5),
# three point
('id12', 'Gabi', 'Emma', 7), ('id13', 'Drew', 'Blake', 7),
# four point
('id14', 'Flynn', 'Claire', 1), ('id15', 'Jill', 'Gabi', 4)
], 4))
-
1\$\begingroup\$ Re. long runtime, have you profiled? You'll need to determine whether the slow part is NetworkX or PuLP. \$\endgroup\$Reinderien– Reinderien2024年04月10日 19:10:16 +00:00Commented Apr 10, 2024 at 19:10
-
\$\begingroup\$ On my M2 MacBook Air a set of 730 matches resulting in 39 swaps with max_cycle_order 7 takes 23 seconds. networkx is done after 2. CBC Solver states it took 12 seconds but probably in total it's a bit more. \$\endgroup\$Oehoe– Oehoe2024年04月10日 19:31:41 +00:00Commented Apr 10, 2024 at 19:31
3 Answers 3
helpers
This optimize()
is Too Long, plus it would benefit from a docstring.
To read it one must vertically scroll.
It defines nearly three dozen locals.
Psychologists are not in complete agreement whether the Gentle Reader
can simultaneously hold seven things in short-term memory or perhaps
a few more, but past thirty we have definitely exceeded the limit.
Rely on Extract Helper to produce get_cycles()
.
Break out the for
loops as (unit testable) helpers, too.
A desirable side effect will be that local variables
go out-of-scope when we return
from helper,
leaving the reader with fewer concerns to keep track of.
loss function
The big thing I was missing, from the zero-length docstring and from the Review Context, was this. Suppose we have six people listed, and they all change their duty. Would it be cheaper to perform three pair-wise swaps, or a pair of three-person swaps?
Describe the business context and spell out the loss function, so maintenance engineers will be less likely to introduce code defects when they make edits in the coming months.
test suite
The print()
is very nice and helpful.
But it requires one to ignore some solver debugs,
and to eyeball a dict
of graph edges when
deciding if the output is correct or not.
An automated test "knows" the right answer
and can display a Green bar by itself.
Consider hardcoding the expected answer
into your test code.
Consider doing import unitttest
.
-
\$\begingroup\$ Before reading up on docstrings and unittest.. I have not considered a loss function. In the bussinesscontext it would be better to have three pair-wise swaps. I can probably introduce this in the cycle_orders array which is used in the LpMaximize problem.
cycle_orders.append(max_weight * len(c) + sum_weights)
\$\endgroup\$Oehoe– Oehoe2024年04月10日 19:48:41 +00:00Commented Apr 10, 2024 at 19:48
The following suggestions are mostly for style as opposed to performance.
Documentation
As the previous answer mentions, add a docstring to the function describing its purpose, inputs and outputs.
Intent
The 2 if
statements are not independent of each other:
if max_steps < 2:
max_cycle_order = 2
if max_steps > 6:
max_cycle_order = 6
They should be combined in an if/else
:
if max_steps < 2:
max_cycle_order = 2
elif max_steps > 6:
max_cycle_order = 6
This better conveys the intent, and it is more efficient.
Add whitespace around the assignment operator for better readability. Change:
name=f'excl_{vertex}',
to:
name = f'excl_{vertex}',
DRY
This code shows up 3 times:
len(cycle)
Set it to a variable, then use the variable:
num_cycles = len(cycle)
Simpler
There is no need for the 0 in: for c in range(0, len(cycle)):
It is simpler as:
for c in range(len(cycle)):
It is simpler to use an f-string. Change:
print('Duration: {}'.format(end_time - start_time))
to:
print(f'Duration: {end_time - start_time}')
You should move your dict-rendering code out of the main solve function, along with runtime calculation and printing.
I recall that this was intended for JSON, so don't status=1
; JSON has booleans so use True
. Also, I think it would make more sense to perform an outer try/catch/assign status=False
than to either throw-or-return. Otherwise, why is status
there?
Don't use datetime.now()
for performance timing. Use time.monotonic
for long scales like this, or time.perf_counter()
for shorter scales if you're measuring pieces of code in a more granular manner.
-
\$\begingroup\$ I planned to use status=0 for any errors, for example status=1 when the solver gets a result but it might not be optimal (future use) and status=2 for an optimal solution. status=0 is set in the catch block of the code which calls the optimize function. \$\endgroup\$Oehoe– Oehoe2024年04月11日 07:22:33 +00:00Commented Apr 11, 2024 at 7:22