\$\begingroup\$

I have an array of integer labels in [0, n-1], say arr1 = [1, 0, 3, 2] with n = 4. I have a second array, again with integer labels in [0, n-1], say arr2 = [0, 0, 2, 3]. How can I find the re-labeling of arr2 that makes it as close as possible to arr1, measured by the number of nonzero elements of their difference? For example, here the re-labeled array is arr2* = [0, 0, 3, 2].

Here is the code that I'm already using:

from itertools import permutations
import numpy as np
arr1 = np.array([1, 0, 3, 2])
n_labels = 4
perms = list(permutations(np.arange(n_labels)))
arr2 = np.array([0, 0, 2, 3]) 
best_perm = None
best_err = 1e8
for perm in perms:
    arr2_new = [perm[el] for el in arr2]
    err = np.linalg.norm(arr2_new - arr1, ord=0)
    if err < best_err:
        best_err = err
        best_perm = perm
    if best_err == 0:
        break
arr2_new = [best_perm[el] for el in arr2]

Is there any way to make it more efficient?

UPDATED: Some notes

  1. I need the least number of nonzero elements of arr1-arr2*. That's why I use the l0-norm.

  2. I want a re-labeling scheme: every 0 should become 1, for example, and so on and so forth.

  3. I don't want the positions of the elements of arr2 to change.

  4. I also care about best_perm. Because I have another array, say c = [0.1, 1.1, 2.1, 3.1] in which each element corresponds to a label in arr2. So when I change arr2 to arr2*, I also want to change c to c* = [c[best_perm[j]] for j in range(n_labels)]

  5. The arrays arr1 and arr2 may contain repeating elements. But the elements of both are in [0, n-1].

  6. The solution may not be unique. We are interested in just one of the possible solutions.

asked Feb 4, 2022 at 20:11
\$\endgroup\$
  • \$\begingroup\$ What will your typical array lengths actually be? \$\endgroup\$ Commented Feb 4, 2022 at 21:30
  • \$\begingroup\$ Do you mean integer? Yes. Say 1000-2000. \$\endgroup\$ Commented Feb 4, 2022 at 21:31
  • \$\begingroup\$ Yes, yes, they are integer always. \$\endgroup\$ Commented Feb 4, 2022 at 21:34
  • \$\begingroup\$ I meant, every element in arr2 with value equal to 0 should become 1, for example, and so on and so forth. Regarding best_perm, please see comment 4 in the original question. \$\endgroup\$ Commented Feb 4, 2022 at 21:56
  • \$\begingroup\$ Is arr1 guaranteed to always have unique elements? \$\endgroup\$ Commented Feb 6, 2022 at 14:47

1 Answer

\$\begingroup\$

Yes. First of all, your norm() call is heavy machinery for this: with ord=0 on a 1-D array it simply counts the nonzero entries of the difference, so floating-point norm routines are uncalled for in this problem when you just want an integer mismatch count.
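For reference, with ord=0 on a 1-D array np.linalg.norm counts nonzero entries, so a direct integer comparison produces the same error count. A minimal sketch using the example arrays from the question:

```python
import numpy as np

arr1 = np.array([1, 0, 3, 2])
arr2_new = np.array([0, 0, 3, 2])

# ord=0 counts the nonzero entries of the difference vector...
err_norm = np.linalg.norm(arr2_new - arr1, ord=0)
# ...which a plain element-wise comparison expresses without float detours
err_direct = np.count_nonzero(arr2_new != arr1)

assert err_norm == err_direct == 1
```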

To put it mildly, the description of what this is actually supposed to do is troubled. For instance, rather than "labelling" I would call it "substitution". I hope the extended back-and-forth has prepared you for the kind of information reviewers need to know in future questions.

A better algorithm is definitely possible. First, frame the problem as a bipartite graph, where:

  • the left subset nodes are unique values from the original arr2
  • the right subset nodes are all possible destination values from 0 through n-1
  • the graph starts out complete and unbalanced; the left cardinality is typically smaller than the right cardinality
  • the edge weights are non-negative integers representing the number of mismatches between the source and destination
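Those edge weights can be sketched concretely for the question's small example (the names `sources` and `cost` are illustrative, not from the code below): cost[i, j] counts how many positions holding source value sources[i] in arr2 would mismatch arr1 if re-labeled to j.

```python
import numpy as np

arr1 = np.array([1, 0, 3, 2])
arr2 = np.array([0, 0, 2, 3])
n = 4                                    # labels are 0..n-1

sources = np.unique(arr2)                # left subset: unique arr2 values
# cost[i, j] = mismatches if every arr2 element equal to sources[i] becomes j
cost = np.array([
    [np.count_nonzero(arr1[arr2 == s] != j) for j in range(n)]
    for s in sources
])

# e.g. re-labeling source 2 to 3 is free: arr1 holds 3 wherever arr2 holds 2
assert cost[1, 3] == 0                   # sources[1] == 2
```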

The problem becomes: how do you prune this graph such that

  • for every left-node, there remains exactly one edge (the left subset becomes 1-regular);
  • for every right-node, there remains at most one edge;
  • the graph becomes disconnected and acyclic; and
  • the edges are chosen such that the total remaining weight is minimised.

This is a known problem - a variant of bipartite matching called min-cost matching, better known as the assignment problem. Crucially, the insight from interpreting this as a bipartite matching problem is that it can be handed to standard machinery, such as the Hungarian algorithm or, as here, mixed-integer linear programming (MIP).

MIP problems are well understood, and excellent software already exists to solve them. I recommend GLPK and its Python SWIG bindings, swiglpk.
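Because this is a standard assignment problem, off-the-shelf solvers also apply directly. As a hedged sketch of an alternative route, scipy.optimize.linear_sum_assignment accepts rectangular cost matrices (fewer sources than destinations), so it can solve the question's small example without any MIP setup; the helper names here are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

arr1 = np.array([1, 0, 3, 2])
arr2 = np.array([0, 0, 2, 3])
n = 4

sources = np.unique(arr2)
# cost[i, j]: mismatches if every arr2 element equal to sources[i] becomes j
cost = np.array([
    [np.count_nonzero(arr1[arr2 == s] != j) for j in range(n)]
    for s in sources
])

rows, cols = linear_sum_assignment(cost)      # min-cost bipartite matching
mapping = dict(zip(sources[rows].tolist(), cols.tolist()))
arr2_new = np.array([mapping[v] for v in arr2])

# the optimum for this example leaves exactly one mismatch against arr1
assert np.count_nonzero(arr2_new != arr1) == cost[rows, cols].sum() == 1
```

The solution need not be unique (note 6 in the question), so the exact labels chosen may differ while achieving the same minimal mismatch count.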

In this strategy,

  • c, the cost vector, will be non-negative integers capturing your edge weights
  • z, the objective, is the total distance
  • xs, the structural variables/columns, will be a flattened 2d matrix of boolean edge choices
  • xr, the auxiliary variables/rows, will be the node degree (sum of edge counts per node)
  • A, the constraint matrix, will be constructed to sum up the node degrees over both axes

xs, when interpreted as a 2D matrix, must be constrained to have exactly one nonzero per (source) row, and between 0 and 1 nonzeros per (dest) column. In glpk these are expressed with fixed and double bounds, respectively.
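The shape of that constraint matrix can be sketched densely in NumPy (GLPK stores it sparsely; the sizes and names here are illustrative): each column is one (source, dest) edge and touches exactly one source-degree row and one dest-degree row.

```python
import numpy as np

n_source, n_dest = 3, 4                      # sizes from the small example
A = np.zeros((n_source + n_dest, n_source * n_dest), dtype=int)
for y in range(n_source):
    for d in range(n_dest):
        x = y * n_dest + d                   # flattened edge index
        A[y, x] = 1                          # source-degree row (bounded == 1)
        A[n_source + d, x] = 1               # dest-degree row (bounded 0..1)

# every edge (column) contributes to exactly one source row and one dest row
assert (A.sum(axis=0) == 2).all()
# every source row sums over all n_dest candidate edges
assert (A[:n_source].sum(axis=1) == n_dest).all()
```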

It takes a little bit of code to express all of this, but it's quite worth it: when I run a problem of n=500 on my machine it completes in about two seconds.

Suggested

import random
from collections import defaultdict
from timeit import timeit
from typing import Optional, Iterator

import swiglpk as lp


class OptError(Exception):
    pass


def make_map() -> tuple[tuple[int, list[int]], ...]:
    """
    From arr1 and arr2, produce an ordered map in tuple-of-pairs format
    where the key is the unique source arr2 value, and the values are all
    (potentially non-unique) arr1 values to which a match will be attempted.
    """
    targets = defaultdict(list)
    for source, dest in zip(arr2, arr1):
        targets[source].append(dest)
    return tuple(targets.items())


def fill_sources() -> None:
    """
    For each of our unique arr2 source values,
    - Fill some of the GLPK problem's "auxiliary rows" of count n_source.
      These rows correspond to constraints on the arr2 source values.
    - Name those rows and apply fixed bounds: each source must get exactly
      one match.

    For each combination of source and candidate destination,
    - Reinterpret the source/destination matrix as a flat, horizontal vector.
    - Fill all of the GLPK problem's "structural columns" of count
      n_source*n_dest.
    - Name those columns and set the structural variable type as binary-value.
    - Set the corresponding objective coefficient to the mismatch cost of
      choosing this pair.
    - Populate the constraint matrix to enforce that each source gets one
      dest, and every dest gets up to one source.
    """
    row_indices = lp.intArray(3)
    row_values = lp.doubleArray(3)
    row_values[1] = row_values[2] = 1

    # For each source in the map
    for y, (source, dests) in enumerate(target_map):
        lp.glp_set_row_name(problem, y + 1, f'{source}->')
        # First n_source auxiliary rows are fixed-bounded at 1
        lp.glp_set_row_bnds(
            P=problem, i=y + 1, type=lp.GLP_FX, lb=1, ub=1,
        )

        # For each possible destination (not just those in the map)
        for candidate_dest in range(n_dest):
            x = y*n_dest + candidate_dest + 1
            # This structural variable is a binary variable
            lp.glp_set_col_kind(problem, x, lp.GLP_BV)
            lp.glp_set_col_name(problem, x, f"{source}->{candidate_dest}")

            # Objective coefficient is the sum of mismatched elements
            cost = sum(
                candidate_dest != actual_dest
                for actual_dest in dests
            )
            lp.glp_set_obj_coef(problem, x, cost)

            # Set the constraint matrix elements (sparse)
            row_indices[1] = 1 + y                          # sum the sources
            row_indices[2] = 1 + n_source + candidate_dest  # sum the destinations
            lp.glp_set_mat_col(
                P=problem, j=x, len=2, ind=row_indices, val=row_values,
            )


def fill_dests() -> None:
    """
    For each of our candidate destination values,
    - Fill the rest of the GLPK problem's "auxiliary rows" of count n_dest.
      These rows correspond to constraints on the arr1 dest values.
    - Name those rows and apply double bounds: each dest must get up to one
      match.
    """
    # For each possible destination
    for candidate_dest in range(n_dest):
        y = 1 + n_source + candidate_dest
        lp.glp_set_row_name(problem, y, f'->{candidate_dest}')
        # Last n_dest auxiliary rows are double-bounded between 0 and 1
        lp.glp_set_row_bnds(
            P=problem, i=y, type=lp.GLP_DB, lb=0, ub=1,
        )


def create() -> None:
    """
    Set some problem parameters, including objective characteristics; set the
    structural and auxiliary size; call into subroutines to fill all problem
    constraints.
    """
    lp.glp_set_prob_name(problem, 'proximal_substitution')
    lp.glp_set_obj_name(problem, 'distance')
    lp.glp_set_obj_dir(problem, lp.GLP_MIN)
    lp.glp_add_cols(problem, n_source * n_dest)
    lp.glp_add_rows(problem, n_source + n_dest)
    fill_sources()
    fill_dests()


def check_code(code: int) -> None:
    """
    If an optimisation function returned a failure code, attempt to convert
    that into a code name and raise an OptError
    """
    if code != 0:
        codes = {
            getattr(lp, k): k
            for k in dir(lp)
            if k.startswith('GLP_E')
        }
        raise OptError(f'glpk returned {codes[code]}')


def check_status(expected: int, actual: int) -> None:
    """
    If an optimisation status is not as expected, attempt to convert the
    status into a status name and raise an OptError
    """
    if expected != actual:
        statuses = {
            getattr(lp, f'GLP_{stat}'): stat
            for stat in ('OPT', 'FEAS', 'INFEAS', 'NOFEAS', 'UNBND', 'UNDEF')
        }
        raise OptError(
            f'Optimisation failed: status {statuses.get(actual)}'
        )


def solve(log_level: int = lp.GLP_MSG_ON) -> None:
    """
    Perform a two-pass solve: a simplex basis pass and then a mixed-integer
    programming (MIP) discretisation pass.
    """
    # First pass to calculate the basis segment: do not use the default MIP
    # pre-solver because we want to use the faster primal method instead of dual
    parm = lp.glp_smcp()
    lp.glp_init_smcp(parm)
    parm.msg_lev = log_level
    check_code(lp.glp_simplex(problem, parm))
    check_status(lp.GLP_FEAS, lp.glp_get_prim_stat(problem))
    check_status(lp.GLP_OPT, lp.glp_get_status(problem))

    # Second (very fast) MIP pass
    parm = lp.glp_iocp()
    lp.glp_init_iocp(parm)
    parm.msg_lev = log_level
    check_code(lp.glp_intopt(problem, parm))
    check_status(lp.GLP_OPT, lp.glp_mip_status(problem))


def construct_substitution() -> Iterator[tuple[int, int]]:
    for x in range(n_source * n_dest):
        matched = lp.glp_mip_col_val(problem, x + 1)
        if matched:
            i_source, dest = divmod(x, n_dest)
            source, _ = target_map[i_source]
            yield source, dest


def output(substitution: dict[int, int], log_file: Optional[str] = None) -> None:
    if log_file:
        lp.glp_print_mip(problem, log_file)

    print('arr2, arr1, arr2new:')
    print(' '.join(f'{d:3}' for d in arr2))
    print(' '.join(f'{d:3}' for d in arr1))
    print(' '.join(
        f'{substitution[d]:3}'
        for d in arr2
    ))


def main():
    create()
    t = timeit(solve, number=1)
    print(f'\nSolution in {t:.2}s')
    substitution = dict(construct_substitution())
    output(substitution)


if __name__ == '__main__':
    N = 500
    random.seed(0)
    arr1 = [random.randrange(N) for _ in range(N)]
    arr2 = [random.randrange(N) for _ in range(N)]
    target_map = make_map()
    n_dest = len(arr1)
    n_source = len(target_map)
    problem = lp.glp_create_prob()
    main()

Output

Shown with GLP_MSG_ALL:

GLPK Integer Optimizer 5.0
319 rows, 23800 columns, 47600 non-zeros
23800 integer variables, all of which are binary
Preprocessing...
319 rows, 23800 columns, 47600 non-zeros
23800 integer variables, all of which are binary
Scaling...
 A: min|aij| = 1.000e+00 max|aij| = 1.000e+00 ratio = 1.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 319
Solving LP relaxation...
GLPK Simplex Optimizer 5.0
319 rows, 23800 columns, 47600 non-zeros
 0: obj = 2.000000000e+02 inf = 1.180e+02 (1)
Perturbing LP to avoid stalling [299]...
 352: obj = 1.990000000e+02 inf = 0.000e+00 (0)
Removing LP perturbation [796]...
* 796: obj = 1.020000000e+02 inf = 0.000e+00 (0) 2
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+ 796: mip = not found yet >= -inf (1; 0)
+ 796: >>>>> 1.020000000e+02 >= 1.020000000e+02 0.0% (1; 0)
+ 796: mip = 1.020000000e+02 >= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
Writing MIP solution to 'problem-log.txt'...
arr2, arr1, arr2new:
 89 111 46 15 128 119 10 152 25 179 100 51 66 91 187 120 145 43 178 172 52 196 14 173 40 41 87 135 64 30 152 113 170 44 3 120 174 104 145 130 79 166 91 99 168 64 39 143 176 3 117 189 20 85 189 11 139 71 34 61 195 123 90 156 73 172 91 151 162 158 33 183 79 99 191 106 166 20 0 152 49 178 85 40 61 57 163 114 96 181 172 145 106 8 102 179 145 107 197 169 181 11 42 114 16 66 179 40 114 135 124 143 154 193 0 9 126 83 79 119 12 106 48 140 162 21 185 33 3 102 173 106 80 0 54 3 183 193 0 172 135 156 25 48 30 155 166 50 77 71 176 46 25 121 101 160 20 5 70 115 29 65 34 167 133 166 165 88 29 39 71 4 10 10 52 174 66 142 80 93 145 10 191 179 155 167 126 182 164 117 163 111 95 137 45 53 96 150 74 2
 98 194 107 10 66 130 124 103 77 122 91 149 55 129 35 72 35 193 24 158 64 136 180 154 37 79 25 186 18 175 84 120 143 25 90 111 80 156 163 52 141 122 113 133 66 15 140 3 23 184 102 181 171 160 0 156 126 85 62 186 83 180 16 48 145 56 61 36 139 114 23 20 81 130 125 27 77 141 74 180 31 140 85 138 52 154 140 150 73 113 23 152 98 81 147 61 74 47 48 47 8 156 168 66 121 17 22 173 193 33 38 9 20 179 138 174 100 180 134 70 133 60 55 173 150 107 148 70 115 126 169 164 179 91 21 83 156 29 124 150 161 85 48 62 4 187 69 29 180 56 95 43 85 109 15 25 37 178 56 11 146 162 136 154 174 18 6 31 162 48 155 147 30 100 23 94 29 9 155 5 49 47 183 31 122 53 186 15 173 5 139 108 158 25 66 17 56 18 165 77
 98 108 43 10 4 70 30 84 48 61 91 149 12 129 35 111 49 193 24 150 64 136 22 169 138 79 26 33 18 175 84 120 143 32 90 111 94 105 49 52 134 69 129 130 40 18 140 3 95 90 102 181 37 160 181 156 126 56 62 186 83 54 16 85 145 150 129 36 139 114 23 20 134 130 125 60 69 37 74 84 46 24 160 138 186 154 68 66 73 113 150 49 60 81 8 61 49 72 75 47 113 156 168 66 121 12 61 138 66 33 38 3 80 179 74 82 100 180 134 70 133 60 55 87 139 107 148 23 90 8 169 60 155 74 21 90 20 179 74 150 33 85 48 55 175 122 69 29 27 56 95 43 48 109 96 25 37 178 99 11 146 162 62 53 174 69 6 31 146 140 56 147 30 30 64 94 12 9 155 5 49 30 125 61 122 53 100 15 173 102 68 108 158 112 199 17 73 116 165 77

Abridged Log

Problem: proximal_substitution
Rows: 319
Columns: 23800 (23800 integer, 23800 binary)
Non-zeros: 47600
Status: INTEGER OPTIMAL
Objective: distance = 102 (MINimum)
 No. Row name Activity Lower bound Upper bound
------ ------------ ------------- ------------- -------------
 1 89-> 1 1 = 
 2 111-> 1 1 = 
 3 46-> 1 1 = 
 4 15-> 1 1 = 
 5 128-> 1 1 = 
...
 23796 2->195 * 0 0 1 
 23797 2->196 * 0 0 1 
 23798 2->197 * 0 0 1 
 23799 2->198 * 0 0 1 
 23800 2->199 * 0 0 1 
Integer feasibility conditions:
KKT.PE: max.abs.err = 0.00e+00 on row 0
 max.rel.err = 0.00e+00 on row 0
 High quality
KKT.PB: max.abs.err = 0.00e+00 on row 0
 max.rel.err = 0.00e+00 on row 0
 High quality
End of output
answered Feb 4, 2022 at 21:20
\$\endgroup\$
  • \$\begingroup\$ Maybe I wasn't very clear, I'm sorry about it: 1) I need the least number of nonzero elements of arr1-arr2*. That's why I use the l0-norm. 2) I want a re-labeling scheme: every 0 should become 1, for example, and so on and so forth. 3) I don't want the positions of the elements of arr2 to change. Does your solution address the above already? Could you provide a code snippet for the included example? Thanks! \$\endgroup\$ Commented Feb 4, 2022 at 21:26
  • \$\begingroup\$ I think you are missing something. My original code is correct. n_labels should be equal to the unique elements in arr1 (or arr2, equivalently). This is why, [best_perm[el] for el in arr2] did not make sense to you, because you were using a different definition of n_labels, and consequently of perms. As an example where your code fails: for arr1 = [1, 0, 3, 2, 1] and arr2 = [0, 0, 2, 3, 1], we should get arr2* = [0, 0, 3, 2, 1], i.e., every 0 in arr2 should become 1. Your code produces arr2* = [1, 0, 3, 2, 0]. In other words, the labeling of arr2 should stay the same. \$\endgroup\$ Commented Feb 5, 2022 at 0:33
  • \$\begingroup\$ If every 0 in arr2 became 1, that would produce 11231 which makes even less sense. You're going to have to spend some time in your question describing what you mean by labelling. \$\endgroup\$ Commented Feb 5, 2022 at 4:36
  • \$\begingroup\$ n_labels should be equal to the unique elements in arr1 - if so, your code is wrong because it hard-codes for 4 and performs no unique operation. \$\endgroup\$ Commented Feb 5, 2022 at 4:53
  • \$\begingroup\$ You are right, I made a typo above. At any rate, what I am asking is rather simple (in principle) and I believe that your code is very close to solving my problem. So let me try to put it in words: We have n_labels clusters and len(arr1) assignments to make. arr1 already makes some assignments. arr2 makes some other assignments. How should we relabel arr2 so that it is as close as possible to arr1? With relabeling, I mean a scheme that assigns every label of arr2 to another label in the same set. This is exactly what my code does and already solves my problem, but it is a brute-force approach. \$\endgroup\$ Commented Feb 5, 2022 at 5:26
