I have a matrix (input):
| -- | c1 | c2 | c3 |
|---|---|---|---|
| r1 | AA | BB | CC |
| r2 | CC | RR | BB |
| r3 | EE | DD | FF |
| r4 | KK | DD | EE |
| r5 | DD | GG | KK |
| r6 | PP | KK |
- Let's call each matrix cell a
namespace. If two rows begin from the same namespace, then such a namespace is called areusednamespace. A reused namespace does not need to be a prefix. In the example below, namespacesAA,BBare reused twice because a better arrangement does not exist (prefix namespaceHHhas been reused 4 times):
[EE, AA, BB, CC, ...][HH, AA, BB, RR, ...][HH, ...]x(4)
- The amount of reused namespaces is a score calculated for the whole matrix. If
Nrows share a common prefix of lengthK, then(N - 1) * Kis the amount of reused namespaces. - Let's define a score for a matrix as an amount of reused namespaces for all matrix rows.
- It is allowed to reorder elements in each row. Other operations are disallowed.
- Let's define an arrangement as a configuration of the matrix where the elements in each row are reordered.
- Constraints: the largest matrix I have is 1067 rows by 19 columns.
I need to find an arrangement that maximizes the matrix score. An easier explanation is I need to sort all row elements in a way that maximizes the score. The problem is that greedy solution (sort each row by counter across the matrix) will not work for all cases.
Example 1:
[AA, CC, DD, ...][AA, CC, DD, ...](score: +3, the whole prefix is reused)[BB, CC, KK, ...](CC,KKnamespaces were reused, even though prefix namespaces differ (BBandPP)); (score +1 forCC)[PP, CC, KK, ...](score: +2;CC,KKwhere reused)
Example 2 (possible arrangement for the example matrix above; calculated by the code below):
| -- | c1 | c2 | c3 |
|---|---|---|---|
| r1 | BB | CC | AA |
| r2 | BB (score +1) | CC (score +1) | RR |
| r3 | EE | DD | FF |
| r4 | KK | DD | EE |
| r5 | KK (score +1) | DD (score +1) | GG |
| r6 | KK (score +1) | PP |
I implemented a naive version in Python.
First, I defined a ranker returning a score for a matrix arrangement. I used a prefix tree to get the count of reused prefixes (it is a simplification of what I need):
class Trie:
def __init__(self, matrix: list[list[str]]):
self.root = {}
self.score = 0
for row in matrix:
self.add(row)
def add(self, row: list[str]):
root = self.root
for word in row:
if (child := root.get(word)) is not None:
if word: # handle padding empty strings
self.score += 1
else:
child = {}
root[word] = child
root = child
Second, I get all possible permutations across all columns and rows:
def generate_permutations(matrix: list[list[str]]):
num_cols = len(matrix[0])
col_indices = [*range(num_cols)]
col_permutations = [*permutations(col_indices)]
row_permutations = product(col_permutations, repeat=len(matrix))
for row_permutation in row_permutations:
permuted_matrix = []
for i, row in enumerate(matrix):
permuted_row = [row[col_idx] for col_idx in row_permutation[i]]
permuted_matrix.append(permuted_row)
yield permuted_matrix
Finally, I find the best arrangement based on the ranker score:
def namespace_sort(matrix: list[list[str]]):
best_arrangement = copy.deepcopy(matrix)
best_arrangement_score = Trie(best_arrangement).score
for arrangement in generate_permutations(matrix):
score = Trie(arrangement).score
if score > best_arrangement_score:
best_arrangement_score = score
best_arrangement = arrangement
return best_arrangement
This algorithm will never work for hundreds of rows and columns.
Is there a known optimal algorithm I can use to solve the problem? Otherwise how can I optimize the solution?
1 Answer 1
Taking two rows in 1. to mean two consecutive rows, it would seem to suffice to establish counts of each namespace in either row and permute the subsequent row such that common ones appear in the same column.
I would not call this Reorder columns in a 2d matrix.
I have no idea how repeated subarrays fit into the picture.
def namespace_align(matrix: list[list[str]]):
''' Return a matrix with same values in subsequent rows
column aligned where possible.
'''
reference = matrix[0][:]
ref_counts = Counter(reference)
arrangement = [reference]
#
for row in matrix[1:]:
counts = Counter(row)
fresh = (counts - ref_counts).elements()
arranged = []
for namespace in reference:
if 0 < counts[namespace]:
arranged.append(namespace)
counts[namespace] -= 1
else:
arranged.append(next(fresh))
arrangement.append(arranged)
reference, ref_counts = arranged, counts
#
return arrangement
Reorder columns, the procedure descriptionall possible permutations across all columns and rows(which I don't quite see ingenerate_permutations()), and there's a definition of arrangement in5.. The two examples contradict each other: in Example 1, CC & KK are mentioned to increase score, different c1 value notwithstanding. In Example 2, the DD in r4, c2 is not annotated to increase score - which I think a better fit withIf two rows begin from the same .... Without this constraint, how would reordering matrix columns change score? $\endgroup$