Reorder columns in a 2d matrix to maximize the count of all repeated subarrays across all rows

Question 1

I have a matrix (input):

--	c1	c2	c3
r1	AA	BB	CC
r2	CC	RR	BB
r3	EE	DD	FF
r4	KK	DD	EE
r5	DD	GG	KK
r6	PP	QQ	KK

Let's call each matrix cell a namespace. If two rows begin from the same namespace, then such a namespace is called a reused namespace. A reused namespace does not need to be a prefix. In the example below, namespaces AA, BB are reused twice because a better arrangement does not exist (prefix namespace HH has been reused 4 times):

[EE, AA, BB, CC, ...]
[HH, AA, BB, RR, ...]
[HH, ...] x(4)

The amount of reused namespaces is a score calculated for the whole matrix. If N rows share a common prefix of length K, then (N - 1) * K is the amount of reused namespaces.
Let's define a score for a matrix as an amount of reused namespaces for all matrix rows.
It is allowed to reorder elements in each row. Other operations are disallowed.
Let's define an arrangement as a configuration of the matrix where the elements in each row are reordered.
Constraints: the largest matrix I have is 1067 rows by 19 columns.

I need to find an arrangement that maximizes the matrix score. An easier explanation is I need to sort all row elements in a way that maximizes the score. The problem is that greedy solution (sort each row by counter across the matrix) will not work for all cases.

Example 1:

[AA, CC, DD, ...]
[AA, CC, DD, ...] (score: +3, the whole prefix is reused)
[BB, CC, KK, ...] (CC, KK namespaces were reused, even though prefix namespaces differ (BB and PP)); (score +1 for CC)
[PP, CC, KK, ...] (score: +2; CC, KK where reused)

Example 2 (possible arrangement for the example matrix above; calculated by the code below):

--	c1	c2	c3
r1	BB	CC	AA
r2	BB (score +1)	CC (score +1)	RR
r3	EE	DD	FF
r4	KK	DD	EE
r5	KK (score +1)	DD (score +1)	GG
r6	KK (score +1)	PP	QQ

I implemented a naive version in Python.

First, I defined a ranker returning a score for a matrix arrangement. I used a prefix tree to get the count of reused prefixes (it is a simplification of what I need):

class Trie:
 def __init__(self, matrix: list[list[str]]):
 self.root = {}
 self.score = 0
 for row in matrix:
 self.add(row)
 def add(self, row: list[str]):
 root = self.root
 for word in row:
 if (child := root.get(word)) is not None:
 if word: # handle padding empty strings
 self.score += 1
 else:
 child = {}
 root[word] = child
 root = child

Second, I get all possible permutations across all columns and rows:

def generate_permutations(matrix: list[list[str]]):
 num_cols = len(matrix[0])
 col_indices = [*range(num_cols)]
 col_permutations = [*permutations(col_indices)]
 row_permutations = product(col_permutations, repeat=len(matrix))
 for row_permutation in row_permutations:
 permuted_matrix = []
 for i, row in enumerate(matrix):
 permuted_row = [row[col_idx] for col_idx in row_permutation[i]]
 permuted_matrix.append(permuted_row)
 yield permuted_matrix

Finally, I find the best arrangement based on the ranker score:

def namespace_sort(matrix: list[list[str]]):
 best_arrangement = copy.deepcopy(matrix)
 best_arrangement_score = Trie(best_arrangement).score
 for arrangement in generate_permutations(matrix):
 score = Trie(arrangement).score
 if score > best_arrangement_score:
 best_arrangement_score = score
 best_arrangement = arrangement
 return best_arrangement

This algorithm will never work for hundreds of rows and columns.

Is there a known optimal algorithm I can use to solve the problem? Otherwise how can I optimize the solution?

Question 2

How do you define "the amount of reused namespaces"? What do you mean by "an arrangement"? What is the input to the algorithm, and what is the desired output? Are you given a matrix, and you want to re-order it? Or are we allowed to fill in the matrix in any way we want? What are the constraints on what matrix we're allowed to produce or what kinds of re-orderings are allowed? I don't understand what the examples are trying to indicate or what the notation in the examples is.

Question 3

Rather than using "EDIT:" and appending more information, we would prefer that you revise the question so it reads well and is understandable for people who encounter this page for the first time. See cs.meta.stackexchange.com/q/657/755.

Question 4

Please add the score for Example 1**&**2.

Question 5

@greybeard, I added a sample. Please keep in mind that I do not have a perfect scoring implementation, and these examples show what I have to achieve.

Question 6

The title mentions Reorder columns, the procedure description all possible permutations across all columns and rows (which I don't quite see in generate_permutations()), and there's a definition of arrangement in 5.. The two examples contradict each other: in Example 1, CC & KK are mentioned to increase score, different c1 value notwithstanding. In Example 2, the DD in r4, c2 is not annotated to increase score - which I think a better fit with If two rows begin from the same .... Without this constraint, how would reordering matrix columns change score?

Question 7

Taking two rows in 1. to mean two consecutive rows, it would seem to suffice to establish counts of each namespace in either row and permute the subsequent row such that common ones appear in the same column.
I would not call this Reorder columns in a 2d matrix.
I have no idea how repeated subarrays fit into the picture.

def namespace_align(matrix: list[list[str]]):
 ''' Return a matrix with same values in subsequent rows
 column aligned where possible.
 '''
 reference = matrix[0][:]
 ref_counts = Counter(reference)
 arrangement = [reference]
 #
 for row in matrix[1:]:
 counts = Counter(row)
 fresh = (counts - ref_counts).elements()
 arranged = []
 for namespace in reference:
 if 0 < counts[namespace]:
 arranged.append(namespace)
 counts[namespace] -= 1
 else:
 arranged.append(next(fresh))
 arrangement.append(arranged)
 reference, ref_counts = arranged, counts
 #
 return arrangement

greybeard 1,1942 gold badges10 silver badges24 bronze badges · Answer 1 · 2023-04-28 06:12:40Z

Taking two rows in 1. to mean two consecutive rows, it would seem to suffice to establish counts of each namespace in either row and permute the subsequent row such that common ones appear in the same column.
I would not call this Reorder columns in a 2d matrix.
I have no idea how repeated subarrays fit into the picture.

def namespace_align(matrix: list[list[str]]):
 ''' Return a matrix with same values in subsequent rows
 column aligned where possible.
 '''
 reference = matrix[0][:]
 ref_counts = Counter(reference)
 arrangement = [reference]
 #
 for row in matrix[1:]:
 counts = Counter(row)
 fresh = (counts - ref_counts).elements()
 arranged = []
 for namespace in reference:
 if 0 < counts[namespace]:
 arranged.append(namespace)
 counts[namespace] -= 1
 else:
 arranged.append(next(fresh))
 arrangement.append(arranged)
 reference, ref_counts = arranged, counts
 #
 return arrangement

Stack Exchange Network

Reorder columns in a 2d matrix to maximize the count of all repeated subarrays across all rows

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Reorder columns in a 2d matrix to maximize the count of all repeated subarrays across all rows

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions