String Equivalence: Counting Substrings

Question 1

Background

Two strings s and t are called k-Abelian equivalent (shortened to k-equivalent in the following) for a positive integer k if the following conditions hold:

The length-k-1 prefixes of s and t are equal.
The length-k-1 suffixes of s and t are equal.
The strings s and t have the same multisets of length-k contiguous substrings. In other words, every string of length k occurs an equal number of times in s and t. Overlapping occurrences are counted as distinct.

Note that k-equivalent strings are always k-1-equivalent if k > 1, and the third condition implies that s and t have the same length. It is also known, and possibly useful, that the three conditions above are equivalent to the following one:

Every non-empty string of length k or k-1 occurs an equal number of times in s and t.

An Example

For example, consider the strings s = "abbaaabb" and t = "aabbaabb". Because the strings have the same number of each character, 4 of a and 4 of b, they are 1-equivalent. Consider then 2-equivalence. Their first and last characters are the same (both strings begin with a and end with b), so the first two conditions are satisfied. The length-2 substrings of s are aa (occurs twice), ab (twice), ba (once), and bb (twice), and those of t are exactly the same. This means that the strings are 2-equivalent. However, since their second letters are different, the strings are not 3-equivalent.

Input

Two alphanumeric strings s and t of the same length n > 0.

Output

The largest integer k between 1 and n (inclusive) such that s and t are k-equivalent. If no such k exists, the output is 0. In the above example, the correct output would be 2.

Rules

You can write a full program or a function. The lowest byte count wins, and standard loopholes are disallowed. Crashing on malformed input is perfectly fine.

Test Cases

"abc" "def" -> 0
"abbaabbb" "aabbbaab" -> 0
"abbaaabb" "aabbbaab" -> 2
"abbaaabb" "aabbaabb" -> 2
"aaabbaabb" "aabbaaabb" -> 3
"aaabbaabb" "aaabbaabb" -> 9
"abbabaabbaababbabaababbaabbabaab" "abbabaabbaababbaabbabaababbabaab" -> 9
"yzxyxx" "yxyzxx" -> 2
"xxxxxyxxxx" "xxxxyxxxxx" -> 5

Question 2

Why is the solution for the third case "abbaaabb" "aabbbaab" = 1? The prefixes a, a and the suffixes b, b are the same, and the substrings aa, ab, bb appear each twice, and the substring ba appears once in each string.

Question 3

@Jakube prefix of N length

Question 4

@Optimizer It says: "The length-k-1 prefixes of s and t are equal."

Question 5

@Jakube Oh right

Question 6

@Jakube Good find. There may be something wrong with my reference implementation, or it may be my mistake. I'll change it to a 2 for now, and investigate this later.

Question 7

Pyth: (削除) 26 (削除ここまで) 22 bytes

lfq_TTCmmSm<>dbhkldldQ

Input are two strings like "abc","def". Try it online: Pyth Compiler/Executor

Explanation:

Quite some complicated code. But I'll start easy.

First we want to create all substrings of length k of the string d.

Sm<>dbhkld
 m ld apply a function ... to all elements b 
 in the list [0, 1, 2, ..., len(d)-1]
 >db take the string d starting from the position b
 < hk but only hk = k + 1 characters
S Sort the resulting list.

For the string "acbde" and k = 3 this results in ['abc', 'bde', 'cbd', 'de', 'e']. Notice, that 'de' and 'e' are part of this list. This will be quite important in a bit.

I want the substrings for each k, one map again.

mSm<>dbhkldld
m ld apply a function (create all substrings of lenght k) to 
 all elements k in the list [0, 1, 2, ..., len(d)-1]

For the string "acbd" this results in [['a', 'b', 'c', 'd'], ['ac', 'bd', 'cb', 'd'], ['acb', 'bd', 'cbd', 'd'], ['acbd', 'bd', 'cbd', 'd']].

I want this for both strings, another map.

mmSm<>dbhkldldQ
m Q apply the function (see above) for all d in input()

[[['a', 'b', 'c'], ['ab', 'bc', 'c'], ['abc', 'bc', 'c']], [['d', 'e', 'f'], ['de', 'ef', 'f'], ['def', 'ef', 'f']]] for the input "abc","def", and after a zip C, this looks [(['a', 'b', 'c'], ['d', 'e', 'f']), (['ab', 'bc', 'c'], ['de', 'ef', 'f']), (['abc', 'bc', 'c'], ['def', 'ef', 'f'])].

Now comes the exciting part: We are of course only interest in the ks, where the substrings of both words are the same.

lfq_TTCmmSm<>dbhkldldQ
 f I filter the list for elements T, where
 q_TT reversed(T) == T
 (basically T[0] == T[1])
l print the lenght of the resulting list.
 (= the number of times, where the substrings of length k 
 of string 1 are equal to the substrings of 
 length k of string 2)

Since we also have the suffixes of length < k in Ts, this also checks, if the suffixes are equal.

The only thing, which I don't check is, if the prefixes are equal. But as it turns out, this isn't necessary. Lemma 2.3 (4<=>5) of the linked paper says: If the substrings of length k are the same, than pref_{k-1}(string 1) = pref_{k-1}(string 2) is equivalent to suff_{k-1}(string 1) = suff_{k-1}(string 2). Since the suffixes are equal, the prefixes are also equal.

Question 8

the 'ab' in g"abbab"3 is from the last two characters, not the prefix. So no, you are not correctly following the spec as you do not check for k-1 length prefix at all

Question 9

@Optimizer Lemma 2.3 in the linked paper does the job. I'll add some more explanation.

Question 10

Are you saying that checking for k-1 length prefix is not at all required ?

Question 11

@Optimizer Exactly

Question 12

@Zgarb Thanks for accepting my answer. Too sad, this question didn't get more recognition. I found this was one of the best challenges in a while, and am quite proud of my solution.

Question 13

CJam, (削除) 51 49 47 41 (削除ここまで) 34 bytes

WqN/:Q0=,,{Qf{_,,\f>\)f<$}~=},+W=)

Just to get something started. This can surely be golfed a lot. Looking at a different algorithm too.

Input goes as the two strings without quotes on separate lines.

Example:

aaabbaabb
aabbaaabb

Output:

Try it online here

Jakube 22k3 gold badges28 silver badges108 bronze badges · Accepted Answer · 2015-02-06 17:11:24Z

Pyth: (削除) 26 (削除ここまで) 22 bytes

lfq_TTCmmSm<>dbhkldldQ

Input are two strings like "abc","def". Try it online: Pyth Compiler/Executor

Explanation:

Quite some complicated code. But I'll start easy.

First we want to create all substrings of length k of the string d.

Sm<>dbhkld
 m ld apply a function ... to all elements b 
 in the list [0, 1, 2, ..., len(d)-1]
 >db take the string d starting from the position b
 < hk but only hk = k + 1 characters
S Sort the resulting list.

For the string "acbde" and k = 3 this results in ['abc', 'bde', 'cbd', 'de', 'e']. Notice, that 'de' and 'e' are part of this list. This will be quite important in a bit.

I want the substrings for each k, one map again.

mSm<>dbhkldld
m ld apply a function (create all substrings of lenght k) to 
 all elements k in the list [0, 1, 2, ..., len(d)-1]

For the string "acbd" this results in [['a', 'b', 'c', 'd'], ['ac', 'bd', 'cb', 'd'], ['acb', 'bd', 'cbd', 'd'], ['acbd', 'bd', 'cbd', 'd']].

I want this for both strings, another map.

mmSm<>dbhkldldQ
m Q apply the function (see above) for all d in input()

[[['a', 'b', 'c'], ['ab', 'bc', 'c'], ['abc', 'bc', 'c']], [['d', 'e', 'f'], ['de', 'ef', 'f'], ['def', 'ef', 'f']]] for the input "abc","def", and after a zip C, this looks [(['a', 'b', 'c'], ['d', 'e', 'f']), (['ab', 'bc', 'c'], ['de', 'ef', 'f']), (['abc', 'bc', 'c'], ['def', 'ef', 'f'])].

Now comes the exciting part: We are of course only interest in the ks, where the substrings of both words are the same.

lfq_TTCmmSm<>dbhkldldQ
 f I filter the list for elements T, where
 q_TT reversed(T) == T
 (basically T[0] == T[1])
l print the lenght of the resulting list.
 (= the number of times, where the substrings of length k 
 of string 1 are equal to the substrings of 
 length k of string 2)

Since we also have the suffixes of length < k in Ts, this also checks, if the suffixes are equal.

The only thing, which I don't check is, if the prefixes are equal. But as it turns out, this isn't necessary. Lemma 2.3 (4<=>5) of the linked paper says: If the substrings of length k are the same, than pref_{k-1}(string 1) = pref_{k-1}(string 2) is equivalent to suff_{k-1}(string 1) = suff_{k-1}(string 2). Since the suffixes are equal, the prefixes are also equal.

the 'ab' in g"abbab"3 is from the last two characters, not the prefix. So no, you are not correctly following the spec as you do not check for k-1 length prefix at all
@Optimizer Lemma 2.3 in the linked paper does the job. I'll add some more explanation.
Are you saying that checking for k-1 length prefix is not at all required ?
@Zgarb Thanks for accepting my answer. Too sad, this question didn't get more recognition. I found this was one of the best challenges in a while, and am quite proud of my solution.

Stack Exchange Network

String Equivalence: Counting Substrings

Background

An Example

Input

Output

Rules

Test Cases

2 Answers 2

Pyth: (削除) 26 (削除ここまで) 22 bytes

Explanation:

CJam, (削除) 51 49 47 41 (削除ここまで) 34 bytes

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

String Equivalence: Counting Substrings

Background

An Example

Input

Output

Rules

Test Cases

2 Answers 2

Pyth: (削除) 26 (削除ここまで) 22 bytes

Explanation:

CJam, (削除) 51 49 47 41 (削除ここまで) 34 bytes

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions