Longest common substring in linear time

Question 1

This challenge is about writing code to solve the following problem.

Given two strings A and B, your code should output the start and end indices of a substring of A with the following properties.

The substring of A should also match some substring of B.
There should be no longer substring of A that satisfies the first property.

For example:

A = xxxappleyyyyyyy
B = zapplezzz

The substring apple with indices 4 8 (indexing from 1) would be a valid output.

Functionality

You can assume the input will be on standard in or in a file in the local directory, that is your choice. The file format will simply be two strings, separated by a new line. The answer should be a full program and not just a function.

I would eventually like to test your code on two substrings taken from the strings in http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .

Score

This is code-golf with a twist. Your code must be run in O(n) time, where n is the total length of the input.

Languages and libraries

You can use any language which has a freely available compiler/interpreter/etc. for Linux. You should only use standard open source libraries not designed to solve this task. In case of dispute, I will count this as any library which either comes as standard with your language or one that you can install in a default ubuntu machine from a default repository.

Useful information

There are at least two ways to solve this problem in linear time. One is to first compute the suffix tree and the second is to first compute the suffix array and the LCP array.

Here is a full and (perhaps over-)detailed explanation of linear time suffix tree construction (it turns out some of the figures are messed up unfortunately). There is also a very nice SO answer about linear time suffix tree construction at https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english . It includes a link to source code too. Another detailed explanation can be found here, this time giving a full solution in C.
Section 2 of http://www.cs.cmu.edu/~guyb/realworld/papersS04/KaSa03.pdf gives a linear time suffix array construction algorithm and Appendix A has C++ source code. This answer tells you how then to compute the longest common substring https://cs.stackexchange.com/questions/9555/computing-the-longest-common-substring-of-two-strings-using-suffix-arrays . Section 5 of https://courses.csail.mit.edu/6.851/spring12/scribe/lec16.pdf which also has an associated video lecture https://courses.csail.mit.edu/6.851/spring12/lectures/L16.html also explains the same algorithm starting at 1:16:00.

Question 2

O(n) time Are you sure it is possible?

Question 3

@Lembik Sorry, but these are very complex algorithms and it's not really fun to golve 100+ lines of code.

Question 4

The article at the second link you provide under "Useful information" says that "constructing [a suffix tree] requires O(N^2) time"

Question 5

@Lembik You should just make the question [fastest-code], where the program with the best worst-case in big-oh notation wins. Then you'll at least get some answers, and in the even that someone can solve it in O(n), they'll win.

Question 6

This must be the question with the most deleted answers per valid answer...

Question 7

Python 2, 646 bytes

G=range;w=raw_input;z=L,m,h=[0]*3
s=w();t=len(s);s+='!%s#'%w();u=len(s);I=z*u
def f(s,n):
 def r(o):
	b=[[]for _ in s];c=[]
	for x in B[:N]:b[s[x+o]]+=x,
	map(c.extend,b);B[:N]=c
 M=N=n--~n/3;t=n%3%2;B=G(n+t);del B[::3];r(2);u=m=p=r(1)>r(0);N-=n/3
 for x in B*1:v=s[x:x+3];m+=u<v;u=v;B[x/3+x%3/2*N]=m
 A=1/M*z or f(B+z,M)+z;B=[x*3for x in A if x<N];J=I[r(0):n];C=G(n)
 for k in C:b=A[t]/N;a=i,j=A[t]%N*3-~b,B[p];q=p<N<(s[i:i-~b],J[i/3+b+N-b*N])>(s[j+t/M:j-~b],J[j/3+b*N]);C[k]=x=a[q];I[x]=k;p+=q;t+=1-q
 return C
S=f(map(ord,s)+z*40,u)
for i in G(u):
 h-=h>0;j=S[I[i]-1]
 while s[i+h]==s[j+h]:h+=1
 if(i<t)==(t<j)<=h>m:m=h;L=min(i,j)
print-~L,L+m

Try it online

This uses the skew algorithm described in "Simple Linear Work Suffix Array Construction" by Kärkkäinen and Sanders. The C++ implementation included in that paper already feels a little "golfy", but there is still plenty of room for making it shorter. For example, we can recurse until arriving at an array of length one, instead of short circuiting as in the paper, without violating the O(n) requirement.

For the LCP part, I followed "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" by Kasai et al.

The program outputs 1 0 if the longest common substring is empty.

Here is some development code that includes an earlier version of the program that follows the C++ implementation a bit more closely, some slower approaches for comparison, and a simple test case generator:

from random import *
def brute(a,b):
 L=R=m=0
 
 for i in range(len(a)):
 for j in range(i+m+1,len(a)+1):
 if a[i:j] in b:
 m=j-i
 L,R=i,j
 
 return L+1,R
def suffix_array_slow(s):
 S=[]
 for i in range(len(s)):
 S+=[(s[i:],i)]
 S.sort()
 return [x[1] for x in S]
def slow1(a,b):
 # slow suffix array, slow lcp
 
 s=a+'!'+b
 S=suffix_array_slow(s)
 
 L=R=m=0
 
 for i in range(1,len(S)):
 x=S[i-1]
 y=S[i]
 p=s[x:]+'#'
 q=s[y:]+'$'
 h=0
 while p[h]==q[h]:
 h+=1
 if h>m and len(a)==sorted([x,y,len(a)])[1]:
 m=h
 L=min(x,y)
 R=L+h
 
 return L+1,R
def verify(a,b,L,R):
 if L<1 or R>len(a) or a[L-1:R] not in b:
 return 0
 LL,RR=brute(a,b)
 return R-L==RR-LL
def rand_string():
 if randint(0,1):
 n=randint(0,8)
 else:
 n=randint(0,24)
 a='zyxwvutsrq'[:randint(1,10)]
 s=''
 for _ in range(n):
 s+=choice(a)
 return s
def stress_test(f):
 numtrials=2000
 for trial in range(numtrials):
 a=rand_string()
 b=rand_string()
 L,R=f(a,b)
 if not verify(a,b,L,R):
 LL,RR=brute(a,b)
 print 'failed on',(a,b)
 print 'expected:',LL,RR
 print 'actual:',L,R
 return
 print 'ok'
def slow2(a,b):
 # slow suffix array, linear lcp
 
 s=a+'!'+b+'#'
 S=suffix_array_slow(s)
 
 I=S*1
 for i in range(len(S)):
 I[S[i]]=i
 
 L=R=m=h=0
 
 for i in range(len(S)):
 if I[i]:
 j=S[I[i]-1]
 while s[i+h]==s[j+h]:
 h+=1
 if h>m and len(a)==sorted([i,j,len(a)])[1]:
 m=h
 L=min(i,j)
 R=L+h
 h-=h>0
 
 return L+1,R
def suffix_array(s,K):
 # skew algorithm
 
 n=len(s)
 s+=[0]*3
 n0=(n+2)/3
 n1=(n+1)/3
 n2=n/3
 n02=n0+n2
 adj=n0-n1
 
 def radix_pass(a,o,n=n02):
 c=[0]*(K+3)
 for x in a[:n]:
 c[s[x+o]+1]+=1
 for i in range(K+3):
 c[i]+=c[i-1]
 for x in a[:n]:
 j=s[x+o]
 a[c[j]]=x
 c[j]+=1
 
 A=[x for x in range(n+adj) if x%3]+[0]*3
 
 radix_pass(A,2)
 radix_pass(A,1)
 radix_pass(A,0)
 
 B=[0]*n02
 t=m=0
 
 for x in A[:n02]:
 u=s[x:x+3]
 m+=t<u
 t=u
 B[x/3+x%3/2*n0]=m
 
 A[:n02]=1/n02*[0]or suffix_array(B,m)
 I=A*1
 for i in range(n02):
 I[A[i]]=i+1
 
 B=[3*x for x in A if x<n0]
 radix_pass(B,0,n0)
 
 R=[]
 
 p=0
 t=adj
 while t<n02:
 x=A[t]
 b=x>=n0
 i=(x-b*n0)*3-~b
 j=B[p]
 if p==n0 or ((s[i:i+2],I[A[t]-n0+1])<(s[j:j+2],I[j/3+n0]) if b else (s[i],I[A[t]+n0])<(s[j],I[j/3])):R+=i,;t+=1
 else:R+=j,;p+=1
 
 return R+B[p:n0]
def solve(a,b):
 # linear
 
 s=a+'!'+b+'#'
 S=suffix_array(map(ord,s),128)
 
 I=S*1
 for i in range(len(S)):
 I[S[i]]=i
 
 L=R=m=h=0
 
 for i in range(len(S)):
 if I[i]:
 j=S[I[i]-1]
 while s[i+h]==s[j+h]:
 h+=1
 if h>m and len(a)==sorted([i,j,len(a)])[1]:
 m=h
 L=min(i,j)
 R=L+h
 h-=h>0
 
 return L+1,R
stress_test(solve)

Question 8

Correct me if I'm wrong, but isn't this actually 739 bytes? I copied into mothereff.in/byte-counter and deleted 2 spaces from lines 6-9, but I'm not sure if that's correct.

Question 9

@PatrickRoberts Those are tabs.

Question 10

Nice answer! You might want to have a look at GSACA a novel linear time SACA from 2016. Reference implementation is 246 lines full of comments (170 without comments) and seems very golfable. You'll find it on github.

Question 11

@MitchSchwartz I'm currently trying to stay on noPMO, so I can't feel emotions strongly right now (probably due to inbalanced brain chemicals). At the time of reading the code quickly, my syntax golfing motor spotted that, and I don't remember feeling any specific emotions. Did you think of the same thing or why the question? :) Now I'm curious.

Question 12

@TuukkaX That's an interesting response that I wasn't expecting. Well, I'm not sure if I should be phrasing this in some special way, but the fact that your original comment wasn't actually correct played some part in why I decided to ask. :)

Mitch Schwartz Mitch Schwartz 5,9091 gold badge28 silver badges26 bronze badges · Accepted Answer · 2017-01-25 04:02:05Z

Python 2, 646 bytes

G=range;w=raw_input;z=L,m,h=[0]*3
s=w();t=len(s);s+='!%s#'%w();u=len(s);I=z*u
def f(s,n):
 def r(o):
	b=[[]for _ in s];c=[]
	for x in B[:N]:b[s[x+o]]+=x,
	map(c.extend,b);B[:N]=c
 M=N=n--~n/3;t=n%3%2;B=G(n+t);del B[::3];r(2);u=m=p=r(1)>r(0);N-=n/3
 for x in B*1:v=s[x:x+3];m+=u<v;u=v;B[x/3+x%3/2*N]=m
 A=1/M*z or f(B+z,M)+z;B=[x*3for x in A if x<N];J=I[r(0):n];C=G(n)
 for k in C:b=A[t]/N;a=i,j=A[t]%N*3-~b,B[p];q=p<N<(s[i:i-~b],J[i/3+b+N-b*N])>(s[j+t/M:j-~b],J[j/3+b*N]);C[k]=x=a[q];I[x]=k;p+=q;t+=1-q
 return C
S=f(map(ord,s)+z*40,u)
for i in G(u):
 h-=h>0;j=S[I[i]-1]
 while s[i+h]==s[j+h]:h+=1
 if(i<t)==(t<j)<=h>m:m=h;L=min(i,j)
print-~L,L+m

Try it online

This uses the skew algorithm described in "Simple Linear Work Suffix Array Construction" by Kärkkäinen and Sanders. The C++ implementation included in that paper already feels a little "golfy", but there is still plenty of room for making it shorter. For example, we can recurse until arriving at an array of length one, instead of short circuiting as in the paper, without violating the O(n) requirement.

For the LCP part, I followed "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" by Kasai et al.

The program outputs 1 0 if the longest common substring is empty.

Here is some development code that includes an earlier version of the program that follows the C++ implementation a bit more closely, some slower approaches for comparison, and a simple test case generator:

from random import *
def brute(a,b):
 L=R=m=0
 
 for i in range(len(a)):
 for j in range(i+m+1,len(a)+1):
 if a[i:j] in b:
 m=j-i
 L,R=i,j
 
 return L+1,R
def suffix_array_slow(s):
 S=[]
 for i in range(len(s)):
 S+=[(s[i:],i)]
 S.sort()
 return [x[1] for x in S]
def slow1(a,b):
 # slow suffix array, slow lcp
 
 s=a+'!'+b
 S=suffix_array_slow(s)
 
 L=R=m=0
 
 for i in range(1,len(S)):
 x=S[i-1]
 y=S[i]
 p=s[x:]+'#'
 q=s[y:]+'$'
 h=0
 while p[h]==q[h]:
 h+=1
 if h>m and len(a)==sorted([x,y,len(a)])[1]:
 m=h
 L=min(x,y)
 R=L+h
 
 return L+1,R
def verify(a,b,L,R):
 if L<1 or R>len(a) or a[L-1:R] not in b:
 return 0
 LL,RR=brute(a,b)
 return R-L==RR-LL
def rand_string():
 if randint(0,1):
 n=randint(0,8)
 else:
 n=randint(0,24)
 a='zyxwvutsrq'[:randint(1,10)]
 s=''
 for _ in range(n):
 s+=choice(a)
 return s
def stress_test(f):
 numtrials=2000
 for trial in range(numtrials):
 a=rand_string()
 b=rand_string()
 L,R=f(a,b)
 if not verify(a,b,L,R):
 LL,RR=brute(a,b)
 print 'failed on',(a,b)
 print 'expected:',LL,RR
 print 'actual:',L,R
 return
 print 'ok'
def slow2(a,b):
 # slow suffix array, linear lcp
 
 s=a+'!'+b+'#'
 S=suffix_array_slow(s)
 
 I=S*1
 for i in range(len(S)):
 I[S[i]]=i
 
 L=R=m=h=0
 
 for i in range(len(S)):
 if I[i]:
 j=S[I[i]-1]
 while s[i+h]==s[j+h]:
 h+=1
 if h>m and len(a)==sorted([i,j,len(a)])[1]:
 m=h
 L=min(i,j)
 R=L+h
 h-=h>0
 
 return L+1,R
def suffix_array(s,K):
 # skew algorithm
 
 n=len(s)
 s+=[0]*3
 n0=(n+2)/3
 n1=(n+1)/3
 n2=n/3
 n02=n0+n2
 adj=n0-n1
 
 def radix_pass(a,o,n=n02):
 c=[0]*(K+3)
 for x in a[:n]:
 c[s[x+o]+1]+=1
 for i in range(K+3):
 c[i]+=c[i-1]
 for x in a[:n]:
 j=s[x+o]
 a[c[j]]=x
 c[j]+=1
 
 A=[x for x in range(n+adj) if x%3]+[0]*3
 
 radix_pass(A,2)
 radix_pass(A,1)
 radix_pass(A,0)
 
 B=[0]*n02
 t=m=0
 
 for x in A[:n02]:
 u=s[x:x+3]
 m+=t<u
 t=u
 B[x/3+x%3/2*n0]=m
 
 A[:n02]=1/n02*[0]or suffix_array(B,m)
 I=A*1
 for i in range(n02):
 I[A[i]]=i+1
 
 B=[3*x for x in A if x<n0]
 radix_pass(B,0,n0)
 
 R=[]
 
 p=0
 t=adj
 while t<n02:
 x=A[t]
 b=x>=n0
 i=(x-b*n0)*3-~b
 j=B[p]
 if p==n0 or ((s[i:i+2],I[A[t]-n0+1])<(s[j:j+2],I[j/3+n0]) if b else (s[i],I[A[t]+n0])<(s[j],I[j/3])):R+=i,;t+=1
 else:R+=j,;p+=1
 
 return R+B[p:n0]
def solve(a,b):
 # linear
 
 s=a+'!'+b+'#'
 S=suffix_array(map(ord,s),128)
 
 I=S*1
 for i in range(len(S)):
 I[S[i]]=i
 
 L=R=m=h=0
 
 for i in range(len(S)):
 if I[i]:
 j=S[I[i]-1]
 while s[i+h]==s[j+h]:
 h+=1
 if h>m and len(a)==sorted([i,j,len(a)])[1]:
 m=h
 L=min(i,j)
 R=L+h
 h-=h>0
 
 return L+1,R
stress_test(solve)

Correct me if I'm wrong, but isn't this actually 739 bytes? I copied into mothereff.in/byte-counter and deleted 2 spaces from lines 6-9, but I'm not sure if that's correct.
Nice answer! You might want to have a look at GSACA a novel linear time SACA from 2016. Reference implementation is 246 lines full of comments (170 without comments) and seems very golfable. You'll find it on github.
@MitchSchwartz I'm currently trying to stay on noPMO, so I can't feel emotions strongly right now (probably due to inbalanced brain chemicals). At the time of reading the code quickly, my syntax golfing motor spotted that, and I don't remember feeling any specific emotions. Did you think of the same thing or why the question? :) Now I'm curious.
@TuukkaX That's an interesting response that I wasn't expecting. Well, I'm not sure if I should be phrasing this in some special way, but the fact that your original comment wasn't actually correct played some part in why I decided to ask. :)

Stack Exchange Network

Longest common substring in linear time

1 Answer 1

Python 2, 646 bytes

Linked

Hot Network Questions

Longest common substring in linear time

1 Answer 1

Python 2, 646 bytes

Linked

Related

Hot Network Questions