This challenge is about writing code to solve the following problem.
Given two strings A and B, your code should output the start and end indices of a substring of A with the following properties.
- The substring of A should also match some substring of B.
- There should be no longer substring of A that satisfies the first property.
For example:
A = xxxappleyyyyyyy
B = zapplezzz
The substring apple
with indices 4 8
(indexing from 1) would be a valid output.
Functionality
You can assume the input will be on standard in or in a file in the local directory, that is your choice. The file format will simply be two strings, separated by a new line. The answer should be a full program and not just a function.
I would eventually like to test your code on two substrings taken from the strings in http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .
Score
This is code-golf with a twist. Your code must be run in O(n)
time, where n
is the total length of the input.
Languages and libraries
You can use any language which has a freely available compiler/interpreter/etc. for Linux. You should only use standard open source libraries not designed to solve this task. In case of dispute, I will count this as any library which either comes as standard with your language or one that you can install in a default ubuntu machine from a default repository.
Useful information
There are at least two ways to solve this problem in linear time. One is to first compute the suffix tree and the second is to first compute the suffix array and the LCP array.
- Here is a full and (perhaps over-)detailed explanation of linear time suffix tree construction (it turns out some of the figures are messed up unfortunately). There is also a very nice SO answer about linear time suffix tree construction at https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english . It includes a link to source code too. Another detailed explanation can be found here, this time giving a full solution in C.
- Section 2 of http://www.cs.cmu.edu/~guyb/realworld/papersS04/KaSa03.pdf gives a linear time suffix array construction algorithm and Appendix A has C++ source code. This answer tells you how then to compute the longest common substring https://cs.stackexchange.com/questions/9555/computing-the-longest-common-substring-of-two-strings-using-suffix-arrays . Section 5 of https://courses.csail.mit.edu/6.851/spring12/scribe/lec16.pdf which also has an associated video lecture https://courses.csail.mit.edu/6.851/spring12/lectures/L16.html also explains the same algorithm starting at 1:16:00.
1 Answer 1
Python 2, 646 bytes
G=range;w=raw_input;z=L,m,h=[0]*3
s=w();t=len(s);s+='!%s#'%w();u=len(s);I=z*u
def f(s,n):
def r(o):
b=[[]for _ in s];c=[]
for x in B[:N]:b[s[x+o]]+=x,
map(c.extend,b);B[:N]=c
M=N=n--~n/3;t=n%3%2;B=G(n+t);del B[::3];r(2);u=m=p=r(1)>r(0);N-=n/3
for x in B*1:v=s[x:x+3];m+=u<v;u=v;B[x/3+x%3/2*N]=m
A=1/M*z or f(B+z,M)+z;B=[x*3for x in A if x<N];J=I[r(0):n];C=G(n)
for k in C:b=A[t]/N;a=i,j=A[t]%N*3-~b,B[p];q=p<N<(s[i:i-~b],J[i/3+b+N-b*N])>(s[j+t/M:j-~b],J[j/3+b*N]);C[k]=x=a[q];I[x]=k;p+=q;t+=1-q
return C
S=f(map(ord,s)+z*40,u)
for i in G(u):
h-=h>0;j=S[I[i]-1]
while s[i+h]==s[j+h]:h+=1
if(i<t)==(t<j)<=h>m:m=h;L=min(i,j)
print-~L,L+m
This uses the skew algorithm described in "Simple Linear Work Suffix Array Construction" by Kärkkäinen and Sanders. The C++ implementation included in that paper already feels a little "golfy", but there is still plenty of room for making it shorter. For example, we can recurse until arriving at an array of length one, instead of short circuiting as in the paper, without violating the O(n)
requirement.
For the LCP part, I followed "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" by Kasai et al.
The program outputs 1 0
if the longest common substring is empty.
Here is some development code that includes an earlier version of the program that follows the C++ implementation a bit more closely, some slower approaches for comparison, and a simple test case generator:
from random import *
def brute(a,b):
L=R=m=0
for i in range(len(a)):
for j in range(i+m+1,len(a)+1):
if a[i:j] in b:
m=j-i
L,R=i,j
return L+1,R
def suffix_array_slow(s):
S=[]
for i in range(len(s)):
S+=[(s[i:],i)]
S.sort()
return [x[1] for x in S]
def slow1(a,b):
# slow suffix array, slow lcp
s=a+'!'+b
S=suffix_array_slow(s)
L=R=m=0
for i in range(1,len(S)):
x=S[i-1]
y=S[i]
p=s[x:]+'#'
q=s[y:]+'$'
h=0
while p[h]==q[h]:
h+=1
if h>m and len(a)==sorted([x,y,len(a)])[1]:
m=h
L=min(x,y)
R=L+h
return L+1,R
def verify(a,b,L,R):
if L<1 or R>len(a) or a[L-1:R] not in b:
return 0
LL,RR=brute(a,b)
return R-L==RR-LL
def rand_string():
if randint(0,1):
n=randint(0,8)
else:
n=randint(0,24)
a='zyxwvutsrq'[:randint(1,10)]
s=''
for _ in range(n):
s+=choice(a)
return s
def stress_test(f):
numtrials=2000
for trial in range(numtrials):
a=rand_string()
b=rand_string()
L,R=f(a,b)
if not verify(a,b,L,R):
LL,RR=brute(a,b)
print 'failed on',(a,b)
print 'expected:',LL,RR
print 'actual:',L,R
return
print 'ok'
def slow2(a,b):
# slow suffix array, linear lcp
s=a+'!'+b+'#'
S=suffix_array_slow(s)
I=S*1
for i in range(len(S)):
I[S[i]]=i
L=R=m=h=0
for i in range(len(S)):
if I[i]:
j=S[I[i]-1]
while s[i+h]==s[j+h]:
h+=1
if h>m and len(a)==sorted([i,j,len(a)])[1]:
m=h
L=min(i,j)
R=L+h
h-=h>0
return L+1,R
def suffix_array(s,K):
# skew algorithm
n=len(s)
s+=[0]*3
n0=(n+2)/3
n1=(n+1)/3
n2=n/3
n02=n0+n2
adj=n0-n1
def radix_pass(a,o,n=n02):
c=[0]*(K+3)
for x in a[:n]:
c[s[x+o]+1]+=1
for i in range(K+3):
c[i]+=c[i-1]
for x in a[:n]:
j=s[x+o]
a[c[j]]=x
c[j]+=1
A=[x for x in range(n+adj) if x%3]+[0]*3
radix_pass(A,2)
radix_pass(A,1)
radix_pass(A,0)
B=[0]*n02
t=m=0
for x in A[:n02]:
u=s[x:x+3]
m+=t<u
t=u
B[x/3+x%3/2*n0]=m
A[:n02]=1/n02*[0]or suffix_array(B,m)
I=A*1
for i in range(n02):
I[A[i]]=i+1
B=[3*x for x in A if x<n0]
radix_pass(B,0,n0)
R=[]
p=0
t=adj
while t<n02:
x=A[t]
b=x>=n0
i=(x-b*n0)*3-~b
j=B[p]
if p==n0 or ((s[i:i+2],I[A[t]-n0+1])<(s[j:j+2],I[j/3+n0]) if b else (s[i],I[A[t]+n0])<(s[j],I[j/3])):R+=i,;t+=1
else:R+=j,;p+=1
return R+B[p:n0]
def solve(a,b):
# linear
s=a+'!'+b+'#'
S=suffix_array(map(ord,s),128)
I=S*1
for i in range(len(S)):
I[S[i]]=i
L=R=m=h=0
for i in range(len(S)):
if I[i]:
j=S[I[i]-1]
while s[i+h]==s[j+h]:
h+=1
if h>m and len(a)==sorted([i,j,len(a)])[1]:
m=h
L=min(i,j)
R=L+h
h-=h>0
return L+1,R
stress_test(solve)
-
1\$\begingroup\$ Correct me if I'm wrong, but isn't this actually 739 bytes? I copied into mothereff.in/byte-counter and deleted 2 spaces from lines 6-9, but I'm not sure if that's correct. \$\endgroup\$Patrick Roberts– Patrick Roberts2017年01月26日 00:34:51 +00:00Commented Jan 26, 2017 at 0:34
-
2\$\begingroup\$ @PatrickRoberts Those are tabs. \$\endgroup\$Mitch Schwartz– Mitch Schwartz2017年01月26日 01:12:19 +00:00Commented Jan 26, 2017 at 1:12
-
2\$\begingroup\$ Nice answer! You might want to have a look at GSACA a novel linear time SACA from 2016. Reference implementation is 246 lines full of comments (170 without comments) and seems very golfable. You'll find it on github. \$\endgroup\$Christoph– Christoph2017年02月01日 09:15:41 +00:00Commented Feb 1, 2017 at 9:15
-
1\$\begingroup\$ @MitchSchwartz I'm currently trying to stay on noPMO, so I can't feel emotions strongly right now (probably due to inbalanced brain chemicals). At the time of reading the code quickly, my syntax golfing motor spotted that, and I don't remember feeling any specific emotions. Did you think of the same thing or why the question? :) Now I'm curious. \$\endgroup\$Yytsi– Yytsi2017年02月17日 12:35:24 +00:00Commented Feb 17, 2017 at 12:35
-
1\$\begingroup\$ @TuukkaX That's an interesting response that I wasn't expecting. Well, I'm not sure if I should be phrasing this in some special way, but the fact that your original comment wasn't actually correct played some part in why I decided to ask. :) \$\endgroup\$Mitch Schwartz– Mitch Schwartz2017年02月26日 02:50:44 +00:00Commented Feb 26, 2017 at 2:50
O(n) time
Are you sure it is possible? \$\endgroup\$