I am trying out the following former Google interview question.
The Challenge Given a string S and a set of words D, find the longest word in D that is a subsequence of S.
Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.
Note: D can appear in any format (list, hash table, prefix tree, etc.
For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"
The words "able" and "ale" are both subsequences of S, but they are shorter than "apple". The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order. The word "kangaroo" is the longest word in D, but it isn't a subsequence of S. Learning objectives This question gives you the chance to practice with algorithms and data structures. It’s also a good example of why careful analysis for Big-O performance is often worthwhile, as is careful exploration of common and worst-case input conditions
My approach uses a greedy algorithm.
Sort D in descending order(longest word first)
Find the index of first character in a word
Scan S from
indexOfFirstCharacter
to find other characters in the wordIf we reach to the end of string and there are still characters remaining to be seen in a word then the word is not found
Repeat 2-4 until we find a word
from collections import deque
D = ["able","ale", "apple", "bale", "kangaroo"]
"""
TC : DlogD
"""
if len(D) > 1:
D.sort(key=len,reverse=True)
S = "abppplee"
s_dict_first = {}
"""
TC : O(S)
"""
for i,s in enumerate(S):
if s not in s_dict_first: s_dict_first[s] = i
"""
TC : O(1)
"""
#return first index of char
def getindex(char):
if char in s_dict_first:
return s_dict_first[char]
return -1
"""
TC : O(w + w +S) = O(w + s)
"""
def isSubstringPresent(startIndex,word):
#word is too big to be found in S
if len(word) - 2 > len(S) - 1 - startIndex:
return False
remaining_chars = list(word[1:])
char_queue = deque()
for char in remaining_chars:
char_queue.append(char)
for i in range(startIndex,len(S)):
if len(char_queue) == 0:return True
if S[i] == char_queue[0]:
char_queue.popleft()
return len(char_queue) == 0
"""
TC : O(DlogD + D(w + s))
"""
for word in D:
first_char = word[0]
indexStart = getindex(first_char) + 1
if indexStart == 0:continue
if isSubstringPresent(indexStart,word):
print(word)
break;
I am open to suggestion in improving the code / any bug in my time/space complexity analysis or any other useful comments.
2 Answers 2
How about using a regex for S, where all characters are optional by appending a ?
to each character:
import re
D = ["able","ale", "apple", "bale", "kangaroo"]
S = "abppplee"
regex = '^' + '?'.join(list(S)) + '?' + '$'
words = sorted(D, key=lambda word: len(word), reverse=True)
for word in words:
if re.match(regex, word):
print(word)
break
-
\$\begingroup\$ Did you test this code? \$\endgroup\$Gareth Rees– Gareth Rees2018年09月14日 20:13:18 +00:00Commented Sep 14, 2018 at 20:13
-
\$\begingroup\$ the new code should be working @GarethRees \$\endgroup\$jaksz– jaksz2018年09月14日 20:30:23 +00:00Commented Sep 14, 2018 at 20:30
-
\$\begingroup\$ How do you define the time and space complexity of a regex solution ? \$\endgroup\$user2650277– user26502772018年09月15日 06:43:29 +00:00Commented Sep 15, 2018 at 6:43
-
\$\begingroup\$ Looks better now, but what if
S=''
? \$\endgroup\$Gareth Rees– Gareth Rees2018年09月15日 12:30:40 +00:00Commented Sep 15, 2018 at 12:30 -
\$\begingroup\$ @GarethRees this is a valid question but it concerns edge cases, the OP was asking about implementation strategies in general. to polish the code for an actual implementation, one could use an
if
clause. \$\endgroup\$jaksz– jaksz2018年09月15日 14:31:57 +00:00Commented Sep 15, 2018 at 14:31
The code can be made quite a bit more readable by moving parts of the logic into a separate class. Also, sorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:
def longest_subsequence(string, words):
sequences = Sequences(string)
best = ""
for word in words:
if len(word) > len(best) and word in sequences:
best = word
return best
Preprocessing has moved into the Sequences
constructor and look-ups are now performed by the in
operator. The inner workings of the sequences
-object don't matter for now, as long as preprocessing is done in linear time and look-ups are fast, i.e. do not require an exhaustive search.
You could just move your preprocessing code (s_dict_first = {}; for i,s in ...
) into the constructor of Sequences
and your implementation of isSubstringPresent
into the in
-operator and be done, but we can do better. Instead of locating only the first character using a dictionary and then falling back to an exhaustive search, we can create dictionaries for all characters.
This can be done efficiently by walking the string backwards, starting with an empty dictionary. As we walk, we record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in a list.
class Sequences:
"""Fast lookups for subsequences in a fixed string"""
def __init__(self, string):
"""Constructor, preprocessing in O(n) time"""
n = len(string)
self.next = [None] * (n+1)
self.next[n] = {}
for i in range(n,0,-1):
self.next[i-1] = { **self.next[i], string[i-1]: i }
def __contains__(self, string):
"""Fast lookup"""
i = 0
for char in string:
if char not in self.next[i]: return False
i = self.next[i][char]
return True
If we later want to look up the word apple
, we go to the first dictionary and query it for the first occurrence of the character a
, which might be at position 10. We then grab the 10th dictionary from the list of dictionaries and query it for a p
, so we get the position of the first p
that comes after position 10. We continue until either a query fails or the entire word was found.
-
\$\begingroup\$ I forgot to set the key as len , that's why its gave wrong result.I didn't know about suffix trees.Can you elaborate more on how a suffix tree is applicable here since we can skip characters.
Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.
\$\endgroup\$user2650277– user26502772018年09月14日 03:57:24 +00:00Commented Sep 14, 2018 at 3:57 -
\$\begingroup\$ for
S = "abppplee"
will a suffix tree recognise "apple" even if there is a b after the "a" and an additional "p". ? \$\endgroup\$user2650277– user26502772018年09月14日 13:11:39 +00:00Commented Sep 14, 2018 at 13:11 -
\$\begingroup\$ Read the question again as this is exactly what is needed here.
For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"
\$\endgroup\$user2650277– user26502772018年09月14日 13:21:00 +00:00Commented Sep 14, 2018 at 13:21 -
\$\begingroup\$ @user2650277 - Everything works correctly now. \$\endgroup\$Rainer P.– Rainer P.2018年09月14日 18:09:25 +00:00Commented Sep 14, 2018 at 18:09
-
\$\begingroup\$ The comment says, "preprocessing in \$O(n)\$ time" but I make it \$O(n^2)\$ since at each step the code copies out the previously constructed dictionary and adds a new entry if
string[i-1]
is different from all previously encountered characters. \$\endgroup\$Gareth Rees– Gareth Rees2018年09月15日 16:40:16 +00:00Commented Sep 15, 2018 at 16:40
D = ['abcd', 'bcd']
andS = 'abcd'
then this code printsbcd
. But the problem says that it should print the longest word that is a substring of the target, so it should printabcd
. \$\endgroup\$D.sort(key=len,reverse=True)
\$\endgroup\$