Longest word in dictionary that is a subsequence of a given string

Question 1

I am trying out the following former Google interview question.

The Challenge Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

Note: D can appear in any format (list, hash table, prefix tree, etc.

For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"

The words "able" and "ale" are both subsequences of S, but they are shorter than "apple". The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order. The word "kangaroo" is the longest word in D, but it isn't a subsequence of S. Learning objectives This question gives you the chance to practice with algorithms and data structures. It’s also a good example of why careful analysis for Big-O performance is often worthwhile, as is careful exploration of common and worst-case input conditions

My approach uses a greedy algorithm.

Sort D in descending order(longest word first)
Find the index of first character in a word
Scan S from indexOfFirstCharacter to find other characters in the word
If we reach to the end of string and there are still characters remaining to be seen in a word then the word is not found
Repeat 2-4 until we find a word

from collections import deque
D = ["able","ale", "apple", "bale", "kangaroo"]
"""
TC : DlogD 
"""
if len(D) > 1:
 D.sort(key=len,reverse=True)
S = "abppplee"
s_dict_first = {}
"""
TC : O(S) 
"""
for i,s in enumerate(S):
 if s not in s_dict_first: s_dict_first[s] = i
"""
TC : O(1)
"""
#return first index of char
def getindex(char):
 if char in s_dict_first:
 return s_dict_first[char]
 return -1
"""
TC : O(w + w +S) = O(w + s)
"""
def isSubstringPresent(startIndex,word):
 #word is too big to be found in S
 if len(word) - 2 > len(S) - 1 - startIndex:
 return False
 remaining_chars = list(word[1:])
 char_queue = deque()
 for char in remaining_chars:
 char_queue.append(char)
 for i in range(startIndex,len(S)):
 if len(char_queue) == 0:return True 
 if S[i] == char_queue[0]:
 char_queue.popleft()
 return len(char_queue) == 0
"""
TC : O(DlogD + D(w + s)) 
"""
for word in D:
 first_char = word[0]
 indexStart = getindex(first_char) + 1
 if indexStart == 0:continue
 if isSubstringPresent(indexStart,word):
 print(word)
 break;

I am open to suggestion in improving the code / any bug in my time/space complexity analysis or any other useful comments.

Question 2

If I set D = ['abcd', 'bcd'] and S = 'abcd' then this code prints bcd. But the problem says that it should print the longest word that is a substring of the target, so it should print abcd.

Question 3

@GarethRees you are right , i forgot to set the key as len for sorting , that's why it fails.D.sort(key=len,reverse=True)

Question 4

@GarethRees could you please remove the hold as i had fixed the bug (check my edit)

Question 5

Sure, no problem.

Question 6

How about using a regex for S, where all characters are optional by appending a ? to each character:

import re
D = ["able","ale", "apple", "bale", "kangaroo"]
S = "abppplee"
regex = '^' + '?'.join(list(S)) + '?' + '$'
words = sorted(D, key=lambda word: len(word), reverse=True)
for word in words:
 if re.match(regex, word):
 print(word)
 break

Question 7

Did you test this code?

Question 8

the new code should be working @GarethRees

Question 9

How do you define the time and space complexity of a regex solution ?

Question 10

Looks better now, but what if S=''?

Question 11

@GarethRees this is a valid question but it concerns edge cases, the OP was asking about implementation strategies in general. to polish the code for an actual implementation, one could use an if clause.

Question 12

The code can be made quite a bit more readable by moving parts of the logic into a separate class. Also, sorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:

def longest_subsequence(string, words):
 sequences = Sequences(string)
 best = ""
 for word in words:
 if len(word) > len(best) and word in sequences:
 best = word
 return best

Preprocessing has moved into the Sequences constructor and look-ups are now performed by the in operator. The inner workings of the sequences-object don't matter for now, as long as preprocessing is done in linear time and look-ups are fast, i.e. do not require an exhaustive search.

You could just move your preprocessing code (s_dict_first = {}; for i,s in ...) into the constructor of Sequences and your implementation of isSubstringPresent into the in-operator and be done, but we can do better. Instead of locating only the first character using a dictionary and then falling back to an exhaustive search, we can create dictionaries for all characters.

This can be done efficiently by walking the string backwards, starting with an empty dictionary. As we walk, we record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in a list.

class Sequences:
 """Fast lookups for subsequences in a fixed string"""
 def __init__(self, string):
 """Constructor, preprocessing in O(n) time"""
 n = len(string)
 self.next = [None] * (n+1)
 self.next[n] = {}
 for i in range(n,0,-1):
 self.next[i-1] = { **self.next[i], string[i-1]: i }
 def __contains__(self, string):
 """Fast lookup"""
 i = 0
 for char in string:
 if char not in self.next[i]: return False
 i = self.next[i][char]
 return True

If we later want to look up the word apple, we go to the first dictionary and query it for the first occurrence of the character a, which might be at position 10. We then grab the 10th dictionary from the list of dictionaries and query it for a p, so we get the position of the first p that comes after position 10. We continue until either a query fails or the entire word was found.

Question 13

I forgot to set the key as len , that's why its gave wrong result.I didn't know about suffix trees.Can you elaborate more on how a suffix tree is applicable here since we can skip characters.

Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

Question 14

for S = "abppplee" will a suffix tree recognise "apple" even if there is a b after the "a" and an additional "p". ?

Question 15

Read the question again as this is exactly what is needed here.

For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"

Question 16

@user2650277 - Everything works correctly now.

Question 17

The comment says, "preprocessing in \$O(n)\$ time" but I make it \$O(n^2)\$ since at each step the code copies out the previously constructed dictionary and adds a new entry if string[i-1] is different from all previously encountered characters.

jaksz jakszjaksz 212 bronze badges · Answer 1 · 2018-09-14 14:07:34Z

2

\$\begingroup\$

How about using a regex for S, where all characters are optional by appending a ? to each character:

import re
D = ["able","ale", "apple", "bale", "kangaroo"]
S = "abppplee"
regex = '^' + '?'.join(list(S)) + '?' + '$'
words = sorted(D, key=lambda word: len(word), reverse=True)
for word in words:
 if re.match(regex, word):
 print(word)
 break

Share

edited Sep 14, 2018 at 20:25

answered Sep 14, 2018 at 14:07

jaksz's user avatar

jaksz jakszjaksz

212 bronze badges

\$\endgroup\$

6

\$\begingroup\$ Did you test this code? \$\endgroup\$

Gareth Rees
– Gareth Rees

2018年09月14日 20:13:18 +00:00
Commented Sep 14, 2018 at 20:13
\$\begingroup\$ the new code should be working @GarethRees \$\endgroup\$

jaksz
– jaksz

2018年09月14日 20:30:23 +00:00
Commented Sep 14, 2018 at 20:30
\$\begingroup\$ How do you define the time and space complexity of a regex solution ? \$\endgroup\$

user2650277
– user2650277

2018年09月15日 06:43:29 +00:00
Commented Sep 15, 2018 at 6:43
\$\begingroup\$ Looks better now, but what if S=''? \$\endgroup\$

Gareth Rees
– Gareth Rees

2018年09月15日 12:30:40 +00:00
Commented Sep 15, 2018 at 12:30
\$\begingroup\$ @GarethRees this is a valid question but it concerns edge cases, the OP was asking about implementation strategies in general. to polish the code for an actual implementation, one could use an if clause. \$\endgroup\$

jaksz
– jaksz

2018年09月15日 14:31:57 +00:00
Commented Sep 15, 2018 at 14:31

| Show 1 more comment

Rainer P. Rainer P.Rainer P. 3,08512 silver badges18 bronze badges · Answer 2 · 2018-09-13 21:03:51Z

The code can be made quite a bit more readable by moving parts of the logic into a separate class. Also, sorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:

def longest_subsequence(string, words):
 sequences = Sequences(string)
 best = ""
 for word in words:
 if len(word) > len(best) and word in sequences:
 best = word
 return best

Preprocessing has moved into the Sequences constructor and look-ups are now performed by the in operator. The inner workings of the sequences-object don't matter for now, as long as preprocessing is done in linear time and look-ups are fast, i.e. do not require an exhaustive search.

You could just move your preprocessing code (s_dict_first = {}; for i,s in ...) into the constructor of Sequences and your implementation of isSubstringPresent into the in-operator and be done, but we can do better. Instead of locating only the first character using a dictionary and then falling back to an exhaustive search, we can create dictionaries for all characters.

This can be done efficiently by walking the string backwards, starting with an empty dictionary. As we walk, we record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in a list.

class Sequences:
 """Fast lookups for subsequences in a fixed string"""
 def __init__(self, string):
 """Constructor, preprocessing in O(n) time"""
 n = len(string)
 self.next = [None] * (n+1)
 self.next[n] = {}
 for i in range(n,0,-1):
 self.next[i-1] = { **self.next[i], string[i-1]: i }
 def __contains__(self, string):
 """Fast lookup"""
 i = 0
 for char in string:
 if char not in self.next[i]: return False
 i = self.next[i][char]
 return True

If we later want to look up the word apple, we go to the first dictionary and query it for the first occurrence of the character a, which might be at position 10. We then grab the 10th dictionary from the list of dictionaries and query it for a p, so we get the position of the first p that comes after position 10. We continue until either a query fails or the entire word was found.

I forgot to set the key as len , that's why its gave wrong result.I didn't know about suffix trees.Can you elaborate more on how a suffix tree is applicable here since we can skip characters.Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.
for S = "abppplee" will a suffix tree recognise "apple" even if there is a b after the "a" and an additional "p". ?
Read the question again as this is exactly what is needed here.For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"
The comment says, "preprocessing in \$O(n)\$ time" but I make it \$O(n^2)\$ since at each step the code copies out the previously constructed dictionary and adds a new entry if string[i-1] is different from all previously encountered characters.

Stack Exchange Network

Longest word in dictionary that is a subsequence of a given string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Longest word in dictionary that is a subsequence of a given string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions