I will try to explain my objective with an example which will be easier to understand.
Suppose I have a sentence like "A B C D E F G H".(Each word seperated using single space).
I have a Database like:
B C D E
A B
F G
C D E F G
..
I need to find the maximum number of units from the database that match the sentence. The outcome should be such that "C D E F G" should be disocvered first instead of "B C D E" and after finding "C D E F G" only the remaining part of the sentence should be matched against the database. Any change/suggestion is welcomed.
The maximum match units can be anywhere in the sentence.The problem here is that I could not come up with an algorithm which can accomplish this task.
-
2Maybe you should look how to transform your database into a trie ( not a tree ! )Christophe– Christophe2016年08月08日 11:38:37 +00:00Commented Aug 8, 2016 at 11:38
-
is it part fo the requirements than you can skip letters in the match? ie A C D E F G H would be a matchEwan– Ewan2016年08月16日 15:06:41 +00:00Commented Aug 16, 2016 at 15:06
-
yes they should be skipped.Nagaraju– Nagaraju2016年08月17日 04:29:19 +00:00Commented Aug 17, 2016 at 4:29
-
2You should probably clarify that in the question as it makes a massive differenceEwan– Ewan2016年08月17日 14:33:49 +00:00Commented Aug 17, 2016 at 14:33
3 Answers 3
I would also suggest to pre-process your database into a trie and let me extend the corresponding comment given by Christophe:
Take a look at the Aho-Corasick algorithm for the construction of such a trie. The general idea is to have a tree-based data structure that you can follow.
In your case, let's consider you also had the database entry A B C E
. You start examining the trie with A
(because that's the first character in your query string). There is such a node, because your valid patterns include A B
and A B C E
. You then look at the next character of your query, B
, and find a first complete match A B
. Since you need the longest, you'd want to continue of course. The node you reached via A B
is also the same node that continues on to form A B C E
. So you continue on with C
and then when you want to continue with D
you have no valid pattern with a prefix of A B C D
left.
First off: you could just repeat this process starting at B
and so on for every letter of your query. That'll give you the correct answer via the final maximum over the length of all found patterns. It's not very efficient though, since you can imagine running over the same substrings over and over.
Enter the magic of the shortcut links: Your trie contains a node representing the prefix A B C
(from the A B C E
pattern) and your query matched that prefix. You then know that the query minus the exhausted A
starts with B C
. You also know in advance, that B C D E
has B C
as a prefix. So your trie can be given a shortcut from the node A B C
to the B C
node. You do not need to start over, but can continue directly with D
from your query. And then E
and you matched B C D E
. Since there is no longer match for F
available, you need to rinse and repeat.
Again, the shortcut from the B C D E
node to the C D E
node allows you to recognize C D E F G
within two more steps. By now, you have discovered A B
, B C D E
and C D E F G
while only looking at the input characters A B C D E F G
once.
For completeness, after you matched C D E F G
, you would jump directly to matching F G
. Since your best available shortcut (with the limited example words you gave) is F G
- there is nothing starting with D E F G
and nothing starting with E F G
, so F G
is your best bet and the shortcut points to that node.
Special note: You may have to take special care to recognize sub-pattern words. For example, consider the database containing A B C D E F G
and C D E
and your input is A B C D E F H
. You will end up in the A B C D E F
node and have no more shortcut available, so you'll start from the beginning with H
, but meanwhile you have to take care of having matched C D E
.
This approach, however, requires you to pre-process your entire database up front. On the plus side, you get very fast linear searches (linear in several parameters - you can read up the details on the linked wikipedia page).
In MySQL 5.6 ...
create table tt ( data varchar(20) ); insert into tt values ('B C D E'); insert into tt values ('A B'); insert into tt values ('F G'); insert into tt values ('C D E F G'); insert into tt values ('X'); insert into tt values ('E D C B'); insert into tt values ('B A'); insert into tt values ('G F'); insert into tt values ('G F E D C'); select data,length(data) from tt where 'A B C D E F G' rlike data; | data | length(data) | |-----------|--------------| | B C D E | 7 | | A B | 3 | | F G | 3 | | C D E F G | 9 |
Added query to count maximum number of spaces (and thus maximum number of units matched)
select data,max(length(data)-length(replace(data,' ',''))+1) num_units_matched from tt where 'A B C X D E F G' rlike data group by data order by 2 desc limit 1; | data | num_units_matched | |-----------|-------------------| | A B | 2 |
-
yeah, i was thinking that! i assume the 'words' can be of varying length, but then you can just count the spacesEwan– Ewan2016年08月16日 15:04:14 +00:00Commented Aug 16, 2016 at 15:04
-
Added query to count maximum number of spaces (and thus maximum number of units)blackpen– blackpen2016年08月17日 14:15:47 +00:00Commented Aug 17, 2016 at 14:15
Building on what @Frank wrote, I tried to make a walk-through of matching with trie-dictionary.
The first column shows the input string read so far (and processed). The second shows all the partial matched positions in the trie. The third shows the longest successful matches.
At every iteration, the next input character is catenated to each of the partial matched positions from the previous iteration and used to search the trie for matches. The current input character is also used independently to search the trie for matches (as if the input character were catenated to an empty string).
Input String to be matched: ABCDEFGH
Dictionary (which is turned into a trie for searching)
(1) A
(2) ABC
(3) BC
(4) BCD
(5) BCDE
(6) CD
(7) CDE
(8) CDEF
(9) CDEFG
Input String Positions of partial matches in the Trie Longest matches
============ ================================================= ===============
A A->(1,2)
AB AB->(2) ; B->(3,4,5)
ABC ABC->(2); BC->(3,4,5); C->(6,7,8,9)
ABCD ABCD->(Stop); BCD->(4,5); CD->(6,7,8,9); D->(Stop) ABC
ABCDE BCDE->(5), CDE->(7,8,9); E->(Stop)
ABCDEF BCDEF->(Stop); CDEF->(8,9); F->(Stop) BCDE
ABCDEFG CDEFG->(8); G->(Stop)
ABCDEFGH CDEFGH->(Stop); H->(Stop) CDEFG
-
@Huperniketes, Nice phrasing "as if the input character were catenated".blackpen– blackpen2016年08月19日 16:14:27 +00:00Commented Aug 19, 2016 at 16:14
-
Glad you like my revision. Your answer clearly explained tries and is much appreciated. I didn't understand them before, and I didn't want to corrupt the meaning.Huperniketes– Huperniketes2016年08月22日 10:59:12 +00:00Commented Aug 22, 2016 at 10:59