I have a fairly complex search problem that I've managed to reduce to the following description. I've been googling but haven't been able to find an algorithm that seems to fit my problem cleanly. In particular the need to skip arbitrary integers. Maybe someone here can point me to something?
Take a sequence of integers A, for example (1 2 3 4)
Take various sequences of integers and test if any of them match A such that.
- A contains all of the integers in the tested sequence
- The ordering of the integers in the tested sequence are the same in A
- We don't care about any integers in A that are not in the test sequence
- We want all matching test sequences, not just the first.
An example
A = (1 2 3 4)
B = (1 3)
C = (1 3 4)
D = (3 1)
E = (1 2 5)
B matches A
C matches A
D does not match A as the ordering is different
E does not match A as it contains an integer not in A
I hope that this explanation is clear enough. The best I've managed to do is to contruct a tree of the test sequences and iterate over A. The need to be able to skip integers leads to a lot of unsuccessful search paths.
Thanks
Reading some suggestions I feel that I have to clarify a couple of points that I left too vague.
Repeated numbers are allowed, in fact this is quite important as it allows a single test sequence to match A is multiple ways
A = (1234356), B = (236), matches could be either -23---6 or -2--3-6
I expect there to be a very large number of test sequences, in the thousands at least and sequence A will tend to have a max length of maybe 20. Thus simply trying to match each test sequence one by one by iterating becomes extremely inefficient.
Sorry if this wasn't clear.
-
4You sound as if you simply want to detect subsequences (en.wikipedia.org/wiki/Subsequence). Is that it? Then try searching for "subsequence algorithm".Kilian Foth– Kilian Foth2013年11月12日 12:38:02 +00:00Commented Nov 12, 2013 at 12:38
-
Honestly, a few thousand sequences with max length <= 20 does not sound a large number to me. A simple brute-force approach should do the trick. Or do you have thousands of sequences "A", each one to test against thousands of possible subsequences?Doc Brown– Doc Brown2013年11月12日 15:26:56 +00:00Commented Nov 12, 2013 at 15:26
-
There is a continual stream of sequences A but they are completely independent of each other. However a delay in processing one will directly delay all others, so speed is important.David Gibson– David Gibson2013年11月12日 15:53:30 +00:00Commented Nov 12, 2013 at 15:53
-
1How large is your alphabet? Do you really have arbitrary integers, or is there a finite range of values such that we could do some precalculations?Frank– Frank2013年11月13日 06:19:58 +00:00Commented Nov 13, 2013 at 6:19
-
The possible range of integers is in the 100,000sDavid Gibson– David Gibson2013年11月13日 09:29:57 +00:00Commented Nov 13, 2013 at 9:29
3 Answers 3
Hmm, I can think of two possible algorithms: A linear scan through the A sequence, or building a dictionary with constant-time lookup of the indices.
If you are testing many potential subsequences B against a single larger sequence A, I'd suggest you use the variant with the dictionary.
Linear Scan
Description
We maintain a cursor for the sequence A. Then we iterate through all items in the subsequence B. For each item, we move the cursor forward in A until we have found a matching item. If no matching item was found, then B isn't a subsequence.
This always runs in O(seq.size).
Pseudocode
Imperative style:
def subsequence? seq, subseq:
i = 0
for item in subseq:
i++ while i < seq.size and item != seq[i]
return false if i == seq.size
return true
Functional style:
let rec subsequence? = function
| _ [] -> true
| [] _ -> false
| cursor::seq item::subseq ->
if cursor = item
then subsequence? seq subseq
else subsequence? seq item::subseq
Example implementation (Perl):
use strict; use warnings; use signatures; use Test::More;
sub is_subsequence_i ($seq, $subseq) {
my $i = 0;
for my $item (@$subseq) {
$i++ while $i < @$seq and $item != $seq->[$i];
return 0 if $i == @$seq;
}
return 1;
}
sub is_subsequence_f ($seq, $subseq) {
return 1 if @$subseq == 0;
return 0 if @$seq == 0;
my ($cursor, @seq) = @$seq;
my ($item, @subseq) = @$subseq;
return is_subsequence_f(\@seq, $cursor == $item ? \@subseq : $subseq);
}
my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];
for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
ok $is_subsequence->($A, $B), 'B in A';
ok $is_subsequence->($A, $C), 'C in A';
ok ! $is_subsequence->($A, $D), 'D not in A';
ok ! $is_subsequence->($A, $E), 'E not in A';
ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}
done_testing;
Dictionary lookup
Description
We map the items of the sequence A to their indices. Then we look up suitable indices for each item in B, skip those indices that are to small, and pick the smallest possible index as a lower limit. When no indices are found, then B isn't a subsequence.
Runs in something like O(subseq.size · k), where k describes how many duplicate numbers there are in seq
. Plus an O(seq.size) overhead
The advantage of this solution is that a negative decision can be reached much faster (down to constant time), once you have paid the overhead of building the lookup table.
Pseudocode:
Imperative style:
# preparing the lookup table
dict = {}
for i, x in seq:
if exists dict[x]:
dict[x].append(i)
else:
dict[x] = [i]
def subsequence? subseq:
min_index = -1
for x in subseq:
if indices = dict[x]:
suitable_indices = indices.filter(_ > min_index)
return false if suitable_indices.empty?
min_index = suitable_indices[0]
else:
return false
return true
Functional style:
let subsequence? subseq =
let rec subseq-loop = function
| [] _ -> true
| x::subseq min-index ->
match (map (filter (_ > min-index)) data[x])
| None -> false
| Some([]) -> false
| Some(new-min::_) -> subseq-loop subseq new-min
in
subseq-loop subseq -1
Example implementation (Perl):
use strict; use warnings; use signatures; use Test::More;
sub build_dict ($seq) {
my %dict;
while (my ($i, $x) = each @$seq) {
push @{ $dict{$x} }, $i;
}
return \%dict;
}
sub is_subsequence_i ($seq, $subseq) {
my $min_index = -1;
my $dict = build_dict($seq);
for my $x (@$subseq) {
my $indices = $dict->{$x} or return 0;
($min_index) = grep { $_ > $min_index } @$indices or return 0;
}
return 1;
}
sub is_subsequence_f ($seq, $subseq) {
my $dict = build_dict($seq);
use feature 'current_sub';
return sub ($subseq, $min_index) {
return 1 if @$subseq == 0;
my ($x, @subseq) = @$subseq;
my ($new_min) = grep { $_ > $min_index } @{ $dict->{$x} // [] } or return 0;
__SUB__->(\@subseq, $new_min);
}->($subseq, -1);
}
my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];
for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
ok $is_subsequence->($A, $B), 'B in A';
ok $is_subsequence->($A, $C), 'C in A';
ok ! $is_subsequence->($A, $D), 'D not in A';
ok ! $is_subsequence->($A, $E), 'E not in A';
ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}
done_testing;
Dictionary Lookup Variant: Encoding as a Finite State Machine
Description
We can further reduce algorithmic complexity down to O(subseq.size) if we trade in more memory. Instead of mapping elements to their indices, we create a graph where each node represents an element at its index. The edges show possible transitions, e.g. the sequence a, b, a
would have the edges a@1 → b@2, a@1 → a@3, b@2 → a@3
. This graph is equivalent to a finite state machine.
During lookup we maintain a cursor which initially is the first node of the tree. We then walk the edge for each element in the sublist B. If no such edge exists, then B is no sublist. If after all elements the cursor contains a valid node, then B is a sublist.
Pseudocode
Imperative style:
# preparing the graph
graph = {}
for x in seq.reverse:
next_graph = graph.clone
next_graph[x] = graph
graph = next_graph
def subseq? subseq:
cursor = graph
for x in subseq:
cursor = graph[x]
return false if graph == null
return true
Functional style:
let subseq? subseq =
let rec subseq-loop = function
| [] _ -> true
| x::subseq graph -> match (graph[x])
| None -> false
| Some(next-graph) -> subseq-loop subseq next-graph
in
subseq-loop subseq graph
Example implementation (Perl):
use strict; use warnings; use signatures; use Test::More;
sub build_graph ($seq) {
my $graph = {};
for (reverse @$seq) {
$graph = { %$graph, $_ => $graph };
}
return $graph;
}
sub is_subsequence_i ($seq, $subseq) {
my $cursor = build_graph($seq);
for my $x (@$subseq) {
$cursor = $cursor->{$x} or return 0;
}
return 1;
}
sub is_subsequence_f ($seq, $subseq) {
my $graph = build_graph($seq);
use feature 'current_sub';
return sub ($subseq, $graph) {
return 1 if @$subseq == 0;
my ($x, @subseq) = @$subseq;
my $next_graph = $graph->{$x} or return 0;
__SUB__->(\@subseq, $next_graph);
}->($subseq, $graph);
}
my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];
for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
ok $is_subsequence->($A, $B), 'B in A';
ok $is_subsequence->($A, $C), 'C in A';
ok ! $is_subsequence->($A, $D), 'D not in A';
ok ! $is_subsequence->($A, $E), 'E not in A';
ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}
done_testing;
-
As an aside, have you poked at how
study
works and if the algorithms it applies could have some practical application here?user40980– user409802013年11月12日 18:04:51 +00:00Commented Nov 12, 2013 at 18:04 -
1@MichaelT I'm not sure I understand ... I am an undergrad, but I haven't yet discovered how to actually study </joke>. If you are talking about the Perl builtin function: It's a no-op nowadays. The current implementation is just a dozen lines of backwards compatibility. The regex engine employs such heuristics directly, such as searching for constant strings before matching variable-sized patterns.
study
had previously built character-to-position lookup tables, not unlike my second solution.amon– amon2013年11月12日 19:06:31 +00:00Commented Nov 12, 2013 at 19:06 -
updated with even better algorithmamon– amon2013年11月13日 19:00:53 +00:00Commented Nov 13, 2013 at 19:00
-
Elaborating more on that FSM, you could 'compile' all the test sequences into one FSM and then run through the entire sequence. Depending on which state you were in at the end determines which subsequences were matched. Its certainly the thing that one would rather have the computer do rather than do by hand for any non-trivial one.user40980– user409802013年11月15日 22:37:23 +00:00Commented Nov 15, 2013 at 22:37
-
@MichaelT You are correct that we could build a recognizer this way. However, we are already down to n·O(B) + initialization cost in O(f(A)). Building the trie-like structure of all Bs would take something like O(n·B), with the matching being in O(A). This has a real chance of being cheaper theoretically (building the graph in the 3rd solution can be expensive, but it's just a one-time cost). I think a trie is better suited for A ≫ n·B, and has the disadvantage that it can't handle streaming input – all Bs have to be loaded prior to matching. I'll probably update the answer in 6h.amon– amon2013年11月16日 07:25:43 +00:00Commented Nov 16, 2013 at 7:25
Here is a practical approach which avoids "the hard work" of implementing your own algorithm, and also avoids to "reinvent the wheel": utilize a regular expression engine for the problem.
Just put all numbers of A into a string, and all numbers of B into a string separated by the regular expression (.*)
. Add a ^
character at the beginning and $
at the end. Then let your favorite regular expression engine search for all matches. For example, when
A = (1234356), B = (236)
create a reg exp for B like ^(.*)2(.*)3(.*)6(.*)$
. Now run a global regexp search. To find out at which positions in A your subsequence matches, just check the length of the first 3 submatches.
If your range of integers leaves 0 to 9, you may consider to encode them by single letters first to make this work, or you have to adapt the idea using a separation character.
Of course, the speed of that approach will depend a lot on the speed of the reg exp engine your are using, but there are highly optimized engines available, and I guess it will be hard to implement a faster algorithm "out of the box".
-
One need not go all the way to invoking a regex and its engine. It would be possible to use a simple deterministic finite automata to run it. Its a 'straight line' route through.user40980– user409802013年11月12日 17:40:48 +00:00Commented Nov 12, 2013 at 17:40
-
@MichaelT: well, I don't have a "generic finite automata" library at hand, and the OP did not tell us about the programming language he uses, but regular expressions are available today for almost every serious programming language "out the box". That should make my suggestion very easy to implement, with much less code than, for example, amon's solution. IMHO the OP should give it a try, if it is too slow for him, he could still try if a more complicated solution will serve him better.Doc Brown– Doc Brown2013年11月12日 17:53:58 +00:00Commented Nov 12, 2013 at 17:53
-
You don't need a generic library. All you need is the array of the 'pattern' and a pointer to the index in the array. The index points to the next "looking for" value, and when you read it from the source, increment the index. When you hit the end of the array, you've matched it. If you read the end of the source without reaching the end, you haven't matched it.user40980– user409802013年11月12日 17:58:23 +00:00Commented Nov 12, 2013 at 17:58
-
@MichaelT: then why don't you post a sketch of that algorithm as an answer?Doc Brown– Doc Brown2013年11月12日 18:00:19 +00:00Commented Nov 12, 2013 at 18:00
-
Mostly because its already better answered already - "We maintain a cursor for the sequence A. Then we iterate through all items in the subsequence B. For each item, we move the cursor forward in A until we have found a matching item. If no matching item was found, then B isn't a subsequence."user40980– user409802013年11月12日 18:03:41 +00:00Commented Nov 12, 2013 at 18:03
This algorithm should be quite efficient if getting the length and iterating the sequence is efficient.
- Compare the length of both sequences. Store the longer in
sequence
and the shorter insubsequence
- Start at the beginning of both sequences and loop until the end of
sequence
.- Is the number at the current position of
sequence
equal to to the number at the current postion ofsubsequence
- If yes, move both positions one further
- If not, move only the position of
sequence
one further
- Is the number at the current position of
- Is the position of
subsequence
at the end of thesequence
- If yes, the two sequences match
- If not, the two sequences don't match