Find rows where integer sequence contains a given subsequence

Question 1

Problem

Note: I refer to the mathematical sequences, not the sequences mechanism of PostgreSQL.

I have a table representing sequences of integers. The definition is:

CREATE TABLE sequences
(
 id serial NOT NULL,
 title character varying(255) NOT NULL,
 date date NOT NULL,
 sequence integer[] NOT NULL,
 CONSTRAINT "PRIM_KEY_SEQUENCES" PRIMARY KEY (id)
);

My goal is to find rows using a given subsequence. That is to say, the rows where the sequence field is a sequence that contains the given subsequence (in my case, the sequence is ordered).

Example

Suppose the table contains following data:

+----+-------+------------+-------------------------------+
| id | title | date | sequence |
+----+-------+------------+-------------------------------+
| 1 | BG703 | 2004年12月24日 | {1,3,17,25,377,424,242,1234} |
| 2 | BG256 | 2005年05月11日 | {5,7,12,742,225,547,2142,223} |
| 3 | BD404 | 2004年10月13日 | {3,4,12,5698,526} |
| 4 | BK956 | 2004年08月17日 | {12,4,3,17,25,377,456,25} |
+----+-------+------------+-------------------------------+

So if the given subsequence is {12, 742, 225, 547}, I want to find the row 2.

Similarly, if the given subsequence is {3, 17, 25, 377}, I want to find the row 1 and the row 4.

Finally, if the given subsequence is {12, 4, 3, 25, 377}, then there are no rows returned.

Investigations

First, I am not completely certain that represent sequences with an array data type is wise. Although this seems appropriate to the situation; I fear it makes more complicated handling. Perhaps it is better to represent the sequences differently, using a model of relations with another table.

In the same way, I think about expand the sequences using the unnest array function and then add my search criteria. Nevertheless, the number of terms in the sequence being variable I do not see how to do that.

I know it is also possible to cut my sequence in subsequence using the subarray function of intarray module but I do not see how it benefit me for my search.

Constraints

Even if at the moment my model is still being developed, the table is intended to be composed of many sequences, between 50,000 and 300,000 rows. So I have a strong performance constraint.

In my example I used relatively small integers. In practice, it is possible that these integers become much larger, up to overflow bigint. In such a situation, I think the best is to store numbers as strings (since it is not necessary to perform these sequences of mathematical operations). However, opting for this solution, this makes it impossible to use the intarray module, mentioned above.

Question 2

If they can overflow bigint you should use numeric as the type to store them. It's a lot slower and takes way more room though.

Question 3

@CraigRinger Why use numeric and not a string (text for example)? I do not need to perform mathematical operations on my sequences.

Question 4

Because it's more compact and in many ways faster than text, and prevents you from storing bogus non-numeric data. Depends, if you're only doing I/O you might want text to reduce I/O processing.

Question 5

@CraigRinger Indeed, the type is more consistent. Regarding performance, I will test when I had found a way to do my search.

Question 6

@CraigRinger It might work if the order does not matter. But here, the sequences are ordered. Example: SELECT ARRAY[12, 4, 3, 17, 25, 377, 456, 25] @> ARRAY[12, 4, 3, 25, 377]; will return true, because the order is not considered by this operator.

Question 7

If you are looking for significant performance improvements to dnoeth's answer, consider using a native C-function and creating the appropriate operator.

Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).

Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
 return DirectFunctionCall2(_int_contains_sequence,
 PG_GETARG_DATUM(1),
 PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
 ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
 ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
 int na, nb;
 int32 *pa, *pb;
 int i, j;
 na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
 nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
 pa = (int32 *) ARR_DATA_PTR(a);
 pb = (int32 *) ARR_DATA_PTR(b);
 /* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
 for (i = 0; i <= na - nb; ++i)
 {
 for (j = 0; j < nb; ++j)
 if (pa[i + j] != pb[j])
 break;
 if (j == nb)
 PG_RETURN_BOOL(true);
 }
 PG_RETURN_BOOL(false);
}

CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
 LEFTARG = _int4,
 RIGHTARG = _int4,
 PROCEDURE = _int_contains_sequence,
 COMMUTATOR = '<@@',
 RESTRICT = contsel,
 JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
 LEFTARG = _int4,
 RIGHTARG = _int4,
 PROCEDURE = _int_sequence_contained,
 COMMUTATOR = '@@>',
 RESTRICT = contsel,
 JOIN = contjoinsel
);

Now you can filter rows like this.

SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}'

I have conducted a little experiment to find how much faster this solution is.

CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
 CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;

EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
 LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"

EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"

So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.

Question 8

Sounds interesting, however I use either strings or the type numeric to represent my data because they may overflow bigint. It might be well to edit your answer to match the constraints of the question. Anyway, I will do a comparative performance which I'll post here.

Question 9

I am not sure if it is a good practice to paste large blocks of code into answers as they are supposed to be minimal and verifiable. A generic array version of this function is four times longer and quite cumbersome. I have also tested it with numeric and text and the improvement ranged from 20 to 50 times depending on the length of arrays.

Question 10

Yes, however it is necessary that the answers answered questions :-). Here, it seems to me that a answer complying with the constraints is interesting (because this aspect is a part of the question). Nonetheless, it may not be necessary to propose a generic version. Just a version with strings or numeric.

Question 11

Anyway, I added the version for generic arrays since it would be nearly the same for any variable-length data type. But if you are really concerned about performance, you should stick with fixed-size data types like bigint.

Question 12

I would love to do that. The problem is that some of my sequences overflow far beyond bigint, so it seems I have no choice. But if you have an idea, I'm interested :).

Question 13

You can easily find the subsequence when you cast the arrays to strings and replace the curly brackets with commas:

translate(cast(sequence as varchar(10000)), '{}',',,')
{1,3,17,25,377,424,242,1234} -> ',1,3,17,25,377,424,242,1234,'

Do the same for the array you're searching for and add a leading and trailing %:

'%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'
{3, 17, 25, 377} -> '%,3,17,25,377,%'

Now you compare it using LIKE:

WHERE translate(cast(sequence as varchar(10000)), '{}',',,')
 LIKE '%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'

Edit:

Fiddle is working again.

If the arrays are normalized into one row per value you can apply set based logic:

CREATE TABLE sequences
( id int NOT NULL,
 n int not null,
 val numeric not null
);
insert into sequences values( 1, 1,1 );
insert into sequences values( 1, 2,3 );
insert into sequences values( 1, 3,17 );
insert into sequences values( 1, 4,25 );
insert into sequences values( 1, 5,377 );
insert into sequences values( 1, 6,424 );
insert into sequences values( 1, 7,242 );
insert into sequences values( 1, 8,1234 );
insert into sequences values( 2, 1,5 );
insert into sequences values( 2, 2,7 );
insert into sequences values( 2, 3,12 );
insert into sequences values( 2, 4,742 );
insert into sequences values( 2, 5,225 );
insert into sequences values( 2, 6,547 );
insert into sequences values( 2, 7,2142 );
insert into sequences values( 2, 8,223 );
insert into sequences values( 3, 1,3 );
insert into sequences values( 3, 2,4 );
insert into sequences values( 3, 3,12 );
insert into sequences values( 3, 4,5698 );
insert into sequences values( 3, 5,526 ); 
insert into sequences values( 4, 1,12 );
insert into sequences values( 4, 2,4 );
insert into sequences values( 4, 3,3 );
insert into sequences values( 4, 4,17 );
insert into sequences values( 4, 5,25 );
insert into sequences values( 4, 6,377 );
insert into sequences values( 4, 7,456 );
insert into sequences values( 4, 8,25 );
insert into sequences values( 5, 1,12 );
insert into sequences values( 5, 2,4 );
insert into sequences values( 5, 3,3 );
insert into sequences values( 5, 4,17 );
insert into sequences values( 5, 5,17 );
insert into sequences values( 5, 6,25 );
insert into sequences values( 5, 7,377 );
insert into sequences values( 5, 8,456 );
insert into sequences values( 5, 9,25 );

n must be sequential, no duplicates, no gaps. Now join on common values and exploit the fact that sequences are sequential :-)

with searched (n,val) as (
 VALUES
 ( 1,3 ),
 ( 2,17 ),
 ( 3,25 ),
 ( 4,377)
)
select seq.id, 
 -- this will return the same result if the values from both tables are in the same order
 -- it's a meaningless dummy, but the same meaningless value for sequential rows 
 seq.n - s.n as dummy,
 seq.val,
 seq.n,
 s.n 
from sequences as seq join searched as s
on seq.val = s.val
order by seq.id, dummy, seq.n;

Finally count the number of rows with the same dummy and check if it's the correct number:

with searched (n,val) as (
 VALUES
 ( 1,3 ),
 ( 2,17 ),
 ( 3,25 ),
 ( 4,377)
)
select distinct seq.id
from sequences as seq join searched as s
on seq.val = s.val
group by 
 seq.id,
 seq.n - s.n
having count(*) = (select count(*) from searched)
;

Try an index on sequences(val,id,n).

Question 14

I also considered this solution afterwards. But I see several problems that seem quite bothersome: first of all I'm afraid that this solution is very inefficient, we must cast each array of each row before making a search pattern. It is possible to consider storing sequences in a TEXT field (varchar is a bad idea in my opinion, sequences can be long, as the numbers, so the size is rather unpredictable), to avoid cast; but it is still not possible to use indexes to improve performances (furthermore use a string field does not seem necessarily judicious, see comment of @CraigRinger above).

Question 15

@mlpo: What's your performance expectation? To be able to use an index you must normalize the sequence into one row per value, apply a Relational Division and finally check if the order is correct. In your example 25 exists twice in id=4, is this actually possible? How many matches exists in average/maximum for a searched sequence?

Question 16

A sequence may contain several times the same number. For example {1, 1, 1, 1, 12, 2, 2, 12, 12, 1, 1, 5, 4} is quite possible. Regarding the number of matches, the subsequences used are normally thought to limit the number of results. However, some sequences are very similar, and it can sometimes be interesting to use a subsequence shorter to get more results. I estimate that the number of matches for the majority of cases is between 0 and 100. With always the possibility that occasionally the subsequence match with a lot of sequences when it is short or very common.

Question 17

@mlpo: I added a set-based solution and I would be very interested in some performance comparison :-)

Question 18

@ypercube: This was just a quick add to return a more meaningful result :-) Ok, it's horrible, I'll change it.l

Slonopotamus Slonopotamus 3761 gold badge3 silver badges6 bronze badges · Answer 1 · 2015-08-04 10:30:13Z

If you are looking for significant performance improvements to dnoeth's answer, consider using a native C-function and creating the appropriate operator.

Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).

Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
 return DirectFunctionCall2(_int_contains_sequence,
 PG_GETARG_DATUM(1),
 PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
 ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
 ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
 int na, nb;
 int32 *pa, *pb;
 int i, j;
 na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
 nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
 pa = (int32 *) ARR_DATA_PTR(a);
 pb = (int32 *) ARR_DATA_PTR(b);
 /* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
 for (i = 0; i <= na - nb; ++i)
 {
 for (j = 0; j < nb; ++j)
 if (pa[i + j] != pb[j])
 break;
 if (j == nb)
 PG_RETURN_BOOL(true);
 }
 PG_RETURN_BOOL(false);
}

CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
 LEFTARG = _int4,
 RIGHTARG = _int4,
 PROCEDURE = _int_contains_sequence,
 COMMUTATOR = '<@@',
 RESTRICT = contsel,
 JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
 LEFTARG = _int4,
 RIGHTARG = _int4,
 PROCEDURE = _int_sequence_contained,
 COMMUTATOR = '@@>',
 RESTRICT = contsel,
 JOIN = contjoinsel
);

Now you can filter rows like this.

SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}'

I have conducted a little experiment to find how much faster this solution is.

CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
 CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;

EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
 LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"

EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"

So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.

Sounds interesting, however I use either strings or the type numeric to represent my data because they may overflow bigint. It might be well to edit your answer to match the constraints of the question. Anyway, I will do a comparative performance which I'll post here.
I am not sure if it is a good practice to paste large blocks of code into answers as they are supposed to be minimal and verifiable. A generic array version of this function is four times longer and quite cumbersome. I have also tested it with numeric and text and the improvement ranged from 20 to 50 times depending on the length of arrays.
Yes, however it is necessary that the answers answered questions :-). Here, it seems to me that a answer complying with the constraints is interesting (because this aspect is a part of the question). Nonetheless, it may not be necessary to propose a generic version. Just a version with strings or numeric.
Anyway, I added the version for generic arrays since it would be nearly the same for any variable-length data type. But if you are really concerned about performance, you should stick with fixed-size data types like bigint.
I would love to do that. The problem is that some of my sequences overflow far beyond bigint, so it seems I have no choice. But if you have an idea, I'm interested :).

dnoeth dnoeth 4,20613 silver badges14 bronze badges · Answer 2 · 2015-07-29 07:11:13Z

You can easily find the subsequence when you cast the arrays to strings and replace the curly brackets with commas:

translate(cast(sequence as varchar(10000)), '{}',',,')
{1,3,17,25,377,424,242,1234} -> ',1,3,17,25,377,424,242,1234,'

Do the same for the array you're searching for and add a leading and trailing %:

'%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'
{3, 17, 25, 377} -> '%,3,17,25,377,%'

Now you compare it using LIKE:

WHERE translate(cast(sequence as varchar(10000)), '{}',',,')
 LIKE '%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'

Edit:

Fiddle is working again.

If the arrays are normalized into one row per value you can apply set based logic:

CREATE TABLE sequences
( id int NOT NULL,
 n int not null,
 val numeric not null
);
insert into sequences values( 1, 1,1 );
insert into sequences values( 1, 2,3 );
insert into sequences values( 1, 3,17 );
insert into sequences values( 1, 4,25 );
insert into sequences values( 1, 5,377 );
insert into sequences values( 1, 6,424 );
insert into sequences values( 1, 7,242 );
insert into sequences values( 1, 8,1234 );
insert into sequences values( 2, 1,5 );
insert into sequences values( 2, 2,7 );
insert into sequences values( 2, 3,12 );
insert into sequences values( 2, 4,742 );
insert into sequences values( 2, 5,225 );
insert into sequences values( 2, 6,547 );
insert into sequences values( 2, 7,2142 );
insert into sequences values( 2, 8,223 );
insert into sequences values( 3, 1,3 );
insert into sequences values( 3, 2,4 );
insert into sequences values( 3, 3,12 );
insert into sequences values( 3, 4,5698 );
insert into sequences values( 3, 5,526 ); 
insert into sequences values( 4, 1,12 );
insert into sequences values( 4, 2,4 );
insert into sequences values( 4, 3,3 );
insert into sequences values( 4, 4,17 );
insert into sequences values( 4, 5,25 );
insert into sequences values( 4, 6,377 );
insert into sequences values( 4, 7,456 );
insert into sequences values( 4, 8,25 );
insert into sequences values( 5, 1,12 );
insert into sequences values( 5, 2,4 );
insert into sequences values( 5, 3,3 );
insert into sequences values( 5, 4,17 );
insert into sequences values( 5, 5,17 );
insert into sequences values( 5, 6,25 );
insert into sequences values( 5, 7,377 );
insert into sequences values( 5, 8,456 );
insert into sequences values( 5, 9,25 );

n must be sequential, no duplicates, no gaps. Now join on common values and exploit the fact that sequences are sequential :-)

with searched (n,val) as (
 VALUES
 ( 1,3 ),
 ( 2,17 ),
 ( 3,25 ),
 ( 4,377)
)
select seq.id, 
 -- this will return the same result if the values from both tables are in the same order
 -- it's a meaningless dummy, but the same meaningless value for sequential rows 
 seq.n - s.n as dummy,
 seq.val,
 seq.n,
 s.n 
from sequences as seq join searched as s
on seq.val = s.val
order by seq.id, dummy, seq.n;

Finally count the number of rows with the same dummy and check if it's the correct number:

with searched (n,val) as (
 VALUES
 ( 1,3 ),
 ( 2,17 ),
 ( 3,25 ),
 ( 4,377)
)
select distinct seq.id
from sequences as seq join searched as s
on seq.val = s.val
group by 
 seq.id,
 seq.n - s.n
having count(*) = (select count(*) from searched)
;

Try an index on sequences(val,id,n).

I also considered this solution afterwards. But I see several problems that seem quite bothersome: first of all I'm afraid that this solution is very inefficient, we must cast each array of each row before making a search pattern. It is possible to consider storing sequences in a TEXT field (varchar is a bad idea in my opinion, sequences can be long, as the numbers, so the size is rather unpredictable), to avoid cast; but it is still not possible to use indexes to improve performances (furthermore use a string field does not seem necessarily judicious, see comment of @CraigRinger above).
@mlpo: What's your performance expectation? To be able to use an index you must normalize the sequence into one row per value, apply a Relational Division and finally check if the order is correct. In your example 25 exists twice in id=4, is this actually possible? How many matches exists in average/maximum for a searched sequence?
A sequence may contain several times the same number. For example {1, 1, 1, 1, 12, 2, 2, 12, 12, 1, 1, 5, 4} is quite possible. Regarding the number of matches, the subsequences used are normally thought to limit the number of results. However, some sequences are very similar, and it can sometimes be interesting to use a subsequence shorter to get more results. I estimate that the number of matches for the majority of cases is between 0 and 100. With always the possibility that occasionally the subsequence match with a lot of sequences when it is short or very common.
@mlpo: I added a set-based solution and I would be very interested in some performance comparison :-)
@ypercube: This was just a quick add to return a more meaningful result :-) Ok, it's horrible, I'll change it.l

Stack Exchange Network

Find rows where integer sequence contains a given subsequence

Problem

Example

Investigations

Constraints

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find rows where integer sequence contains a given subsequence

Problem

Example

Investigations

Constraints

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions