Problem
Note: I refer to the mathematical sequences, not the sequences mechanism of PostgreSQL.
I have a table representing sequences of integers. The definition is:
CREATE TABLE sequences
(
id serial NOT NULL,
title character varying(255) NOT NULL,
date date NOT NULL,
sequence integer[] NOT NULL,
CONSTRAINT "PRIM_KEY_SEQUENCES" PRIMARY KEY (id)
);
My goal is to find rows using a given subsequence. That is to say, the rows where the sequence
field is a sequence that contains the given subsequence (in my case, the sequence is ordered).
Example
Suppose the table contains following data:
+----+-------+------------+-------------------------------+
| id | title | date | sequence |
+----+-------+------------+-------------------------------+
| 1 | BG703 | 2004年12月24日 | {1,3,17,25,377,424,242,1234} |
| 2 | BG256 | 2005年05月11日 | {5,7,12,742,225,547,2142,223} |
| 3 | BD404 | 2004年10月13日 | {3,4,12,5698,526} |
| 4 | BK956 | 2004年08月17日 | {12,4,3,17,25,377,456,25} |
+----+-------+------------+-------------------------------+
So if the given subsequence is {12, 742, 225, 547}
, I want to find the row 2.
Similarly, if the given subsequence is {3, 17, 25, 377}
, I want to find the row 1 and the row 4.
Finally, if the given subsequence is {12, 4, 3, 25, 377}
, then there are no rows returned.
Investigations
First, I am not completely certain that represent sequences with an array data type is wise. Although this seems appropriate to the situation; I fear it makes more complicated handling. Perhaps it is better to represent the sequences differently, using a model of relations with another table.
In the same way, I think about expand the sequences using the unnest
array function and then add my search criteria. Nevertheless, the number of terms in the sequence being variable I do not see how to do that.
I know it is also possible to cut my sequence in subsequence using the subarray
function of intarray module but I do not see how it benefit me for my search.
Constraints
Even if at the moment my model is still being developed, the table is intended to be composed of many sequences, between 50,000 and 300,000 rows. So I have a strong performance constraint.
In my example I used relatively small integers. In practice, it is possible that these integers become much larger, up to overflow bigint
. In such a situation, I think the best is to store numbers as strings (since it is not necessary to perform these sequences of mathematical operations). However, opting for this solution, this makes it impossible to use the intarray module, mentioned above.
2 Answers 2
If you are looking for significant performance improvements to dnoeth's answer, consider using a native C-function and creating the appropriate operator.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
return DirectFunctionCall2(_int_contains_sequence,
PG_GETARG_DATUM(1),
PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
int na, nb;
int32 *pa, *pb;
int i, j;
na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
pa = (int32 *) ARR_DATA_PTR(a);
pb = (int32 *) ARR_DATA_PTR(b);
/* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
for (i = 0; i <= na - nb; ++i)
{
for (j = 0; j < nb; ++j)
if (pa[i + j] != pb[j])
break;
if (j == nb)
PG_RETURN_BOOL(true);
}
PG_RETURN_BOOL(false);
}
CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_contains_sequence,
COMMUTATOR = '<@@',
RESTRICT = contsel,
JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_sequence_contained,
COMMUTATOR = '@@>',
RESTRICT = contsel,
JOIN = contjoinsel
);
Now you can filter rows like this.
SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}'
I have conducted a little experiment to find how much faster this solution is.
CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;
EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"
EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"
So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.
-
Sounds interesting, however I use either strings or the type
numeric
to represent my data because they may overflowbigint
. It might be well to edit your answer to match the constraints of the question. Anyway, I will do a comparative performance which I'll post here.mlpo– mlpo2015年08月04日 18:57:09 +00:00Commented Aug 4, 2015 at 18:57 -
I am not sure if it is a good practice to paste large blocks of code into answers as they are supposed to be minimal and verifiable. A generic array version of this function is four times longer and quite cumbersome. I have also tested it with
numeric
andtext
and the improvement ranged from 20 to 50 times depending on the length of arrays.Slonopotamus– Slonopotamus2015年08月05日 04:19:14 +00:00Commented Aug 5, 2015 at 4:19 -
Yes, however it is necessary that the answers answered questions :-). Here, it seems to me that a answer complying with the constraints is interesting (because this aspect is a part of the question). Nonetheless, it may not be necessary to propose a generic version. Just a version with strings or
numeric
.mlpo– mlpo2015年08月05日 11:50:10 +00:00Commented Aug 5, 2015 at 11:50 -
Anyway, I added the version for generic arrays since it would be nearly the same for any variable-length data type. But if you are really concerned about performance, you should stick with fixed-size data types like
bigint
.Slonopotamus– Slonopotamus2015年08月05日 12:02:53 +00:00Commented Aug 5, 2015 at 12:02 -
I would love to do that. The problem is that some of my sequences overflow far beyond
bigint
, so it seems I have no choice. But if you have an idea, I'm interested :).mlpo– mlpo2015年08月05日 12:07:27 +00:00Commented Aug 5, 2015 at 12:07
You can easily find the subsequence when you cast the arrays to strings and replace the curly brackets with commas:
translate(cast(sequence as varchar(10000)), '{}',',,')
{1,3,17,25,377,424,242,1234} -> ',1,3,17,25,377,424,242,1234,'
Do the same for the array you're searching for and add a leading and trailing %
:
'%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'
{3, 17, 25, 377} -> '%,3,17,25,377,%'
Now you compare it using LIKE
:
WHERE translate(cast(sequence as varchar(10000)), '{}',',,')
LIKE '%' || translate(cast(searchedarray as varchar(10000)), '{}',',,') || '%'
Edit:
Fiddle is working again.
If the arrays are normalized into one row per value you can apply set based logic:
CREATE TABLE sequences
( id int NOT NULL,
n int not null,
val numeric not null
);
insert into sequences values( 1, 1,1 );
insert into sequences values( 1, 2,3 );
insert into sequences values( 1, 3,17 );
insert into sequences values( 1, 4,25 );
insert into sequences values( 1, 5,377 );
insert into sequences values( 1, 6,424 );
insert into sequences values( 1, 7,242 );
insert into sequences values( 1, 8,1234 );
insert into sequences values( 2, 1,5 );
insert into sequences values( 2, 2,7 );
insert into sequences values( 2, 3,12 );
insert into sequences values( 2, 4,742 );
insert into sequences values( 2, 5,225 );
insert into sequences values( 2, 6,547 );
insert into sequences values( 2, 7,2142 );
insert into sequences values( 2, 8,223 );
insert into sequences values( 3, 1,3 );
insert into sequences values( 3, 2,4 );
insert into sequences values( 3, 3,12 );
insert into sequences values( 3, 4,5698 );
insert into sequences values( 3, 5,526 );
insert into sequences values( 4, 1,12 );
insert into sequences values( 4, 2,4 );
insert into sequences values( 4, 3,3 );
insert into sequences values( 4, 4,17 );
insert into sequences values( 4, 5,25 );
insert into sequences values( 4, 6,377 );
insert into sequences values( 4, 7,456 );
insert into sequences values( 4, 8,25 );
insert into sequences values( 5, 1,12 );
insert into sequences values( 5, 2,4 );
insert into sequences values( 5, 3,3 );
insert into sequences values( 5, 4,17 );
insert into sequences values( 5, 5,17 );
insert into sequences values( 5, 6,25 );
insert into sequences values( 5, 7,377 );
insert into sequences values( 5, 8,456 );
insert into sequences values( 5, 9,25 );
n
must be sequential, no duplicates, no gaps. Now join on common values and exploit the fact that sequences are sequential :-)
with searched (n,val) as (
VALUES
( 1,3 ),
( 2,17 ),
( 3,25 ),
( 4,377)
)
select seq.id,
-- this will return the same result if the values from both tables are in the same order
-- it's a meaningless dummy, but the same meaningless value for sequential rows
seq.n - s.n as dummy,
seq.val,
seq.n,
s.n
from sequences as seq join searched as s
on seq.val = s.val
order by seq.id, dummy, seq.n;
Finally count the number of rows with the same dummy and check if it's the correct number:
with searched (n,val) as (
VALUES
( 1,3 ),
( 2,17 ),
( 3,25 ),
( 4,377)
)
select distinct seq.id
from sequences as seq join searched as s
on seq.val = s.val
group by
seq.id,
seq.n - s.n
having count(*) = (select count(*) from searched)
;
Try an index on sequences(val,id,n).
-
I also considered this solution afterwards. But I see several problems that seem quite bothersome: first of all I'm afraid that this solution is very inefficient, we must cast each array of each row before making a search pattern. It is possible to consider storing sequences in a
TEXT
field (varchar
is a bad idea in my opinion, sequences can be long, as the numbers, so the size is rather unpredictable), to avoid cast; but it is still not possible to use indexes to improve performances (furthermore use a string field does not seem necessarily judicious, see comment of @CraigRinger above).mlpo– mlpo2015年07月29日 12:57:18 +00:00Commented Jul 29, 2015 at 12:57 -
@mlpo: What's your performance expectation? To be able to use an index you must normalize the sequence into one row per value, apply a Relational Division and finally check if the order is correct. In your example
25
exists twice inid=4
, is this actually possible? How many matches exists in average/maximum for a searched sequence?dnoeth– dnoeth2015年07月29日 13:28:27 +00:00Commented Jul 29, 2015 at 13:28 -
A sequence may contain several times the same number. For example
{1, 1, 1, 1, 12, 2, 2, 12, 12, 1, 1, 5, 4}
is quite possible. Regarding the number of matches, the subsequences used are normally thought to limit the number of results. However, some sequences are very similar, and it can sometimes be interesting to use a subsequence shorter to get more results. I estimate that the number of matches for the majority of cases is between 0 and 100. With always the possibility that occasionally the subsequence match with a lot of sequences when it is short or very common.mlpo– mlpo2015年07月29日 13:39:04 +00:00Commented Jul 29, 2015 at 13:39 -
@mlpo: I added a set-based solution and I would be very interested in some performance comparison :-)dnoeth– dnoeth2015年07月29日 14:44:08 +00:00Commented Jul 29, 2015 at 14:44
-
@ypercube: This was just a quick add to return a more meaningful result :-) Ok, it's horrible, I'll change it.ldnoeth– dnoeth2015年07月29日 15:07:03 +00:00Commented Jul 29, 2015 at 15:07
Explore related questions
See similar questions with these tags.
bigint
you should usenumeric
as the type to store them. It's a lot slower and takes way more room though.numeric
and not a string (text
for example)? I do not need to perform mathematical operations on my sequences.text
, and prevents you from storing bogus non-numeric data. Depends, if you're only doing I/O you might want text to reduce I/O processing.SELECT ARRAY[12, 4, 3, 17, 25, 377, 456, 25] @> ARRAY[12, 4, 3, 25, 377];
will return true, because the order is not considered by this operator.