I've got about a billion rows of data in a table with a name and an integer in the range 1-288. For a given name, every int is unique, and not every possible integer in the range is present--so there are gaps.
This query generates an example case:
--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3)
) AS baz ("name", "int")
I'd like to generate a lookup table with a row for each name and sequence of contiguous integers. Each such row would contain:
name -- the value of the name column
start -- the first integer in the contiguous sequence
end -- the final value in the contiguous sequence
span -- end - start + 1
This query generates example output for the above example:
--what I need:
SELECT *
FROM ( VALUES ('foo', 2, 4, 3),
('foo', 10, 11, 2),
('foo', 13, 13, 1),
('bar', 1, 3, 3)
) AS contiguous_ranges ("name", "start", "end", span)
Because I have so many rows, more efficient is better. That said, I only have to run this query once, so it isn't an absolute requirement.
Thanks in advance!
Edit:
I should add that PL/pgSQL solutions are welcome (please explain any Fancy Tricks--I'm still new to PL/pgSQL).
-
I would find a way to process the table in small enough chunks (maybe by hashing the "name" into N buckets, or taking the first/last letter of the name), so that a sort fits in memory. It is likely that scanning the table several tables will be faster than letting a sort spill to disk. Once I had that, I would go with using the windowing functions. Also, don't forget to exploit patterns in the data. Maybe most of the "name" actually have a count of 288 values, in which case you could exclude those values from the main process. End of random rambling :)Ronnis– Ronnis2011年03月24日 23:13:53 +00:00Commented Mar 24, 2011 at 23:13
-
great - and welcome to the site. Did you have any luck with the solutions provided?Jack Douglas– Jack Douglas2012年04月27日 16:10:46 +00:00Commented Apr 27, 2012 at 16:10
-
thank you. I actually changed projects shortly after posting this question (and shortly thereafter, I changed jobs), so I never had a chance to test these solutions. what should I do in terms of selecting an answer in such a case?Stew– Stew2012年04月27日 20:38:07 +00:00Commented Apr 27, 2012 at 20:38
7 Answers 7
How about using with recursive
test view:
create view v as
select *
from ( values ('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3)
) as baz ("name", "int");
query:
with recursive t("name", "int") as ( select "name", "int", 1 as span from v
union all
select "name", v."int", t.span+1 as span
from v join t using ("name")
where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
from ( select "name", "int", max(span) as span
from t
group by "name", "int" ) z
group by "name", ("int"-span+1) ) z;
result:
name | start | end | span
------+-------+-----+------
foo | 2 | 4 | 3
foo | 13 | 13 | 1
bar | 1 | 3 | 3
foo | 10 | 11 | 2
(4 rows)
I'd be interested to know how that performs on your billion row table.
-
If performance is an issue, playing with the settings for work_mem might help to improve performance.Frank Heikens– Frank Heikens2012年06月02日 15:11:20 +00:00Commented Jun 2, 2012 at 15:11
You can do it with windowing functions. The basic idea is to use lead
and lag
windowing functions to pull rows ahead and behind the current row. Then we can compute if we have the start or end of sequence:
create temp view temp_view as
select
n,
val,
(lead <> val + 1 or lead is null) as islast,
(lag <> val - 1 or lag is null) as isfirst,
(lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
from
(
select
n,
lead(val, 1) over( partition by n order by n, val),
lag(val, 1) over(partition by n order by n, val ),
val
from test
order by n, val
) as t
;
select * from temp_view;
n | val | islast | isfirst | orphan
-----+-----+--------+---------+--------
bar | 1 | f | t | f
bar | 2 | f | f | f
bar | 3 | t | f | f
bar | 24 | t | t | t
bar | 42 | t | t | t
foo | 2 | f | t | f
foo | 3 | f | f | f
foo | 4 | t | f | f
foo | 10 | f | t | f
foo | 11 | t | f | f
foo | 13 | t | t | t
(11 rows)
(I used a view so the logic will be easier to follow below.) So now we know if the row is a beginning or an end. We have to collapse that into row:
select
n as "name",
first,
coalesce (last, first) as last,
coalesce (last - first + 1, 1) as span
from
(
select
n,
val as first,
-- this will not be excellent perf. since were calling the view
-- for each row sequence found. Changing view into temp table
-- will probably help with lots of values.
(
select min(val)
from temp_view as last
where islast = true
-- need this since isfirst=true, islast=true on an orphan sequence
and last.orphan = false
and first.val < last.val
and first.n = last.n
) as last
from
(select * from temp_view where isfirst = true) as first
) as t
;
name | first | last | span
------+-------+------+------
bar | 1 | 3 | 3
bar | 24 | 24 | 1
bar | 42 | 42 | 1
foo | 2 | 4 | 3
foo | 10 | 11 | 2
foo | 13 | 13 | 1
(6 rows)
Looks correct to me :)
Another window function solution. No idea about efficiency, I've added the execution plan at the end (although with so few rows, it probably is not of much value). If you want to play around: SQL-Fiddle test
Table and data:
CREATE TABLE baz
( name VARCHAR(10) NOT NULL
, i INT NOT NULL
, UNIQUE (name, i)
) ;
INSERT INTO baz
VALUES
('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3)
;
Query:
SELECT a.name AS name
, a.i AS start
, b.i AS "end"
, b.i-a.i+1 AS span
FROM
( SELECT name, i
, ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
FROM baz AS a
WHERE NOT EXISTS
( SELECT *
FROM baz AS prev
WHERE prev.name = a.name
AND prev.i = a.i - 1
)
) AS a
JOIN
( SELECT name, i
, ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
FROM baz AS a
WHERE NOT EXISTS
( SELECT *
FROM baz AS next
WHERE next.name = a.name
AND next.i = a.i + 1
)
) AS b
ON b.name = a.name
AND b.rn = a.rn
;
Query Plan
Merge Join (cost=442.74..558.76 rows=18 width=46)
Merge Cond: ((a.name)::text = (a.name)::text)
Join Filter: ((row_number() OVER (?)) = (row_number() OVER (?)))
-> WindowAgg (cost=221.37..238.33 rows=848 width=42)
-> Sort (cost=221.37..223.49 rows=848 width=42)
Sort Key: a.name, a.i
-> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
Merge Cond: (((a.name)::text = (prev.name)::text) AND (((a.i - 1)) = prev.i))
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: a.name, ((a.i - 1))
-> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: prev.name, prev.i
-> Seq Scan on baz prev (cost=0.00..21.30 rows=1130 width=42)
-> Materialize (cost=221.37..248.93 rows=848 width=50)
-> WindowAgg (cost=221.37..238.33 rows=848 width=42)
-> Sort (cost=221.37..223.49 rows=848 width=42)
Sort Key: a.name, a.i
-> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
Merge Cond: (((a.name)::text = (next.name)::text) AND (((a.i + 1)) = next.i))
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: a.name, ((a.i + 1))
-> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: next.name, next.i
-> Seq Scan on baz next (cost=0.00..21.30 rows=1130 width=42)
On SQL Server, I would add one more column named previousInt:
SELECT *
FROM ( VALUES ('foo', 2, NULL),
('foo', 3, 2),
('foo', 4, 3),
('foo', 10, 4),
('foo', 11, 10),
('foo', 13, 11),
('bar', 1, NULL),
('bar', 2, 1),
('bar', 3, 2)
) AS baz ("name", "int", "previousInt")
I would use a CHECK constraint to make sure that previousInt < int, and a FK constraint (name, previousInt) refer to (name, int), and a couple more constraints to ensure watertight data integrity. That done, selecting gaps is trivial:
SELECT NAME, PreviousInt, Int from YourTable WHERE PreviousInt < Int - 1;
To speed it up, I might create a filtered index that would include only gaps. This means that all your gaps are precomputed, so selects are very fast, and constraints ensure the integrity of your precomputed data. I am using such solutions a lot, they are all over my system.
You can look for Tabibitosan Method:
https://community.oracle.com/docs/DOC-915680
http://rwijk.blogspot.com/2014/01/tabibitosan.html
https://www.xaprb.com/blog/2006/03/22/find-contiguous-ranges-with-sql/
Basically:
SQL> create table mytable (nr)
2 as
3 select 1 from dual union all
4 select 2 from dual union all
5 select 3 from dual union all
6 select 6 from dual union all
7 select 7 from dual union all
8 select 11 from dual union all
9 select 18 from dual union all
10 select 19 from dual union all
11 select 20 from dual union all
12 select 21 from dual union all
13 select 22 from dual union all
14 select 25 from dual
15 /
Table created.
SQL> with tabibitosan as
2 ( select nr
3 , nr - row_number() over (order by nr) grp
4 from mytable
5 )
6 select min(nr)
7 , max(nr)
8 from tabibitosan
9 group by grp
10 order by grp
11 /
MIN(NR) MAX(NR)
---------- ----------
1 3
6 7
11 11
18 22
25 25
5 rows selected.
I think this performance better:
SQL> r
1 select min(nr) as range_start
2 ,max(nr) as range_end
3 from (-- our previous query
4 select nr
5 ,rownum
6 ,nr - rownum grp
7 from (select nr
8 from mytable
9 order by 1
10 )
11 )
12 group by grp
13* order by 1
RANGE_START RANGE_END
----------- ----------
1 3
6 7
11 11
18 22
25 25
a rough plan:
- Select the minimum for each name, (group by name)
- Select the minimum2 for each name, where min2> min1 and not exists (subquery: SEL min2-1 ).
- Sel max val1> min val1 where max val1 < min val2.
Repeat from 2. until no more update happens. From there it gets complicated, Gordian, with grouping over max of mins and min of max. I guess I would go for a programming language.
P.S.: A nice sample table with a few sample values would be fine, which could be used by everyone, so not everybody creates his testdata from scratch.
This solution is inspired from nate c's answer using windowing functions and the OVER clause. Interestingly enough, that answer reverts back to subqueries with external references. It is possible to complete the row consolidation using another level of windowing functions. It may not look too pretty, but I assume it's more efficient since it utilizes built-in logic of the powerful windowing functions.
I realized from nate's solution that the initial set of rows already prodcued necessary flags to 1) select the starting & ending range values AND 2) to eliminate the extra rows in between. The query has nested subqueries two deep only due to limitations of the windowing functions, which restrict how column aliases can be used. Logically I could have produced the results with just one nested subquery.
A few other notes: The following is code for SQLite3. The SQLite dialect is derived from postgresql, so it is very similar and might even work unaltered. I added framing restriction to the OVER clauses, since the lag()
and lead()
functions only need a single-row window, before and after respectively (so there was no need to keep the default set of all preceding rows). I also opted for the names first
and last
since the word end
is reserved.
create temp view test as
with cte(name, int) AS (
select * from ( values ('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3) ))
select * from cte;
SELECT name,
int AS first,
endpoint AS last,
(endpoint - int + 1) AS span
FROM ( SELECT name,
int,
CASE WHEN prev <> 1 AND next <> -1 -- orphan
THEN int
WHEN next = -1 -- start of range
THEN lead(int) OVER (PARTITION BY name
ORDER BY int
ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
ELSE null END
AS endpoint
FROM ( SELECT name,
int,
coalesce(int - lag(int) OVER (PARTITION BY name
ORDER BY int
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW),
0) AS prev,
coalesce(int - lead(int) OVER (PARTITION BY name
ORDER BY int
ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING),
0) AS next
FROM test
) AS mark_boundaries
WHERE NOT (prev = 1 AND next = -1) -- discard values within range
) as raw_ranges
WHERE endpoint IS NOT null
ORDER BY name, first
The results are just like the other answers, as one expects:
name | first | last | span
------+-------+------+------
bar | 1 | 3 | 3
foo | 2 | 4 | 3
foo | 10 | 11 | 2
foo | 13 | 13 | 1