21

I've got about a billion rows of data in a table with a name and an integer in the range 1-288. For a given name, every int is unique, and not every possible integer in the range is present--so there are gaps.

This query generates an example case:

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
 ('foo', 3),
 ('foo', 4),
 ('foo', 10),
 ('foo', 11),
 ('foo', 13),
 ('bar', 1),
 ('bar', 2),
 ('bar', 3)
 ) AS baz ("name", "int")

I'd like to generate a lookup table with a row for each name and sequence of contiguous integers. Each such row would contain:

name -- the value of the name column
start -- the first integer in the contiguous sequence
end -- the final value in the contiguous sequence
span -- end - start + 1

This query generates example output for the above example:

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
 ('foo', 10, 11, 2),
 ('foo', 13, 13, 1),
 ('bar', 1, 3, 3)
 ) AS contiguous_ranges ("name", "start", "end", span)

Because I have so many rows, more efficient is better. That said, I only have to run this query once, so it isn't an absolute requirement.

Thanks in advance!

Edit:

I should add that PL/pgSQL solutions are welcome (please explain any Fancy Tricks--I'm still new to PL/pgSQL).

asked Mar 24, 2011 at 18:57
3
  • I would find a way to process the table in small enough chunks (maybe by hashing the "name" into N buckets, or taking the first/last letter of the name), so that a sort fits in memory. It is likely that scanning the table several tables will be faster than letting a sort spill to disk. Once I had that, I would go with using the windowing functions. Also, don't forget to exploit patterns in the data. Maybe most of the "name" actually have a count of 288 values, in which case you could exclude those values from the main process. End of random rambling :) Commented Mar 24, 2011 at 23:13
  • great - and welcome to the site. Did you have any luck with the solutions provided? Commented Apr 27, 2012 at 16:10
  • thank you. I actually changed projects shortly after posting this question (and shortly thereafter, I changed jobs), so I never had a chance to test these solutions. what should I do in terms of selecting an answer in such a case? Commented Apr 27, 2012 at 20:38

7 Answers 7

10

How about using with recursive

test view:

create view v as 
select *
from ( values ('foo', 2),
 ('foo', 3),
 ('foo', 4),
 ('foo', 10),
 ('foo', 11),
 ('foo', 13),
 ('bar', 1),
 ('bar', 2),
 ('bar', 3)
 ) as baz ("name", "int");

query:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
 union all
 select "name", v."int", t.span+1 as span
 from v join t using ("name")
 where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
 from ( select "name", "int", max(span) as span 
 from t
 group by "name", "int" ) z
 group by "name", ("int"-span+1) ) z;

result:

 name | start | end | span
------+-------+-----+------
 foo | 2 | 4 | 3
 foo | 13 | 13 | 1
 bar | 1 | 3 | 3
 foo | 10 | 11 | 2
(4 rows)

I'd be interested to know how that performs on your billion row table.

answered Mar 25, 2011 at 15:45
1
  • If performance is an issue, playing with the settings for work_mem might help to improve performance. Commented Jun 2, 2012 at 15:11
9

You can do it with windowing functions. The basic idea is to use lead and lag windowing functions to pull rows ahead and behind the current row. Then we can compute if we have the start or end of sequence:

create temp view temp_view as
 select
 n,
 val,
 (lead <> val + 1 or lead is null) as islast,
 (lag <> val - 1 or lag is null) as isfirst,
 (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
 from
 (
 select
 n,
 lead(val, 1) over( partition by n order by n, val),
 lag(val, 1) over(partition by n order by n, val ),
 val
 from test
 order by n, val
 ) as t
; 
select * from temp_view;
 n | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
 bar | 1 | f | t | f
 bar | 2 | f | f | f
 bar | 3 | t | f | f
 bar | 24 | t | t | t
 bar | 42 | t | t | t
 foo | 2 | f | t | f
 foo | 3 | f | f | f
 foo | 4 | t | f | f
 foo | 10 | f | t | f
 foo | 11 | t | f | f
 foo | 13 | t | t | t
(11 rows)

(I used a view so the logic will be easier to follow below.) So now we know if the row is a beginning or an end. We have to collapse that into row:

select
 n as "name",
 first,
 coalesce (last, first) as last,
 coalesce (last - first + 1, 1) as span
from
(
 select
 n,
 val as first,
 -- this will not be excellent perf. since were calling the view
 -- for each row sequence found. Changing view into temp table 
 -- will probably help with lots of values.
 (
 select min(val)
 from temp_view as last
 where islast = true
 -- need this since isfirst=true, islast=true on an orphan sequence
 and last.orphan = false
 and first.val < last.val
 and first.n = last.n
 ) as last
 from
 (select * from temp_view where isfirst = true) as first
) as t
;
 name | first | last | span 
------+-------+------+------
 bar | 1 | 3 | 3
 bar | 24 | 24 | 1
 bar | 42 | 42 | 1
 foo | 2 | 4 | 3
 foo | 10 | 11 | 2
 foo | 13 | 13 | 1
(6 rows)

Looks correct to me :)

answered Mar 24, 2011 at 20:17
4

Another window function solution. No idea about efficiency, I've added the execution plan at the end (although with so few rows, it probably is not of much value). If you want to play around: SQL-Fiddle test

Table and data:

CREATE TABLE baz
( name VARCHAR(10) NOT NULL
, i INT NOT NULL
, UNIQUE (name, i)
) ;
INSERT INTO baz
 VALUES 
 ('foo', 2),
 ('foo', 3),
 ('foo', 4),
 ('foo', 10),
 ('foo', 11),
 ('foo', 13),
 ('bar', 1),
 ('bar', 2),
 ('bar', 3)
 ;

Query:

SELECT a.name AS name
 , a.i AS start
 , b.i AS "end"
 , b.i-a.i+1 AS span
FROM
 ( SELECT name, i
 , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
 FROM baz AS a
 WHERE NOT EXISTS
 ( SELECT * 
 FROM baz AS prev
 WHERE prev.name = a.name
 AND prev.i = a.i - 1
 ) 
 ) AS a
 JOIN
 ( SELECT name, i 
 , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
 FROM baz AS a
 WHERE NOT EXISTS
 ( SELECT * 
 FROM baz AS next
 WHERE next.name = a.name
 AND next.i = a.i + 1
 )
 ) AS b
 ON b.name = a.name
 AND b.rn = a.rn
 ; 

Query Plan

Merge Join (cost=442.74..558.76 rows=18 width=46)
Merge Cond: ((a.name)::text = (a.name)::text)
Join Filter: ((row_number() OVER (?)) = (row_number() OVER (?)))
-> WindowAgg (cost=221.37..238.33 rows=848 width=42)
-> Sort (cost=221.37..223.49 rows=848 width=42)
Sort Key: a.name, a.i
-> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
Merge Cond: (((a.name)::text = (prev.name)::text) AND (((a.i - 1)) = prev.i))
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: a.name, ((a.i - 1))
-> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: prev.name, prev.i
-> Seq Scan on baz prev (cost=0.00..21.30 rows=1130 width=42)
-> Materialize (cost=221.37..248.93 rows=848 width=50)
-> WindowAgg (cost=221.37..238.33 rows=848 width=42)
-> Sort (cost=221.37..223.49 rows=848 width=42)
Sort Key: a.name, a.i
-> Merge Anti Join (cost=157.21..180.13 rows=848 width=42)
Merge Cond: (((a.name)::text = (next.name)::text) AND (((a.i + 1)) = next.i))
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: a.name, ((a.i + 1))
-> Seq Scan on baz a (cost=0.00..21.30 rows=1130 width=42)
-> Sort (cost=78.60..81.43 rows=1130 width=42)
Sort Key: next.name, next.i
-> Seq Scan on baz next (cost=0.00..21.30 rows=1130 width=42)
answered Apr 25, 2012 at 19:55
4

On SQL Server, I would add one more column named previousInt:

SELECT *
FROM ( VALUES ('foo', 2, NULL),
 ('foo', 3, 2),
 ('foo', 4, 3),
 ('foo', 10, 4),
 ('foo', 11, 10),
 ('foo', 13, 11),
 ('bar', 1, NULL),
 ('bar', 2, 1),
 ('bar', 3, 2)
 ) AS baz ("name", "int", "previousInt")

I would use a CHECK constraint to make sure that previousInt < int, and a FK constraint (name, previousInt) refer to (name, int), and a couple more constraints to ensure watertight data integrity. That done, selecting gaps is trivial:

SELECT NAME, PreviousInt, Int from YourTable WHERE PreviousInt < Int - 1;

To speed it up, I might create a filtered index that would include only gaps. This means that all your gaps are precomputed, so selects are very fast, and constraints ensure the integrity of your precomputed data. I am using such solutions a lot, they are all over my system.

answered Jun 2, 2012 at 15:07
2

You can look for Tabibitosan Method:

https://community.oracle.com/docs/DOC-915680
http://rwijk.blogspot.com/2014/01/tabibitosan.html
https://www.xaprb.com/blog/2006/03/22/find-contiguous-ranges-with-sql/

Basically:

SQL> create table mytable (nr)
 2 as
 3 select 1 from dual union all
 4 select 2 from dual union all
 5 select 3 from dual union all
 6 select 6 from dual union all
 7 select 7 from dual union all
 8 select 11 from dual union all
 9 select 18 from dual union all
 10 select 19 from dual union all
 11 select 20 from dual union all
 12 select 21 from dual union all
 13 select 22 from dual union all
 14 select 25 from dual
 15 /
 Table created.
 SQL> with tabibitosan as
 2 ( select nr
 3 , nr - row_number() over (order by nr) grp
 4 from mytable
 5 )
 6 select min(nr)
 7 , max(nr)
 8 from tabibitosan
 9 group by grp
10 order by grp
11 /
 MIN(NR) MAX(NR)
---------- ----------
 1 3
 6 7
 11 11
 18 22
 25 25
5 rows selected.

I think this performance better:

SQL> r
 1 select min(nr) as range_start
 2 ,max(nr) as range_end
 3 from (-- our previous query
 4 select nr
 5 ,rownum
 6 ,nr - rownum grp
 7 from (select nr
 8 from mytable
 9 order by 1
 10 )
 11 )
 12 group by grp
 13* order by 1
RANGE_START RANGE_END
----------- ----------
 1 3
 6 7
 11 11
 18 22
 25 25
answered Jul 30, 2018 at 13:25
0
0

a rough plan:

  • Select the minimum for each name, (group by name)
  • Select the minimum2 for each name, where min2> min1 and not exists (subquery: SEL min2-1 ).
  • Sel max val1> min val1 where max val1 < min val2.

Repeat from 2. until no more update happens. From there it gets complicated, Gordian, with grouping over max of mins and min of max. I guess I would go for a programming language.

P.S.: A nice sample table with a few sample values would be fine, which could be used by everyone, so not everybody creates his testdata from scratch.

answered Mar 24, 2011 at 19:13
0

This solution is inspired from nate c's answer using windowing functions and the OVER clause. Interestingly enough, that answer reverts back to subqueries with external references. It is possible to complete the row consolidation using another level of windowing functions. It may not look too pretty, but I assume it's more efficient since it utilizes built-in logic of the powerful windowing functions.

I realized from nate's solution that the initial set of rows already prodcued necessary flags to 1) select the starting & ending range values AND 2) to eliminate the extra rows in between. The query has nested subqueries two deep only due to limitations of the windowing functions, which restrict how column aliases can be used. Logically I could have produced the results with just one nested subquery.

A few other notes: The following is code for SQLite3. The SQLite dialect is derived from postgresql, so it is very similar and might even work unaltered. I added framing restriction to the OVER clauses, since the lag() and lead() functions only need a single-row window, before and after respectively (so there was no need to keep the default set of all preceding rows). I also opted for the names first and last since the word end is reserved.

create temp view test as 
with cte(name, int) AS (
select * from ( values ('foo', 2),
 ('foo', 3),
 ('foo', 4),
 ('foo', 10),
 ('foo', 11),
 ('foo', 13),
 ('bar', 1),
 ('bar', 2),
 ('bar', 3) ))
select * from cte;
SELECT name,
 int AS first, 
 endpoint AS last,
 (endpoint - int + 1) AS span
FROM ( SELECT name, 
 int, 
 CASE WHEN prev <> 1 AND next <> -1 -- orphan
 THEN int
 WHEN next = -1 -- start of range
 THEN lead(int) OVER (PARTITION BY name 
 ORDER BY int 
 ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
 ELSE null END
 AS endpoint
 FROM ( SELECT name, 
 int,
 coalesce(int - lag(int) OVER (PARTITION BY name 
 ORDER BY int 
 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 
 0) AS prev,
 coalesce(int - lead(int) OVER (PARTITION BY name 
 ORDER BY int 
 ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING),
 0) AS next
 FROM test
 ) AS mark_boundaries
 WHERE NOT (prev = 1 AND next = -1) -- discard values within range
 ) as raw_ranges
WHERE endpoint IS NOT null
ORDER BY name, first

The results are just like the other answers, as one expects:

 name | first | last | span
------+-------+------+------
 bar | 1 | 3 | 3
 foo | 2 | 4 | 3
 foo | 10 | 11 | 2
 foo | 13 | 13 | 1
answered Jun 1, 2019 at 7:41

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.