List chronological pageviews without adjacent duplicates in PostgreSQL

Question 1

I've got a log of pageviews. Multiple pageviews with the same path may occur next to each other if say the user refreshes the page.

To see the user's journey through the site I'd like to select pageviews for that user but eliminating duplicate adjacent pageviews for clarity.

Example simplified table (time is really a timestamp):

CREATE TABLE pageviews(user_id, time, path)
AS VALUES
 ( 1, 1, '/home' ),
 ( 2, 2, '/home' ),
 ( 1, 3, '/profile' ),
 ( 1, 4, '/profile' ),
 ( 2, 5, '/dashboard' ),
 ( 1, 6, '/home' ),
 ( 2, 7, '/profile' ),
 ( 1, 8, '/profile' );

Now assume there's going to be a WHERE user_id = 1 clause and ORDER BY time ASC so we have:

( 1, 1, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 1, 6, '/home' ),
( 1, 8, '/profile' );

But what I'd like is the duplicate adjacent /profile to be ignored like so (a count of the "grouped" paths would be nice too):

| user_id | time | path | count |
| ------- | ---- | -------- | ----- |
| 1 | 1 | /home | 1 |
| 1 | 4 | /profile | 2 |
| 1 | 6 | /home | 1 |
| 1 | 8 | /profile | 1 |

A simple GROUP BY wouldn't do it as /profile is still in there twice, I only want to "group" adjacent equal values of path.

Also I'm not sure whether the first or last of the equal values should be returned, I don't really mind so whichever makes sense / makes for the simplest query. I'll probably flip it to be DESC in real use but this is how I've written the tables.

Reference I found this post that does basically what I want (included below for permanence). So I'm interested to hear 2 improvements:

Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.
Can a count of the "grouped" values be added easily (possibly using a window function)?

So any elegant query that can achieve both of those things would be ⭐️ I couldn't find any other articles or questions about this but I feel it must be a common thing to do so any other links welcome.

SELECT w1.day, w1.rainy
FROM (SELECT
 w2.day, w2.rainy,
 lead(w2.rainy) OVER (ORDER BY w2.day DESC) as prev_rainy
 FROM weather w2
 ORDER BY w2.day DESC
) as w1
WHERE w1.rainy IS DISTINCT FROM w1.prev_rainy
ORDER BY w1.day DESC;

Question 2

This is called a gaps-and-islands problem.

Question 3

SELECT user_id, max(time) AS time, path, count(*)
FROM (
 SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
 FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
 ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

First what you need to do is create something you can group by, for this we compute is_resets. For more information about why it's in a CASE, see this question In what case is a count(x or null) needed in Gaps and Islands?

SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
 user_id | time | path | is_reset 
---------+------+------------+----------
 1 | 1 | /home | 
 1 | 3 | /profile | 1
 1 | 4 | /profile | 
 1 | 6 | /home | 1
 1 | 8 | /profile | 1
 2 | 2 | /home | 
 2 | 5 | /dashboard | 1
 2 | 7 | /profile | 1
(8 rows)

Then we count them to create groups.

SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
) AS t1
 user_id | time | path | is_reset | grp 
---------+------+------------+----------+-----
 1 | 1 | /home | | 0
 1 | 3 | /profile | 1 | 1
 1 | 4 | /profile | | 1
 1 | 6 | /home | 1 | 2
 1 | 8 | /profile | 1 | 3
 2 | 2 | /home | | 0
 2 | 5 | /dashboard | 1 | 1
 2 | 7 | /profile | 1 | 2
(8 rows)

Now we group by the user_id, and grp. We select the max(time) because you didn't specify which time to choose in your question.

SELECT user_id, max(time) AS time, path, count(*)
FROM (
 SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
 FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
 ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;
 user_id | time | path | count 
---------+------+------------+-------
 1 | 1 | /home | 1
 1 | 4 | /profile | 2
 1 | 6 | /home | 1
 1 | 8 | /profile | 1
 2 | 2 | /home | 1
 2 | 5 | /dashboard | 1
 2 | 7 | /profile | 1
(7 rows)

Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.

This problem can also be solved with dense_rank, as noted. You may want to see my answer to this question for a good write up on how to do that.

Solving "Gaps and Islands" with row_number() and dense_rank()?

Question 4

The answer from Evan Carroll gave me what I needed. I tried the approach he provided (modified to select just one user_id):

SELECT min(time) AS time, path, count(*)
FROM (
 SELECT *, count(is_reset) OVER (ORDER BY time DESC) AS grp
 FROM (
 SELECT *, CASE WHEN path <> lag(path) OVER (ORDER BY time DESC) THEN 1 END AS is_reset
 FROM pageviews WHERE user_id = 1
 ) AS t1
) AS t2
GROUP BY grp, path
ORDER BY time DESC;

And via the links he provided I was finally able to get a window function version to work:

SELECT min(time) AS time, path, count(grp) FROM (
SELECT *, row_number() OVER (ORDER BY time DESC) -
 row_number() OVER (PARTITION BY path ORDER BY time DESC) AS grp
FROM pageviews WHERE user_id = 26455
) i
GROUP BY path, grp
ORDER BY time DESC;

On my real dataset performance is about the same with both taking ~ 0.5s, and the query plans came out similar though not identical. So I used the latter window function version as it's shorter with fewer nested selects, though I think the prior lag version may be easier to understand.

Evan Carroll Evan Carroll 65.7k50 gold badges259 silver badges510 bronze badges · Accepted Answer · 2017-06-12 16:59:00Z

SELECT user_id, max(time) AS time, path, count(*)
FROM (
 SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
 FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
 ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

First what you need to do is create something you can group by, for this we compute is_resets. For more information about why it's in a CASE, see this question In what case is a count(x or null) needed in Gaps and Islands?

SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
 user_id | time | path | is_reset 
---------+------+------------+----------
 1 | 1 | /home | 
 1 | 3 | /profile | 1
 1 | 4 | /profile | 
 1 | 6 | /home | 1
 1 | 8 | /profile | 1
 2 | 2 | /home | 
 2 | 5 | /dashboard | 1
 2 | 7 | /profile | 1
(8 rows)

Then we count them to create groups.

SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
) AS t1
 user_id | time | path | is_reset | grp 
---------+------+------------+----------+-----
 1 | 1 | /home | | 0
 1 | 3 | /profile | 1 | 1
 1 | 4 | /profile | | 1
 1 | 6 | /home | 1 | 2
 1 | 8 | /profile | 1 | 3
 2 | 2 | /home | | 0
 2 | 5 | /dashboard | 1 | 1
 2 | 7 | /profile | 1 | 2
(8 rows)

Now we group by the user_id, and grp. We select the max(time) because you didn't specify which time to choose in your question.

SELECT user_id, max(time) AS time, path, count(*)
FROM (
 SELECT *,
 count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
 FROM (
 SELECT *,
 CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
 FROM pageviews
 ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;
 user_id | time | path | count 
---------+------+------------+-------
 1 | 1 | /home | 1
 1 | 4 | /profile | 2
 1 | 6 | /home | 1
 1 | 8 | /profile | 1
 2 | 2 | /home | 1
 2 | 5 | /dashboard | 1
 2 | 7 | /profile | 1
(7 rows)

Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.

This problem can also be solved with dense_rank, as noted. You may want to see my answer to this question for a good write up on how to do that.

Solving "Gaps and Islands" with row_number() and dense_rank()?

Stack Exchange Network

List chronological pageviews without adjacent duplicates in PostgreSQL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

List chronological pageviews without adjacent duplicates in PostgreSQL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions