I've got a log of pageviews. Multiple pageviews with the same path may occur next to each other if say the user refreshes the page.
To see the user's journey through the site I'd like to select pageviews for that user but eliminating duplicate adjacent pageviews for clarity.
Example simplified table (time is really a timestamp):
CREATE TABLE pageviews(user_id, time, path)
AS VALUES
( 1, 1, '/home' ),
( 2, 2, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 2, 5, '/dashboard' ),
( 1, 6, '/home' ),
( 2, 7, '/profile' ),
( 1, 8, '/profile' );
Now assume there's going to be a WHERE user_id = 1
clause and ORDER BY time ASC
so we have:
( 1, 1, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 1, 6, '/home' ),
( 1, 8, '/profile' );
But what I'd like is the duplicate adjacent /profile
to be ignored like so (a count of the "grouped" paths would be nice too):
| user_id | time | path | count | | ------- | ---- | -------- | ----- | | 1 | 1 | /home | 1 | | 1 | 4 | /profile | 2 | | 1 | 6 | /home | 1 | | 1 | 8 | /profile | 1 |
A simple GROUP BY
wouldn't do it as /profile
is still in there twice, I only want to "group" adjacent equal values of path
.
Also I'm not sure whether the first or last of the equal values should be returned, I don't really mind so whichever makes sense / makes for the simplest query. I'll probably flip it to be DESC
in real use but this is how I've written the tables.
Reference I found this post that does basically what I want (included below for permanence). So I'm interested to hear 2 improvements:
- Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.
- Can a count of the "grouped" values be added easily (possibly using a window function)?
So any elegant query that can achieve both of those things would be ⭐️ I couldn't find any other articles or questions about this but I feel it must be a common thing to do so any other links welcome.
SELECT w1.day, w1.rainy
FROM (SELECT
w2.day, w2.rainy,
lead(w2.rainy) OVER (ORDER BY w2.day DESC) as prev_rainy
FROM weather w2
ORDER BY w2.day DESC
) as w1
WHERE w1.rainy IS DISTINCT FROM w1.prev_rainy
ORDER BY w1.day DESC;
-
2This is called a gaps-and-islands problem.Evan Carroll– Evan Carroll2017年06月12日 16:44:21 +00:00Commented Jun 12, 2017 at 16:44
2 Answers 2
SELECT user_id, max(time) AS time, path, count(*)
FROM (
SELECT *,
count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
SELECT *,
CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;
First what you need to do is create something you can group by, for this we compute is_resets
. For more information about why it's in a CASE
, see this question In what case is a count(x or null) needed in Gaps and Islands?
SELECT *,
CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
user_id | time | path | is_reset
---------+------+------------+----------
1 | 1 | /home |
1 | 3 | /profile | 1
1 | 4 | /profile |
1 | 6 | /home | 1
1 | 8 | /profile | 1
2 | 2 | /home |
2 | 5 | /dashboard | 1
2 | 7 | /profile | 1
(8 rows)
Then we count them to create groups.
SELECT *,
count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
SELECT *,
CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
) AS t1
user_id | time | path | is_reset | grp
---------+------+------------+----------+-----
1 | 1 | /home | | 0
1 | 3 | /profile | 1 | 1
1 | 4 | /profile | | 1
1 | 6 | /home | 1 | 2
1 | 8 | /profile | 1 | 3
2 | 2 | /home | | 0
2 | 5 | /dashboard | 1 | 1
2 | 7 | /profile | 1 | 2
(8 rows)
Now we group by the user_id, and grp. We select the max(time)
because you didn't specify which time to choose in your question.
SELECT user_id, max(time) AS time, path, count(*)
FROM (
SELECT *,
count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
SELECT *,
CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;
user_id | time | path | count
---------+------+------------+-------
1 | 1 | /home | 1
1 | 4 | /profile | 2
1 | 6 | /home | 1
1 | 8 | /profile | 1
2 | 2 | /home | 1
2 | 5 | /dashboard | 1
2 | 7 | /profile | 1
(7 rows)
Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.
This problem can also be solved with dense_rank, as noted. You may want to see my answer to this question for a good write up on how to do that.
The answer from Evan Carroll gave me what I needed. I tried the approach he provided (modified to select just one user_id
):
SELECT min(time) AS time, path, count(*)
FROM (
SELECT *, count(is_reset) OVER (ORDER BY time DESC) AS grp
FROM (
SELECT *, CASE WHEN path <> lag(path) OVER (ORDER BY time DESC) THEN 1 END AS is_reset
FROM pageviews WHERE user_id = 1
) AS t1
) AS t2
GROUP BY grp, path
ORDER BY time DESC;
And via the links he provided I was finally able to get a window function version to work:
SELECT min(time) AS time, path, count(grp) FROM (
SELECT *, row_number() OVER (ORDER BY time DESC) -
row_number() OVER (PARTITION BY path ORDER BY time DESC) AS grp
FROM pageviews WHERE user_id = 26455
) i
GROUP BY path, grp
ORDER BY time DESC;
On my real dataset performance is about the same with both taking ~ 0.5s, and the query plans came out similar though not identical. So I used the latter window function version as it's shorter with fewer nested selects, though I think the prior lag version may be easier to understand.
Explore related questions
See similar questions with these tags.