Schema:
CREATE TABLE traffic_hit (
id SERIAL NOT NULL PRIMARY KEY,
country VARCHAR(2) NOT NULL,
created TIMESTAMP WITH TIME ZONE NOT NULL,
"unique" BOOLEAN NOT NULL,
user_agent_id INTEGER NULL
);
CREATE TABLE utils_useragent (
id SERIAL NOT NULL PRIMARY KEY,
user_agent_string TEXT NOT NULL UNIQUE,
is_robot BOOLEAN NOT NULL
);
Initial Query:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
utils_useragent.is_robot = FALSE
GROUP BY 1
Indexes:
CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX traffic_hit_created_country_user_agent_id_unique_idx ON traffic_hit (created, country, user_agent_id, "unique");
CREATE INDEX only_robots ON utils_useragent (id) WHERE is_robot = TRUE;
Query plan:
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit."unique"
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_hit."unique"
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Heap Fetches: 40448
-> Hash (cost=10806.55..10806.55 rows=393991 width=4) (actual time=77.531..77.531 rows=393896 loops=1)
Output: utils_useragent.id
Buckets: 524288 Batches: 1 Memory Usage: 17944kB
-> Index Only Scan using utils_useragent_id_idx on public.utils_useragent (cost=0.42..10806.55 rows=393991 width=4) (actual time=0.071..29.285 rows=393896 loops=1)
Output: utils_useragent.id
Heap Fetches: 5932
Planning time: 0.918 ms
Execution time: 2531.195 ms
Data:
There are about 4000 records with is_robot = TRUE and about 395000 records with is_robot = FALSE in the utils_useragent table. The traffic_hit table contains about 12M records for the year 2016.
Goal:
Improve read performance, since the query is used in a reporting application and its speed is important for users.
My approach:
Since there is only a small number of "robots" in the utils_useragent table, it should be faster to use a partial index. Another thing I'd like to use is a multicolumn index to get index-only scans:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
user_agent_id NOT IN (select id from utils_useragent where is_robot = TRUE)
GROUP BY 1
New query plan:
HashAggregate (cost=486612.46..503842.68 rows=1378418 width=20) (actual time=2281.282..2282.627 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
Group Key: (traffic_hit.created)::date
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=275.23..467834.60 rows=2503714 width=5) (actual time=2.223..1922.960 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit."unique"
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 2188625
Heap Fetches: 40448
SubPlan 1
-> Index Only Scan using only_robots on public.utils_useragent (cost=0.28..265.32 rows=3739 width=4) (actual time=0.031..0.682 rows=3763 loops=1)
Output: utils_useragent.id
Heap Fetches: 0
Planning time: 0.510 ms
Execution time: 2297.849 ms
The new query is faster, but the Filter: (NOT (hashed SubPlan 1)) part of the plan confuses me.
Question:
Why is the index not used to filter by user_agent_id? Is it possible to use it for better query performance? Or would some other approach be better?
PostgreSQL version: 9.6.3
2 Answers
It does use the index. It uses the index to build the hash table, which it then uses in the filter. Using an in-memory unshared hash table is going to be faster than using an on-disk shared index.
But why are you repeatedly aggregating millions of rows of data which have not changed in 6 months, if it is performance sensitive? Aggregate it once and store the result. You could use materialized views to do this, or just do it by hand.
You can do partial aggregations, for example aggregate the data grouping by date(created) and whichever other columns you need. Then people can re-aggregate this reduced data set over a specific date range, as long as they are happy with whole-day boundaries, either filtering on those other columns, aggregating over them, or grouping by them. If they want a count, you have to be careful to sum up the counts, not count the counts. If you want an average, you have to be careful to weight that average by the counts, rather than taking an unweighted average of the averages.
And of course if you change your mind about what is or is not a robot, then you would have to re-make your partial aggregation table.
Anyway, the bottleneck is not the NOT IN clause; it is just the raw amount of data you want to process.
Parallel query in the upcoming v10 release of PostgreSQL could help this query.
- Great advice about storing the aggregates. As this is typically a query that is hard to optimize (once you have found the relevant rows, the other conditions give just a small bit of selectivity, like removing a few thousand rows from millions), it is easier to just limit the scope. – András Váczi, Aug 3, 2017
- There are a couple of reasons: users often filter the statistics by the latest 90 or 180 days, in which case the aggregation would have to be recalculated frequently. Also, I didn't mention it, but there could be other columns in the GROUP BY. I'll update the question accordingly. – Stranger6667, Aug 4, 2017
Using NOT EXISTS
SELECT
traffic_hit.created::DATE AS "group_by",
COUNT(*) FILTER(WHERE traffic_hit.unique) AS "unique_visits",
COUNT(*) AS "non_unique_visits"
FROM
traffic_hit AS "traffic_hit"
WHERE traffic_hit.created >= '2016-01-01'
AND traffic_hit.created < '2017-01-01'
AND traffic_hit.country = 'CZ'
AND NOT EXISTS(
SELECT 1
FROM utils_useragent
WHERE is_robot
AND utils_useragent.id = user_agent_id
)
GROUP BY 1
- TBH I wouldn't expect this to make any difference in this very case... – András Váczi, Aug 3, 2017
- This solution has about the same performance as my approach with NOT IN. – Stranger6667, Aug 5, 2017
- Really depends on the use-case - had a similar problem but with a text field + a functional index and got a 10x performance boost replacing NOT IN by NOT EXISTS. – toster-cx, Apr 1, 2022