Schema:
CREATE TABLE traffic_hit (
id SERIAL NOT NULL PRIMARY KEY,
country VARCHAR(2) NOT NULL,
created TIMESTAMP WITH TIME ZONE NOT NULL,
"unique" BOOLEAN NOT NULL,
user_agent_id INTEGER NULL
);
CREATE TABLE utils_useragent (
id SERIAL NOT NULL PRIMARY KEY,
user_agent_string TEXT NOT NULL UNIQUE,
is_robot BOOLEAN NOT NULL
);
Initial Query:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
utils_useragent.is_robot = FALSE
GROUP BY 1
Indexes:
CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX traffic_hit_created_country_user_agent_id_unique_idx ON traffic_hit (created, country, user_agent_id, "unique");
CREATE INDEX only_robots ON utils_useragent (id) WHERE is_robot = TRUE;
Query plan:
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit."unique"
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_hit."unique"
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Heap Fetches: 40448
-> Hash (cost=10806.55..10806.55 rows=393991 width=4) (actual time=77.531..77.531 rows=393896 loops=1)
Output: utils_useragent.id
Buckets: 524288 Batches: 1 Memory Usage: 17944kB
-> Index Only Scan using utils_useragent_id_idx on public.utils_useragent (cost=0.42..10806.55 rows=393991 width=4) (actual time=0.071..29.285 rows=393896 loops=1)
Output: utils_useragent.id
Heap Fetches: 5932
Planning time: 0.918 ms
Execution time: 2531.195 ms
Data:
There are about 4000 records with is_robot = TRUE and about 395000 records with is_robot = FALSE in the utils_useragent table. The traffic_hit table contains about 12M records for the year 2016.
Goal:
Improve read performance, since the query is used in a reporting application and its speed is important for users.
My approach:
Since there is only a small number of "robots" in the utils_useragent table, it should be faster to use a partial index. Another thing I'd like to use is a multicolumn index to get index-only scans:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
user_agent_id NOT IN (select id from utils_useragent where is_robot = TRUE)
GROUP BY 1
New query plan:
HashAggregate (cost=486612.46..503842.68 rows=1378418 width=20) (actual time=2281.282..2282.627 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
Group Key: (traffic_hit.created)::date
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=275.23..467834.60 rows=2503714 width=5) (actual time=2.223..1922.960 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit."unique"
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 2188625
Heap Fetches: 40448
SubPlan 1
-> Index Only Scan using only_robots on public.utils_useragent (cost=0.28..265.32 rows=3739 width=4) (actual time=0.031..0.682 rows=3763 loops=1)
Output: utils_useragent.id
Heap Fetches: 0
Planning time: 0.510 ms
Execution time: 2297.849 ms
The new query is faster, but the Filter: (NOT (hashed SubPlan 1)) part of the plan confuses me.
Question:
Why is the index not used to filter by user_agent_id? Is it possible to use it for better query performance? Or would some other approach be better?
PostgreSQL version: 9.6.3
2 Answers
It does use the index. It uses the index to build the hash table, which it then uses in the filter. Using an in-memory unshared hash table is going to be faster than using an on-disk shared index.
But why are you repeatedly aggregating millions of rows of data which have not changed in 6 months, if it is performance sensitive? Aggregate it once and store the result. You could use materialized views to do this, or just do it by hand.
You can do partial aggregations, for example aggregate the data grouping by date(created) and whichever other columns you need. Then people can re-aggregate this reduced data set over a specific date range, as long as they are happy with whole-day boundaries, either filtering on those other columns, aggregating over them, or grouping by them. If they want a count, you have to be careful to sum up the counts, not count the counts. If you want an average, you have to be careful to weight that average by the counts, rather than taking an unweighted average of the averages.
And of course if you change your mind about what is or is not a robot, then you would have to re-make your partial aggregation table.
Anyway, the bottleneck is not the NOT IN clause; it is just the raw amount of data you want to process.
Parallel query in the upcoming v10 release of PostgreSQL could help this query.
- Great advice about storing the aggregates. As this is typically a query that is hard to optimize (once you have found the relevant rows, the other conditions give just a small bit of selectivity, like removing a few thousand rows from millions), it is easier to just limit the scope. – András Váczi, Aug 3, 2017
- There are a couple of reasons: users often filter the statistics by the latest 90 or 180 days, in which case the aggregation would have to be recalculated frequently. Also, I didn't mention it, but there could be other columns in the GROUP BY. I'll update the question accordingly. – Stranger6667, Aug 4, 2017
Using NOT EXISTS
SELECT
traffic_hit.created::DATE AS "group_by",
COUNT(*) FILTER(WHERE traffic_hit.unique) AS "unique_visits",
COUNT(*) AS "non_unique_visits"
FROM
traffic_hit AS "traffic_hit"
WHERE traffic_hit.created >= '2016-01-01'
AND traffic_hit.created < '2017-01-01'
AND traffic_hit.country = 'CZ'
AND NOT EXISTS(
SELECT 1
FROM utils_useragent
WHERE is_robot
AND utils_useragent.id = user_agent_id
)
GROUP BY 1
- TBH I wouldn't expect this to make any difference in this very case... – András Váczi, Aug 3, 2017
- This solution has about the same performance as my approach with NOT IN. – Stranger6667, Aug 5, 2017
- Really depends on the use-case - had a similar problem but with a text field + a functional index and got a 10x performance boost replacing NOT IN by NOT EXISTS. – toster-cx, Apr 1, 2022