We've created the following table:
CREATE TABLE public.influencers
(
id integer NOT NULL DEFAULT nextval('influencers_id_seq'::regclass),
location jsonb,
gender text COLLATE pg_catalog."default",
birthdate timestamp without time zone,
ig jsonb,
contact_info jsonb,
created_at timestamp without time zone DEFAULT now(),
updated_at timestamp without time zone DEFAULT now(),
categories text[] COLLATE pg_catalog."default",
search_field text COLLATE pg_catalog."default",
search_vector tsvector,
ig_updated_at timestamp without time zone,
CONSTRAINT influencers_pkey PRIMARY KEY (id),
CONSTRAINT ig_id_must_exist CHECK (ig ? 'id'::text),
CONSTRAINT ig_username_must_exist CHECK (ig ? 'username'::text)
)
And created the following index in order to filter a field inside ig
that contains integers. This is the index we created:
CREATE INDEX idx_btree_ig_follower_count
ON public.influencers
USING BTREE ((ig->>'follower_count'));
The following query should return 427000 results. Instead, it returns a random number that nothing has to do with what it should return. This is the query:
SELECT count(*)
FROM "influencers"
WHERE ((ig ->> 'follower_count') >= '1000')
AND ((ig ->> 'follower_count') <= '10000')
AND (ig->>'follower_count') IS NOT NULL
We suspect that it is comparing strings (or text) instead of integers and that's why it's returning incomprehensible results.
However, the query that works (return expected rows) but doesn't use the idx_btree_ig_follower_count
index is the following:
SELECT count(*)
FROM "influencers"
WHERE ((ig -> 'follower_count') >= '1000')
AND ((ig -> 'follower_count') <= '10000')
AND (ig->'follower_count') IS NOT NULL;
How can we create the query or the index in order to being able to query efficiently?
1 Answer 1
Try this.. Just out of curiosity..
CREATE INDEX idx_btree_ig_follower_count
ON public.influencers
USING BTREE (CAST(ig->>'follower_count' AS int));
Then in your query, cast to int.
SELECT count(*)
FROM "influencers"
WHERE ((ig ->> 'follower_count')::int >= '1000')
AND ((ig ->> 'follower_count')::int <= '10000');
You don't need to specify IS NOT NULL
if you specify >=
, and <=
Test case
I created some test data with 10 million rows of jsonb, with btree index over it.
CREATE TABLE influences
AS
SELECT
jsonb_build_object('follower_count', trunc(random()*1e7)::int) AS ig
FROM generate_series(1,1e7);
CREATE INDEX ON influences( CAST( (ig->>'follower_count') AS int) );
I'm getting the same query plan you provided,. You can see mine here.
EXPLAIN (ANALYZE,BUFFERS)
SELECT count(*) FROM influences
WHERE (ig->>'follower_count')::int BETWEEN 0 AND 854780;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=130124.36..130124.37 rows=1 width=0) (actual time=482.615..482.615 rows=1 loops=1)
Buffers: shared hit=85674
-> Bitmap Heap Scan on influences (cost=18483.87..127946.94 rows=870969 width=0) (actual time=206.182..426.880 rows=855570 loops=1)
Recheck Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
Heap Blocks: exact=83334
Buffers: shared hit=85674
-> Bitmap Index Scan on influences_int4_idx (cost=0.00..18266.13 rows=870969 width=0) (actual time=182.158..182.158 rows=855570 loops=1)
Index Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
Buffers: shared hit=2340
Planning time: 0.341 ms
Execution time: 482.665 ms
(11 rows)
I'm aggregating far more rows, 855570 in my example, and it's over 10x faster.
I clustered it
CLUSTER influences USING influences_int4_idx ;
And, that halfed the time further to 200ms
EXPLAIN (ANALYZE,BUFFERS) SELECT count(*) FROM influences WHERE (ig->>'follower_count')::int BETWEEN 0 AND 854780;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=130124.66..130124.67 rows=1 width=0) (actual time=211.880..211.881 rows=1 loops=1)
Buffers: shared hit=4 read=9466
-> Bitmap Heap Scan on influences (cost=18483.94..127947.22 rows=870976 width=0) (actual time=69.503..155.670 rows=855570 loops=1)
Recheck Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
Heap Blocks: exact=7130
Buffers: shared hit=4 read=9466
-> Bitmap Index Scan on influences_int4_idx (cost=0.00..18266.20 rows=870976 width=0) (actual time=68.150..68.150 rows=855570 loops=1)
Index Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
Buffers: shared hit=3 read=2337
Planning time: 0.412 ms
Execution time: 211.929 ms
(11 rows)
-
It gives an error when trying to create the index: ERROR: syntax error at or near "::" LINE 3: USING BTREE ((ig->>'follower_count')::int);borjagvo– borjagvo2017年05月17日 16:04:10 +00:00Commented May 17, 2017 at 16:04
-
@borjagvo try again with the update.Evan Carroll– Evan Carroll2017年05月17日 16:07:21 +00:00Commented May 17, 2017 at 16:07
-
This is the result: gist.github.com/bgvo/4d7b75de6d5d9e8d39fbc170990ea974 It took 3 secs.borjagvo– borjagvo2017年05月17日 16:16:01 +00:00Commented May 17, 2017 at 16:16
-
Can you paste the query plan with formatting and also the query, @borjagvoEvan Carroll– Evan Carroll2017年05月17日 17:10:51 +00:00Commented May 17, 2017 at 17:10
-
Sorry, what do you mean with formatting? I just took out spaces at the beginning for each line of the explain. Here it is what PgAdmin returns when running explain: gist.github.com/bgvo/df4d4e2dd8a26c4f3dcdbda9869e2de4 Anything else that you need?borjagvo– borjagvo2017年05月18日 10:59:29 +00:00Commented May 18, 2017 at 10:59
Explore related questions
See similar questions with these tags.
WHERE ((ig ->> 'follower_count')::int >= 1000) ...