I have a table defined as:
CREATE TABLE addresses (
    "address" varchar(42),
    "other_info1" text,
    "other_info2" bigint
);
and I have some manual corrections to that table defined as:
CREATE TABLE address_corrections (
    "address" varchar(42),
    "other_info1" text,
    "other_info2" bigint
);
I have a simple union + dedup query that I persist as a materialized view, corrected_addresses, with the same schema as the two tables above. The materialized view is about the same size as the original addresses table, roughly 400k rows. I have an index on the materialized view that exactly matches the index on the original addresses table:
CREATE INDEX corrected_addresses_address_idx ON corrected_addresses(address);
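(For reference, a simplified sketch of the kind of union + dedup view I mean; this is not necessarily the exact definition, but it illustrates corrections taking precedence over the original rows:)

```sql
-- Sketch only: dedup by address, preferring a correction row over the original.
CREATE MATERIALIZED VIEW corrected_addresses AS
SELECT DISTINCT ON (address) address, other_info1, other_info2
FROM (
    SELECT 1 AS priority, address, other_info1, other_info2 FROM address_corrections
    UNION ALL
    SELECT 2 AS priority, address, other_info1, other_info2 FROM addresses
) u
ORDER BY address, priority;
```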
Now, when querying some other tables, I want to join on this table to get access to other_info1 and other_info2. Simple enough. However, this join does not appear to use the index on address. The query looks like this:
EXPLAIN (analyze, buffers, format text)
SELECT transfer.from_name, transfer.to_name, transfer.value,
transfer.address, address.other_info1, address.other_info2
FROM transfers transfer
LEFT JOIN corrected_addresses address ON transfer.address=address.address
WHERE transfer.from_name='some name here';
with either addresses or corrected_addresses as the join table.
Here is the result for the original addresses table:
Gather  (cost=1205.50..43489.92 rows=6241 width=173) (actual time=1.422..12.292 rows=2 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=12
  ->  Nested Loop Left Join  (cost=205.50..41865.82 rows=2600 width=173) (actual time=0.184..0.198 rows=1 loops=3)
        Buffers: shared hit=12
        ->  Parallel Bitmap Heap Scan on transfers transfer  (cost=205.07..25042.94 rows=2600 width=204) (actual time=0.176..0.180 rows=1 loops=3)
              Recheck Cond: ((from_name)::text = 'some name here'::text)
              Heap Blocks: exact=2
              Buffers: shared hit=8
              ->  Bitmap Index Scan on transfers_from_name_timestamp_idx  (cost=0.00..203.51 rows=6241 width=0) (actual time=0.400..0.401 rows=2 loops=1)
                    Index Cond: ((from_name)::text = 'some name here'::text)
                    Buffers: shared hit=6
        ->  Memoize  (cost=0.43..7.99 rows=1 width=55) (actual time=0.017..0.018 rows=0 loops=2)
              Cache Key: transfer.address
              Cache Mode: logical
              Hits: 0  Misses: 2  Evictions: 0  Overflows: 0  Memory Usage: 1kB
              Buffers: shared hit=4
              ->  Index Scan using addresses_idx on addresses a  (cost=0.42..7.98 rows=1 width=55) (actual time=0.013..0.013 rows=0 loops=2)
                    Index Cond: ((address)::text = (transfer.address)::text)
                    Buffers: shared hit=4
Planning:
  Buffers: shared hit=334 dirtied=1
Planning Time: 5.692 ms
Execution Time: 12.601 ms
And for the materialized view:
Gather  (cost=10501.47..35970.26 rows=6241 width=173) (actual time=100.881..111.703 rows=2 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=5327
  ->  Parallel Hash Left Join  (cost=9501.47..34346.16 rows=2600 width=173) (actual time=94.729..95.259 rows=1 loops=3)
        Hash Cond: ((transfer.address)::text = (a.address)::text)
        Buffers: shared hit=5327
        ->  Parallel Bitmap Heap Scan on transfers transfer  (cost=205.07..25042.94 rows=2600 width=204) (actual time=0.162..0.164 rows=1 loops=3)
              Recheck Cond: ((from_name)::text = 'some name here'::text)
              Heap Blocks: exact=1
              Buffers: shared hit=8
              ->  Bitmap Index Scan on transfers_from_name_timestamp_idx  (cost=0.00..203.51 rows=6241 width=0) (actual time=0.070..0.071 rows=2 loops=1)
                    Index Cond: ((from_name)::text = 'some name here'::text)
                    Buffers: shared hit=6
        ->  Parallel Hash  (cost=7051.18..7051.18 rows=179618 width=55) (actual time=92.901..92.901 rows=143694 loops=3)
              Buckets: 524288  Batches: 1  Memory Usage: 41856kB
              Buffers: shared hit=5255
              ->  Parallel Seq Scan on corrected_addresses a  (cost=0.00..7051.18 rows=179618 width=55) (actual time=0.006..27.550 rows=143694 loops=3)
                    Buffers: shared hit=5255
Planning:
  Buffers: shared hit=37
Planning Time: 0.482 ms
Execution Time: 111.758 ms
Both of those execution times are small; I believe that's partly due to caching. On a fresh run the difference is ~100 ms vs. a few seconds. Sorry I couldn't capture an example of that here.
Does anyone know why Postgres is not using the index for the materialized view?
2 Answers
The horrible estimate for the transfers condition (2600 rows expected, 1 found) made the costs of the hash join and the nested loop appear falsely similar.
Why does a 2600-fold misestimate cause one query to use a different plan but not the other? No idea; probably just luck. I doubt it is even reproducible, much less interesting. Fix the root problem.
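As a quick sanity check (a diagnostic only, not a fix), you can disable hash joins for the session, which forces the planner to cost the nested-loop path and shows whether the index on the materialized view is usable at all:

```sql
-- Diagnostic only: make the planner cost the nested-loop / index-scan path.
SET enable_hashjoin = off;

EXPLAIN (analyze, buffers)
SELECT transfer.from_name, transfer.to_name, transfer.value,
       transfer.address, address.other_info1, address.other_info2
FROM transfers transfer
LEFT JOIN corrected_addresses address ON transfer.address = address.address
WHERE transfer.from_name = 'some name here';

RESET enable_hashjoin;
```

If this plan uses corrected_addresses_address_idx and is much cheaper at execution time, the problem is costing/estimation, not a missing or unusable index.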
If an ANALYZE transfers; doesn't fix it, then you will need to dig into pg_stats for the column "from_name".
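For example, something like this shows the fields most relevant to the estimate (a sketch; adjust schemaname if the table is not in public):

```sql
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'   -- assumption: table lives in the public schema
  AND tablename  = 'transfers'
  AND attname    = 'from_name';
```

If 'some name here' is absent from most_common_vals, the planner falls back to a generic selectivity estimate, which can produce exactly this kind of row-count misestimate.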
For those who stumble across this question later, here is what I did to solve it:
SELECT transfer.from_name, transfer.to_name, transfer.value,
       transfer.address, address.other_info1, address.other_info2
FROM transfers transfer
LEFT JOIN LATERAL (
    SELECT * FROM corrected_addresses a_inner
    WHERE transfer.address = a_inner.address
    LIMIT 1
) AS address ON true
WHERE transfer.from_name = 'some name here';
I suppose the LIMIT 1 makes this a meaningfully different query from the plain join: the planner knows at most one row can come back per transfer, even if the view contains duplicates. In my case that's desirable, but it won't be for everyone. Still, why the index was used on the original table and not on the materialized view is beyond me, but if it works, it works.
That's a band-aid, not a solution. – Laurenz Albe, May 30, 2022 at 6:51
LIMIT 1 forces the index. Still unsure why the materialized view works differently, and the lateral join is uglier, but at least I have something workable here.
Did analyze transfer; change anything? Maybe even vacuum analyze transfer;?