I have a query which is designed to loop and search addresses for duplicates, the query uses REGEX_REPLACE. I am trying to index on the regex as on doing an explain and its doing a sequential scan on the user_property table with a filter on the regex
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Len')
)
AND cd.user_id != '2589384'
), dup_dob_address AS (
SELECT
COALESCE(MAX(CASE WHEN
(cd.DOB IS NOT NULL AND cd.DOB = '1982-06-14 00:00:00') OR (cd.BIRTH_NUMBER IS NOT NULL AND cd.BIRTH_NUMBER = null )
THEN 1 ELSE 0 END), 0) AS dob
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
)
AND cd.user_id != '2589384'
)
SELECT * FROM duplicate, dup_dob_address;
Explain result:
Nested Loop (cost=492738.45..492738.50 rows=1 width=12) (actual time=7589.136..7590.933 rows=1 loops=1)
Output: (COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0))
Buffers: shared hit=931500 read=103761
CTE user_detail
-> Finalize HashAggregate (cost=423105.99..426854.87 rows=374888 width=104) (actual time=6110.633..6172.107 rows=115625 loops=1)
Output: cp.user_id, max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
-> Gather (cost=335007.31..413733.79 rows=749776 width=104) (actual time=6024.383..6062.501 rows=115625 loops=1)
Output: cp.user_id, (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7))), (PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33)))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=908203 read=103761
-> Partial HashAggregate (cost=334007.31..337756.19 rows=374888 width=104) (actual time=6017.847..6037.215 rows=38542 loops=3)
Output: cp.user_id, PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=6017.372..6035.986 rows=37261 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=6012.321..6032.378 rows=40788 loops=1
Buffers: shared hit=320593 read=35787
-> Nested Loop (cost=1630.78..321001.76 rows=520222 width=30) (actual time=48.770..5900.888 rows=434730 loops=3)
Output: cp.user_id, cp.user_property_value, cp.user_property_type_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=45.466..5905.504 rows=420402 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=44.758..5889.927 rows=459654 loops=1
Buffers: shared hit=320593 read=35787
-> Parallel Bitmap Heap Scan on public.user c (cost=1630.22..22201.58 rows=48268 width=4) (actual time=26.536..39.410 rows=38542 loops=3)
Output: c.user_id, c.currency_code, c.user_group_id, c.user_created_on, c.user_status_id, c.user_max_credit, c.user_last_updated_on, c.user_version
Recheck Cond: (c.user_group_id = 38)
Heap Blocks: exact=2249
Buffers: shared hit=6896 read=319
Worker 0: actual time=22.735..35.486 rows=37261 loops=1
Buffers: shared hit=2303
Worker 1: actual time=22.766..36.418 rows=40788 loops=1
Buffers: shared hit=2343
-> Bitmap Index Scan on idx_user_user_group_id (cost=0.00..1601.26 rows=115844 width=0) (actual time=33.224..33.224 rows=115625 loops=1)
Index Cond: (c.user_group_id = 38)
Buffers: shared hit=1 read=319
-> Index Scan using idx_user_id_user_property on public.user_property cp (cost=0.56..5.51 rows=68 width=30) (actual time=0.036..0.150 rows=11 loops=115625)
Output: cp.user_id, cp.user_property_type_id, cp.user_property_created_on, cp.user_property_is_active, cp.user_property_value, cp.user_property_upper_value, cp.user_property_version
Index Cond: (cp.user_id = c.user_id)
Buffers: shared hit=901307 read=103442
Worker 0: actual time=0.038..0.156 rows=11 loops=37261
Buffers: shared hit=290666 read=33275
Worker 1: actual time=0.034..0.142 rows=11 loops=40788
Buffers: shared hit=318250 read=35787
-> Aggregate (cost=19766.95..19766.96 rows=1 width=8) (actual time=6882.602..6882.605 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0), COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=908203 read=103761
-> Nested Loop Left Join (cost=0.42..19766.22 rows=21 width=110) (actual time=6882.596..6882.597 rows=0 loops=1)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode
Buffers: shared hit=908203 read=103761
-> CTE Scan on user_detail cd (cost=0.00..19681.62 rows=19 width=4) (actual time=6882.595..6882.595 rows=0 loops=1)
Output: cd.user_id, cd.first_name, cd.last_name, cd.dob, cd.birth_number
Filter: ((cd.user_id <> 2589384) AND (((regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text)) OR ((regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text))))
Rows Removed by Filter: 115625
Buffers: shared hit=908203 read=103761
-> Index Scan using address_idx_01 on public.address ad (cost=0.42..4.44 rows=1 width=114) (never executed)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode, ad.user_id
Index Cond: (ad.user_id = cd.user_id)
-> Aggregate (cost=46116.63..46116.64 rows=1 width=4) (actual time=706.525..707.941 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=23297
-> Hash Join (cost=36282.83..46116.62 rows=1 width=36) (actual time=706.520..707.934 rows=0 loops=1)
Output: cd_1.dob, cd_1.birth_number
Hash Cond: (cd_1.user_id = ad_1.user_id)
Buffers: shared hit=23297
-> CTE Scan on user_detail cd_1 (cost=0.00..8434.98 rows=373014 width=40) (actual time=0.002..0.003 rows=1 loops=1)
Output: cd_1.user_id, cd_1.first_name, cd_1.last_name, cd_1.dob, cd_1.birth_number
Filter: (cd_1.user_id <> 2589384)
-> Hash (cost=36282.82..36282.82 rows=1 width=4) (actual time=706.499..707.911 rows=0 loops=1)
Output: ad_1.user_id
Buckets: 1024 Batches: 1 Memory Usage: 8kB
Buffers: shared hit=23297
-> Gather (cost=1000.00..36282.82 rows=1 width=4) (actual time=706.496..707.907 rows=0 loops=1)
Output: ad_1.user_id
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=23297
-> Parallel Seq Scan on public.address ad_1 (cost=0.00..35282.72 rows=1 width=4) (actual time=701.969..701.970 rows=0 loops=3)
Output: ad_1.user_id
Filter: (((ad_1.address_country_code)::text = 'GB'::text) AND (regexp_replace((ad_1.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND (regexp_replace((ad_1.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((COALESCE(regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)))
Rows Removed by Filter: 295033
Buffers: shared hit=23297
Worker 0: actual time=699.642..699.644 rows=0 loops=1
Buffers: shared hit=7331
Worker 1: actual time=700.498..700.499 rows=0 loops=1
Buffers: shared hit=7984
Planning Time: 17.292 ms
Execution Time: 7601.934 ms
https://explain.depesz.com/s/cbmv#html
I looked at a similar post regarding using the pg_trgm extension but made no difference when trying to index.
create index concurrently on address using gin (address_place gin_trgm_ops);
But
The size of the user_property table is approx 2.5 million rows with the size of the address table also very small < 1.7m rows.
Is there an efficient way to index on Regex_replace? or would a redesign of the query be needed?
Any help much appreciated.
1 Answer 1
I found this an interesing question +1!
In order to answer, I went down the old-fashioned route and did some testing.
All of the code below can be found on the fiddle here. The strategy employed uses GENERATED
columns (Manual). You could also use expression (aka functional) indexes - see note at the bottom of this answer).
CREATE TABLE test
(
t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
address TEXT,
post_code TEXT,
add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);
and populate with a few records:
INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');
and now, the secret sauce - indexing those GENERATED
columns:
CREATE INDEX my_field_regexp_idx ON test (add_bis);
CREATE INDEX post_code_regexp_idx ON test(p_c_bis);
and just to check (always check):
SELECT * FROM test;
Result:
t_id address post_code add_bis p_c_bis
1 The Building, Apt 13, Flr 6 abc123 The BUILDING, Apt 13, Flr 6 ABC123
2 The Building, Apt 45, Flr 8 abc456 The BUILDING, Apt 45, Flr 8 ABC456
3 The Building, Apt 45, Flr 9 abc789 The BUILDING, Apt 45, Flr 9 ABC789
First, we run this:
SET enable_seqscan = OFF;
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.
The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without enable_seqscan = OFF
, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.
From the documentation here:
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types.
It is impossible to suppress sequential scans entirely
, but turning this variable off discourages the planner from using one if there are other methods available. The default is on.
[Emphasis mine] - also, see the discussion below.
and then we run:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
*
FROM
test
WHERE add_bis LIKE 'The BUILDING';
Result:
QUERY PLAN
Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)
Output: t_id, address, post_code, add_bis, p_c_bis
Index Cond: (test.add_bis = 'The BUILDING'::text)
Filter: (test.add_bis ~~ 'The BUILDING'::text)
Buffers: shared read=1
Planning:
Buffers: shared hit=22
Planning Time: 0.150 ms
Execution Time: 0.042 ms
Excellent - the result we want - an Index Scan
:
Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)
Then, we run:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
p_c_bis
FROM test
WHERE p_c_bis = 'ABC123';
where the data requested is entirely contained in the index (of the REGEXP_REPLACE()
ed column) and et voilà:
Index Only Scan using post_code_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=32) (actual time=0.039..0.040 rows=1 loops=1)
Output: p_c_bis
Index Cond: (test.p_c_bis = 'ABC123'::text)
Heap Fetches: 1
Buffers: shared hit=1 read=1
Planning Time: 0.043 ms
Execution Time: 0.064 ms
We have an Index Only Scan
which is what we wanted! There's another example of an Index Only Scan
in the fiddle.
Interestingly, when I rerun this (at the end):
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
*
FROM test
WHERE add_bis LIKE 'The BUILDING%';
Result:
Seq Scan on public.test (cost=10000000000.00..10000000001.04 rows=1 width=132) (actual time=0.008..0.009 rows=3 loops=1)
Output: t_id, address, post_code, add_bis, p_c_bis
Filter: (test.add_bis ~~ 'The BUILDING%'::text)
Buffers: shared hit=1
Planning Time: 0.038 ms
Execution Time: 0.027 ms
I can only surmise (a nice word for "guess"!) that at this point, the optimiser knows that the table is in the memory buffer and that a Seq Scan
will be the fastest/least-costly option no matter what (see the very high cost (cost=100000000000
)). This chimes nicely with the documentation above ("It is impossible to suppress sequential scans entirely").
I've seen this before, quite why it uses the index for my first query and not for this second one is a mystery to me - the source code is a bit above my pay grade I'm afraid.
- As with all answers on StackExchange, there is no substitute for testing on your own system with your own hardware!
Question:
Is there an efficient way to index on Regex_replace?
Answer:
A REGEXP_REPLACE()
strategy will work if you use GENERATED
columns.
Final notes:
1. I also tried expression indexes
(Manual), i.e. no GENERATED
columns and the syntax would be as follows:
CREATE INDEX address_regexp_idx ON test (REGEXP_REPLACE(address, 'Building', 'BUILDING'));`
but I couldn't get any sort of Index Scan
to work (on the fiddle ) in either case - YMMV. I would urge you to check out this strategy also - you may or may not achieve the results you want that way, if the extra space tradeoff works or not for you.
2. Please always include your version of PostgreSQL in your questions.
3. The DDL of the cd
table from your query might have been useful - I've just gone with general principles, but as a rule of thumb, the more information you include in your questions the better.
4. Your query contains the lines (for example):
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'`
and
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'`
These conditions can never be met.
See the end of the fiddle:
SELECT REGEXP_REPLACE(('_Flat 25 Arl_iss Co__urt 24'), E'\\_|\\W','','g');
Result:
regexp_replace
Flat25ArlissCourt24 -- != Flat 25 Arliss Court 24
So, a string with no spaces can never be equal to a string with them.
5. Finally, you are using the outdated E'\\_|\\W'
notation - far more legible would be this '_|\W'
. There is no need to escape the _
(underscore) character - it has no special meaning in regexes and all of those backslashes make the regex difficult to read (IMHO - regexes are bad enough on their own!).
-
thanks for the detailed explanation, if the query address was different as in a different road/postcode/address line etc that would render the index unusable for the index such as this CREATE INDEX address_regexp_idx ON test (REGEXP_REPLACE(address, 'Building', 'BUILDING')); Setting sequential scans off may not be a good thing for me, have always been informed this shouldnt be done on production. I like the idea of the GENERATED columns would need to do some testing, will try out expression index as that "may" give us a quick win.rdbmsNoob– rdbmsNoob2022年12月12日 08:50:08 +00:00Commented Dec 12, 2022 at 8:50
-
1@rdbmsNoob Maybe I should have been clearer - under no circumstances would I advocate using
SET enable_seqscan = OFF
on PROD. I was using it to illustrate how PostgreSQL could use an index on a REGEXP_REPLACE. You can disable a number of query planner methods but if you were to do this in PROD, I would only suggest doing it on a case-by-case, query-by-query basis. I will modify my answer in order to emphasise this point.Vérace– Vérace2022年12月12日 09:24:05 +00:00Commented Dec 12, 2022 at 9:24 -
Thanks @Verace, yes it does, I was curious as this isnt purely just one row of the same address and infact multiple different types of addresses and the query is to look for duplicate addresses. So wouldnt be able to index on each address type. Have also tried indexing using expression indexes but thus no joy so far. I may look at maintaining the data in a separate table and introduce triggers but thinking it could get messy.rdbmsNoob– rdbmsNoob2022年12月14日 16:19:43 +00:00Commented Dec 14, 2022 at 16:19
-
1Sounds like you require a bit of data cleansing? Anyway, I've put in the fiddle that I neglected to put in previously. I think I've answered the question as asked correctly. If there's anything further, you could ask another question, maybe pointing back to this one for context if necessary!Vérace– Vérace2022年12月15日 01:07:35 +00:00Commented Dec 15, 2022 at 1:07
Explore related questions
See similar questions with these tags.