I am using PostgreSQL. I have a products table with 10 million records, and every day I update around 1 million of them. Here is the query; it ran for around 7 hours without completing, so I had to stop the execution:
update products set clean_brand_id = brands.id from brands where lower(brand) = lower(brands.name) and products.created_at >= current_date
I tried to modify the query to update only 100 records, but it still took too long and failed as well. Here is the modified query:
update products set clean_brand_id = brands.id from brands where lower(brand) = lower(brands.name) and products.created_at >= current_date and products.id in (select id from products where products.created_at >= current_date and clean_brand_id is null limit 100) and clean_brand_id is null;
Can anyone help me figure out how to improve the query or modify the table structure for this? Indexes exist on brand and clean_brand_id.
There are around 50 columns and 15 indexes on this table. The PostgreSQL version is 9.5.
EDIT:
Here is the EXPLAIN output for the first query:
QUERY PLAN
----------------------------------------------------------------------------------------------
Update on products (cost=2934598.99..4232174.12 rows=64819280 width=3220)
-> Merge Join (cost=2934598.99..4232174.12 rows=64819280 width=3220)
Merge Cond: ((lower((brands.name)::text)) = (lower((products.brand)::text)))
-> Sort (cost=4095.37..4201.01 rows=42257 width=24)
Sort Key: (lower((brands.name)::text))
-> Seq Scan on brands (cost=0.00..848.57 rows=42257 width=24)
-> Materialize (cost=2930503.62..2932037.55 rows=306786 width=3210)
-> Sort (cost=2930503.62..2931270.58 rows=306786 width=3210)
Sort Key: (lower((products.brand)::text))
-> Seq Scan on products (cost=0.00..2053185.87 rows=306786 width=3210)
Filter: (created_at >= ('now'::cstring)::date)
(11 rows)
Edit - 2
After removing the duplicate brand names from the brands table and their references from the products table, here is the EXPLAIN plan for the SQL query:
Update on products  (cost=1442396.71..1443511.50 rows=1 width=3232)
  ->  Hash Join  (cost=1442396.71..1443511.50 rows=1 width=3232)
        Hash Cond: (lower((brands.name)::text) = lower((products.brand)::text))
        ->  Seq Scan on brands  (cost=0.00..848.57 rows=42257 width=24)
        ->  Hash  (cost=1442396.70..1442396.70 rows=1 width=3222)
              ->  Nested Loop  (cost=1441549.13..1442396.70 rows=1 width=3222)
                    ->  HashAggregate  (cost=1441548.70..1441549.70 rows=100 width=32)
                          Group Key: "ANY_subquery".id
                          ->  Subquery Scan on "ANY_subquery"  (cost=66088.59..1441548.45 rows=100 width=32)
                                ->  Limit  (cost=66088.59..1441547.45 rows=100 width=4)
                                      ->  Bitmap Heap Scan on products products_1  (cost=66088.59..1936712.63 rows=136 width=4)
                                            Recheck Cond: (clean_brand_id IS NULL)
                                            Filter: (created_at >= ('now'::cstring)::date)
                                            ->  Bitmap Index Scan on index_products_on_clean_brand_id  (cost=0.00..66088.56 rows=1370683 width=0)
                                                  Index Cond: (clean_brand_id IS NULL)
                    ->  Index Scan using products_pkey on products  (cost=0.43..8.46 rows=1 width=3194)
                          Index Cond: (id = "ANY_subquery".id)
                          Filter: ((clean_brand_id IS NULL) AND (created_at >= ('now'::cstring)::date))
(18 rows)
Another factor in the shorter time shown by the second EXPLAIN plan is data volume: in the earlier execution there were around 0.8 million new records, while in the second execution there were far fewer (exact number not known).
Can anyone help me find out whether any of the existing indexes are actually used during this query's execution, and whether I can improve it by creating a composite index on multiple columns?
Edit - 3
Added two new indexes: a composite index on created_at and lower(brand) in products, and an index on lower(name) in brands.
CREATE INDEX products_lower_brand_created_at_idx ON products (created_at, lower(brand));
CREATE INDEX brands_lower_brand_idx ON brands (lower(name));
Here is the EXPLAIN plan for the same query
Update on products (cost=7.97..435.50 rows=1 width=3187)
-> Nested Loop Semi Join (cost=7.97..435.50 rows=1 width=3187)
Join Filter: (products.id = "ANY_subquery".id)
-> Hash Join (cost=7.54..426.80 rows=77 width=3159)
Hash Cond: (lower((brands.name)::text) = lower((products.brand)::text))
-> Seq Scan on brands (cost=0.00..321.76 rows=15476 width=24)
-> Hash (cost=7.53..7.53 rows=1 width=3149)
-> Index Scan using products_lower_brand_created_at_idx on products (cost=0.43..7.53 rows=1 width=3149)
Index Cond: (created_at >= ('now'::cstring)::date)
Filter: (clean_brand_id IS NULL)
-> Materialize (cost=0.43..7.54 rows=1 width=32)
-> Subquery Scan on "ANY_subquery" (cost=0.43..7.54 rows=1 width=32)
-> Limit (cost=0.43..7.53 rows=1 width=4)
-> Index Scan using products_lower_brand_created_at_idx on products products_1 (cost=0.43..7.53 rows=1 width=4)
Index Cond: (created_at >= ('now'::cstring)::date)
Filter: (clean_brand_id IS NULL)
These changes were implemented in a local database with around 0.4 million records. Can anyone help me find out whether the new indexes will improve the query execution time?
2 Answers
First, if products.brand and brands.name are meant to be the same regardless of case, then use citext. In order to get there, we have to find out where the collisions are. @RDFozz is right in the comments. We can figure that out by analyzing both tables.
SELECT lower(brand), array_agg(brand)
FROM products
GROUP BY lower(brand)
HAVING count(*) > 1;
Do the same thing on brands.
SELECT lower(name), array_agg(name)
FROM brands
GROUP BY lower(name)
HAVING count(*) > 1;
For this update to work, both queries should return 0 rows. Only after they return 0 rows can you actually run your update and fix this moving forward.
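If either query does return rows, the duplicates have to be merged before the update can behave deterministically. A minimal sketch of one way to do that, assuming you want to keep the lowest id per case-folded name (any products.clean_brand_id values that point at the discarded rows must be repointed or cleared first; adapt to your schema):
-- Keep the lowest id per case-folded name; every row that has a
-- lower-id duplicate gets removed.
DELETE FROM brands b
USING brands keep
WHERE lower(b.name) = lower(keep.name)
 AND b.id > keep.id;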
Next, if that's the kind of update you're doing, this is a case-insensitive match, and you should upgrade the columns to citext. You should also make the products table refer to the brands table, so the relationship is permanently reflected in your schema.
ALTER TABLE brands
ALTER COLUMN name
SET DATA TYPE citext;
ALTER TABLE products
ALTER COLUMN brand
SET DATA TYPE citext;
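Note that citext ships in contrib, so the extension has to be created before those ALTERs will succeed, and the foreign key mentioned above needs a unique constraint to point at. A sketch with illustrative constraint names, assuming brands.name is unique once the duplicates are gone:
-- citext is a contrib module; the extension must exist first
CREATE EXTENSION IF NOT EXISTS citext;
-- brands.name must be unique for products.brand to reference it;
-- on citext, uniqueness is enforced case-insensitively
ALTER TABLE brands
 ADD CONSTRAINT brands_name_key UNIQUE (name);
ALTER TABLE products
 ADD CONSTRAINT products_brand_fkey
 FOREIGN KEY (brand) REFERENCES brands (name);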
- I will implement your suggestion and let you know the results. Thanks in advance. – Krishna Rani Sahoo, Aug 11, 2017
- Hi @evan-carroll. I have removed the duplicate brand names from the brands table and the products table. I tried to run the ALTER command to change the data type for the brand names, but it gave me the following error: ActiveRecord::StatementInvalid: PG::UndefinedObject: ERROR: type "citext" does not exist – Krishna Rani Sahoo, Aug 14, 2017
- @krishna you need to run CREATE EXTENSION citext; first. – Evan Carroll, Aug 14, 2017
- Hi @evan-carroll. After removing the duplicate brand names, my query execution time has improved a lot. But if I create the citext extension, how do I define that the new data type will be case insensitive? – Krishna Rani Sahoo, Aug 18, 2017
- That's a command. Just run it as a database superuser: CREATE EXTENSION citext; It's a core module in contrib. – Evan Carroll, Aug 18, 2017
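To expand on that last exchange: nothing extra needs to be defined, because citext compares case-insensitively by itself once the extension exists. A minimal illustration:
CREATE EXTENSION IF NOT EXISTS citext;
-- citext folds case on comparison, so this returns true
SELECT 'ACME'::citext = 'acme'::citext;
With both columns converted, the lower() calls can then be dropped from the update entirely, e.g. the join condition becomes brand = brands.name.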
Your query output shows that a sequential scan is being done on both tables, which means your indexes are not being used. Although you have indexed brand, you need to index lower(brand) and lower(brands.name). Because you apply a function (lower) to the column before filtering, the plain index can't be used. The other option, if you don't want an expression index, is to stop using lower; one or the other must be done so that an index can be used.
The products table is also being filtered by products.created_at, so you should consider a combined index covering created_at and lower(brand) on that table.
When your database can't use an index, it has to pull in all 10 million rows, convert them all to lower case, and then filter. Once you have the right indexes on both tables, the Seq Scans will go away and the query should speed up considerably, because it will only pull the roughly 1 million rows you are looking for.
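A sketch of the indexes this answer describes (index names are illustrative):
-- composite index covering the created_at filter and the join key
CREATE INDEX products_created_at_lower_brand_idx
 ON products (created_at, lower(brand));
-- expression index supporting the join on the brands side
CREATE INDEX brands_lower_name_idx
 ON brands (lower(name));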
- Hi @indiri, I have modified my question with the latest EXPLAIN plan after implementing the combined index on lower(brand) and created_at for the products table and the index on the brands table. Can you verify whether the new plan will improve the SQL execution? – Krishna Rani Sahoo, Aug 18, 2017
- It's better now, as products will be using the index. However, brands is still not using one and looks to be using a sequential scan still. Do you have an index on brands for lower(name) as well? If not, it looks like you may want to add it. – indiri, Aug 18, 2017
- Hi @indiri, I have added an index for lower(name) in the brands table and added the SQL statement for it to the question. – Krishna Rani Sahoo, Aug 21, 2017
Comments on the question:
- Do you have indexes on lower(brand) and products.created_at? Can you post the DDL for those tables?
- The plan matches multiple rows in brands for each row in products - note that before the join we have 300K rows in products and 42k in brands; after the join, we have 65 million rows. Does that make sense, given what you know about the tables in question? Could there be different brands.id values for the same lower(brands.name) value?
- Can you post the output of \d brands and \d products?