I am using PostgreSQL. I have a products table with 10 million records, and every day I update around 1 million of them. Here is the query; it ran for around 7 hours without completing, so I had to stop the execution:
update products set clean_brand_id = brands.id from brands where lower(brand) = lower(brands.name) and products.created_at >= current_date
I tried to modify the query to update only 100 records, but it still took too long and failed as well. Here is the modified query:
update products set clean_brand_id = brands.id from brands where lower(brand) = lower(brands.name) and products.created_at >= current_date and products.id in (select id from products where products.created_at >= current_date and clean_brand_id is null limit 100) and clean_brand_id is null;
Can anyone help me figure out how to improve the query or modify the table structure for this? Indexes exist on brand and clean_brand_id.
There are around 50 columns and 15 indexes on this table. The PostgreSQL version is 9.5.
EDIT:
Here is the EXPLAIN output for the first query:
QUERY PLAN
----------------------------------------------------------------------------------------------
Update on products (cost=2934598.99..4232174.12 rows=64819280 width=3220)
-> Merge Join (cost=2934598.99..4232174.12 rows=64819280 width=3220)
Merge Cond: ((lower((brands.name)::text)) = (lower((products.brand)::text)))
-> Sort (cost=4095.37..4201.01 rows=42257 width=24)
Sort Key: (lower((brands.name)::text))
-> Seq Scan on brands (cost=0.00..848.57 rows=42257 width=24)
-> Materialize (cost=2930503.62..2932037.55 rows=306786 width=3210)
-> Sort (cost=2930503.62..2931270.58 rows=306786 width=3210)
Sort Key: (lower((products.brand)::text))
-> Seq Scan on products (cost=0.00..2053185.87 rows=306786 width=3210)
Filter: (created_at >= ('now'::cstring)::date)
(11 rows)
Edit - 2
After removing the duplicate brand names from the brands table and their references from the products table, here is the EXPLAIN plan for the SQL query:
Update on products  (cost=1442396.71..1443511.50 rows=1 width=3232)
  ->  Hash Join  (cost=1442396.71..1443511.50 rows=1 width=3232)
        Hash Cond: (lower((brands.name)::text) = lower((products.brand)::text))
        ->  Seq Scan on brands  (cost=0.00..848.57 rows=42257 width=24)
        ->  Hash  (cost=1442396.70..1442396.70 rows=1 width=3222)
              ->  Nested Loop  (cost=1441549.13..1442396.70 rows=1 width=3222)
                    ->  HashAggregate  (cost=1441548.70..1441549.70 rows=100 width=32)
                          Group Key: "ANY_subquery".id
                          ->  Subquery Scan on "ANY_subquery"  (cost=66088.59..1441548.45 rows=100 width=32)
                                ->  Limit  (cost=66088.59..1441547.45 rows=100 width=4)
                                      ->  Bitmap Heap Scan on products products_1  (cost=66088.59..1936712.63 rows=136 width=4)
                                            Recheck Cond: (clean_brand_id IS NULL)
                                            Filter: (created_at >= ('now'::cstring)::date)
                                            ->  Bitmap Index Scan on index_products_on_clean_brand_id  (cost=0.00..66088.56 rows=1370683 width=0)
                                                  Index Cond: (clean_brand_id IS NULL)
                    ->  Index Scan using products_pkey on products  (cost=0.43..8.46 rows=1 width=3194)
                          Index Cond: (id = "ANY_subquery".id)
                          Filter: ((clean_brand_id IS NULL) AND (created_at >= ('now'::cstring)::date))
(18 rows)
Another factor in the shorter time shown by the second EXPLAIN plan is data volume: in the earlier execution there were around 0.8 million new records, while in the second execution there were far fewer (exact number not known).
Can anyone help me find out whether any of the existing indexes are actually used during this query's execution, and whether I can improve it by creating a composite index on multiple columns?
Edit - 3
Added two new indexes: a composite index on created_at and lower(brand) in products, and an index on lower(name) in brands.
CREATE INDEX products_lower_brand_created_at_idx ON products (created_at, lower(brand));
CREATE INDEX brands_lower_brand_idx ON brands (lower(name));
Here is the EXPLAIN plan for the same query
Update on products (cost=7.97..435.50 rows=1 width=3187)
-> Nested Loop Semi Join (cost=7.97..435.50 rows=1 width=3187)
Join Filter: (products.id = "ANY_subquery".id)
-> Hash Join (cost=7.54..426.80 rows=77 width=3159)
Hash Cond: (lower((brands.name)::text) = lower((products.brand)::text))
-> Seq Scan on brands (cost=0.00..321.76 rows=15476 width=24)
-> Hash (cost=7.53..7.53 rows=1 width=3149)
-> Index Scan using products_lower_brand_created_at_idx on products (cost=0.43..7.53 rows=1 width=3149)
Index Cond: (created_at >= ('now'::cstring)::date)
Filter: (clean_brand_id IS NULL)
-> Materialize (cost=0.43..7.54 rows=1 width=32)
-> Subquery Scan on "ANY_subquery" (cost=0.43..7.54 rows=1 width=32)
-> Limit (cost=0.43..7.53 rows=1 width=4)
-> Index Scan using products_lower_brand_created_at_idx on products products_1 (cost=0.43..7.53 rows=1 width=4)
Index Cond: (created_at >= ('now'::cstring)::date)
Filter: (clean_brand_id IS NULL)
These changes were implemented in a local database with around 0.4 million records. Can anyone help me find out whether the new indexes will improve the query execution time?
2 Answers
First, if products.brand and brands.name are meant to be the same regardless of case, then use citext. In order to get there, we have to find out where the collisions are. @RDFozz is right in the comments. We can figure that out by analyzing both tables.
SELECT lower(brand), array_agg(brand)
FROM products
GROUP BY lower(brand)
HAVING count(*) > 1;
Do the same thing on brands.
SELECT lower(name), array_agg(name)
FROM brands
GROUP BY lower(name)
HAVING count(*) > 1;
For this update to work, both queries should return 0 rows. Only after they return 0 rows can you actually run your update and fix this moving forward.
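If either query does return rows, the duplicates have to be merged before the update can behave deterministically. A minimal sketch of one way to do that, assuming you want to keep the lowest id per case-folded name (any products.clean_brand_id values that point at the discarded rows must be repointed or cleared first; adapt to your schema):
-- Keep the lowest id per case-folded name; every row that has a
-- lower-id duplicate gets removed.
DELETE FROM brands b
USING brands keep
WHERE lower(b.name) = lower(keep.name)
 AND b.id > keep.id;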
Next, if that's the kind of update you're doing, this is a case-insensitive match, and you should upgrade the columns to citext. You should also make the products table refer to the brands table, so the relationship is permanently reflected in your schema.
ALTER TABLE brands
ALTER COLUMN name
SET DATA TYPE citext;
ALTER TABLE products
ALTER COLUMN brand
SET DATA TYPE citext;
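Note that citext ships in contrib, so the extension has to be created before those ALTERs will succeed, and the foreign key mentioned above needs a unique constraint to point at. A sketch with illustrative constraint names, assuming brands.name is unique once the duplicates are gone:
-- citext is a contrib module; the extension must exist first
CREATE EXTENSION IF NOT EXISTS citext;
-- brands.name must be unique for products.brand to reference it;
-- on citext, uniqueness is enforced case-insensitively
ALTER TABLE brands
 ADD CONSTRAINT brands_name_key UNIQUE (name);
ALTER TABLE products
 ADD CONSTRAINT products_brand_fkey
 FOREIGN KEY (brand) REFERENCES brands (name);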
- I will implement your suggestion and let you know the results. Thanks in advance. – Krishna Rani Sahoo, Aug 11, 2017
- Hi @evan-carroll. I have removed the duplicate brand names from the brands table and the products table. I tried to run the ALTER command to change the data type for the brand names, but it gave me the following error: ActiveRecord::StatementInvalid: PG::UndefinedObject: ERROR: type "citext" does not exist – Krishna Rani Sahoo, Aug 14, 2017
- @krishna you need to run CREATE EXTENSION citext; first. – Evan Carroll, Aug 14, 2017
- Hi @evan-carroll. After removing the duplicate brand names, my query execution time has improved a lot. But if I create the citext extension, how do I define that the new data type will be case insensitive? – Krishna Rani Sahoo, Aug 18, 2017
- That's a command. Just run it as a database superuser: CREATE EXTENSION citext; It's a core module in contrib. – Evan Carroll, Aug 18, 2017
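To expand on that last exchange: nothing extra needs to be defined, because citext compares case-insensitively by itself once the extension exists. A minimal illustration:
CREATE EXTENSION IF NOT EXISTS citext;
-- citext folds case on comparison, so this returns true
SELECT 'ACME'::citext = 'acme'::citext;
With both columns converted, the lower() calls can then be dropped from the update entirely, e.g. the join condition becomes brand = brands.name.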
Your query output shows that a sequential scan is being done on both tables, which means your indexes are not being used. Although you have indexed brand, you need to index lower(brand) and lower(brands.name). Because you apply a function (lower) to the column before filtering, the plain index can't be used. The other option, if you don't want an expression index, is to stop using lower; one or the other must be done so that an index can be used.
The products table is also being filtered by products.created_at, so you should consider a combined index covering created_at and lower(brand) on that table.
When your database can't use an index, it has to pull in all 10 million rows, convert them all to lower case, and then filter. Once you have the right indexes on both tables, the Seq Scans will go away and the query should speed up considerably, because it will only pull the roughly 1 million rows you are looking for.
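A sketch of the indexes this answer describes (index names are illustrative):
-- composite index covering the created_at filter and the join key
CREATE INDEX products_created_at_lower_brand_idx
 ON products (created_at, lower(brand));
-- expression index supporting the join on the brands side
CREATE INDEX brands_lower_name_idx
 ON brands (lower(name));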
- Hi @indiri, I have modified my question with the latest EXPLAIN plan after implementing the combined index on lower(brand) and created_at for the products table and the index on the brands table. Can you verify whether the new plan will improve the SQL execution? – Krishna Rani Sahoo, Aug 18, 2017
- It's better now, as products will be using the index. However, brands is still not using one and looks to be using a sequential scan still. Do you have an index on brands for lower(name) as well? If not, it looks like you may want to add it. – indiri, Aug 18, 2017
- Hi @indiri, I have added an index for lower(name) in the brands table and added the SQL statement for it to the question. – Krishna Rani Sahoo, Aug 21, 2017
Comments on the question:
- Do you have indexes on lower(brand) and products.created_at? Can you post the DDL for those tables?
- The plan matches multiple rows in brands for each row in products - note that before the join we have 300K rows in products and 42k in brands; after the join, we have 65 million rows. Does that make sense, given what you know about the tables in question? Could there be different brands.id values for the same lower(brands.name) value?
- Can you post the output of \d brands and \d products?