Optimizing a query by adding an index

Question 1

Here are my tables:

accounts_company:
 - id
 ...
accounts_companyview:
 - account_id: FK to company
 - viewer_id: FK to company
shipments_address:
 - companyview_id: FK to companyview
 - is_shipper: boolean (not null)
accounts_similarcompanyview
 - original_id: FK to companyview
 - similar_id: FK to companyview

This is my query (generated by Django)

SELECT COUNT(*) AS "__count"
 FROM "accounts_similarcompanyview"
 INNER JOIN "accounts_companyview"
 ON ("accounts_similarcompanyview"."original_id" = "accounts_companyview"."id")
 INNER JOIN "shipments_address"
 ON ("accounts_companyview"."id" = "shipments_address"."company_view_id")
 INNER JOIN "accounts_companyview" T5
 ON ("accounts_similarcompanyview"."similar_id" = T5."id")
 INNER JOIN "shipments_address" T6
 ON (T5."id" = T6."company_view_id")
 WHERE ("accounts_similarcompanyview"."are_similar" = true AND "shipments_address"."is_shipper" = true AND "accounts_companyview"."viewer_id" = 51729 AND T6."is_shipper" = true AND T5."viewer_id" = 51729);

Output of EXPLAIN ANALYZE: https://explain.depesz.com/s/slJK

I can see that there is an Index Scan taking most of the time, so I tried to add a few indexes but no change whatsoever, I still get a full index scan.

CREATE INDEX "shipments_a_is_ship_d78ee4_idx" ON "shipments_address" ("is_shipper");

CREATE INDEX "shipments_a_is_ship_d78ee4_idx" ON "shipments_address" ("is_shipper", "company_view_id");

What index could I add to make this query faster?

EDIT: forgot to add the size of the tables

select count(*) from shipments_address;
 2998765
select count(*) from accounts_company;
 168224
select count(*) from accounts_companyview;
 371560
select count(*) from accounts_similarcompanyview;
 83434

Question 2

Could you please enable IO statistics: SET STATISTICS IO ON; and post the message from SSMS after running your query with it enabled? And since you're working with views it's possible that something on the view code (that was made for a different purpose ohter than the result of this specific query) is causing the scan.

Question 3

@Ronaldo: You are talking MS SQL Server, but this is about Postgres.

Question 4

Sorry for my lack of attention, you're right. But I believe my point about the view having code that was meant to attend another query being the reason of your scans still stands. I don't believe Postgres differs from SQL Server on that point. Was the view created for this specific query?

Question 5

In PostgreSQL, that is spelled track_io_timing.

Question 6

@Ronaldo I don't think I'm working with views, accounts_companyview is the name of the table. Or maybe I misunderstood something?

Question 7

The main problem are the bad estimates of the join cardinalities that lead PostgreSQL to use a nested loop join when a hash join would perform better.

There is one simple thing you can do to reduce the impact of the outermost nested loop join:

CREATE INDEX ON shipments_address (company_view_id) WHERE is_shipper;

That should cut the execution time roughly in half.

Other than that, I can think of the ugly method of temporarily disabling nested loop joins:

BEGIN;
SET LOCAL enable_nestloop = off;
/* your query */
COMMIT;

Question 8

Thanks, that seemed to make it a lot faster (from 75s to 15s) on my staging db: explain.depesz.com/s/p1CL

Question 9

You are not getting a full index scan. You are getting a regular (partial or parameterized) index scan, repeated 52,190 times. Each individual scan is very fast (I'm not sure how much it can be improved) but when you do it that many times it adds up.

It might be faster to do a hash join against that table rather than the nested loop index scan. It depends on how big the table is. You can drop the index "shipments_address_company_view_id_1119f23e" and see what happens. Or you could turn up random_page_cost (just in this one session) and see if that forces the switch, and whether it is better once it does.

CREATE INDEX "shipments_a_is_ship_d78ee4_idx" ON "shipments_address" ("is_shipper", "company_view_id");

I would expect that that index would be used. Although I wouldn't expect using it to improve your performance dramatically--at most 50%, certainly not 10 fold. Also, I would suggest swapping the order of the columns. Either way would work for this particular query, but the putting the boolean last will probably make it usable for a greater variety of queries and allow you to drop the index on (company_view_id) alone.

Perhaps you can drop the other index and see if this one gets used then. If you don't have a test/QA database setup and don't want to really drop the index in production, you can drop it inside a transaction, do the EXPLAIN and then roll it back. Other users will be locked out of your table from the DROP until the ROLLBACK, but if you put all the commands into a file, or put them all on the command line on one line, and if do a plain EXPLAIN rather than an EXPLAIN ANALYZE, this should be a small fraction of a second.

Question 10

Thanks, I ran it on my staging DB with the index reversed: "company_view_id", "is_shipper" and it took it from 75s to 16s: explain.depesz.com/s/yDXq

Laurenz Albe Laurenz Albe 62k4 gold badges57 silver badges93 bronze badges · Answer 1 · 2019-10-21 19:51:34Z

The main problem are the bad estimates of the join cardinalities that lead PostgreSQL to use a nested loop join when a hash join would perform better.

There is one simple thing you can do to reduce the impact of the outermost nested loop join:

CREATE INDEX ON shipments_address (company_view_id) WHERE is_shipper;

That should cut the execution time roughly in half.

Other than that, I can think of the ugly method of temporarily disabling nested loop joins:

BEGIN;
SET LOCAL enable_nestloop = off;
/* your query */
COMMIT;

Thanks, that seemed to make it a lot faster (from 75s to 15s) on my staging db: explain.depesz.com/s/p1CL

jjanes jjanes 42.4k3 gold badges44 silver badges54 bronze badges · Answer 2 · 2019-10-21 20:03:58Z

You are not getting a full index scan. You are getting a regular (partial or parameterized) index scan, repeated 52,190 times. Each individual scan is very fast (I'm not sure how much it can be improved) but when you do it that many times it adds up.

It might be faster to do a hash join against that table rather than the nested loop index scan. It depends on how big the table is. You can drop the index "shipments_address_company_view_id_1119f23e" and see what happens. Or you could turn up random_page_cost (just in this one session) and see if that forces the switch, and whether it is better once it does.

CREATE INDEX "shipments_a_is_ship_d78ee4_idx" ON "shipments_address" ("is_shipper", "company_view_id");

I would expect that that index would be used. Although I wouldn't expect using it to improve your performance dramatically--at most 50%, certainly not 10 fold. Also, I would suggest swapping the order of the columns. Either way would work for this particular query, but the putting the boolean last will probably make it usable for a greater variety of queries and allow you to drop the index on (company_view_id) alone.

Perhaps you can drop the other index and see if this one gets used then. If you don't have a test/QA database setup and don't want to really drop the index in production, you can drop it inside a transaction, do the EXPLAIN and then roll it back. Other users will be locked out of your table from the DROP until the ROLLBACK, but if you put all the commands into a file, or put them all on the command line on one line, and if do a plain EXPLAIN rather than an EXPLAIN ANALYZE, this should be a small fraction of a second.

Thanks, I ran it on my staging DB with the index reversed: "company_view_id", "is_shipper" and it took it from 75s to 16s: explain.depesz.com/s/yDXq

Stack Exchange Network

Optimizing a query by adding an index

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Optimizing a query by adding an index

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions