0

I have a large table join on the keys. The explain plan shows that most of the time is spent on sorting.

Is there anything I can do to prevent the large sort steps? Can I pre-sort the table?

Query and plan here : https://explain.dalibo.com/plan/15a6a91492g8g935#plan/node/5

Both tables already have an index on the key, like this, but that index does not seem to be used:

CREATE UNIQUE INDEX "stage_fact_revenue_900_idx_a" ON "stage"."stage_fact_revenue_900_add_dims" USING BTREE ("date","billed_relation_id","subject_client_id","identifier","revenue_type");
CREATE UNIQUE INDEX "fact_revenue_idx_a" ON "star"."fact_revenue" USING BTREE ("date","identifier","billed_relation_id","subject_client_id","revenue_type");
asked May 15, 2023 at 15:43
1
  • 3
    Could you please provide the DDL of the relevant tables and the EXPLAIN plan (as text). I think you require more indexes, but it's difficult to tell without further info! Commented May 15, 2023 at 16:50

1 Answer 1

0

An index speeds up the search if the returned rows are not more than roughly 10%, otherwise a full scan of the table is faster.

Postgresql chooses a full scan for stage_fact_revenue_900_add_dims because it needs to select all rows and it needs columns not present in the index. For the same reason, it thinks that it's better to scan fact_revenue once and sort the results rather than doing 177 millions index seeks.

Some things you could try:

  1. Add to your query an explicit ORDER BY date, identifier, billed_relation_id, subject_client_id, revenue_type this should suggest to Postgres that using the index on fact_revenue could avoid the sort.
  2. Change the first index to have the same fields order of the second (or add a new index in that order). At present they have the same fields but in different orders, so only one of them can actually be used to avoid the sort.
  3. Try rewriting the query using NOT EXISTS instead of a LEFT JOIN:

Like this:

select
 stage_fact_revenue_900_add_dims.date as date,
 stage_fact_revenue_900_add_dims.dim_date_key as dim_date_key,
 stage_fact_revenue_900_add_dims.identifier as identifier,
 stage_fact_revenue_900_add_dims.revenue_type as revenue_type,
 stage_fact_revenue_900_add_dims.product_id as product_id,
 stage_fact_revenue_900_add_dims.dim_product_key as dim_product_key,
 stage_fact_revenue_900_add_dims.billing_partner_id as billing_partner_id,
 stage_fact_revenue_900_add_dims.dim_billing_partner_key as dim_billing_partner_key,
 stage_fact_revenue_900_add_dims.billed_relation_id as billed_relation_id,
 stage_fact_revenue_900_add_dims.subject_client_id as subject_client_id,
 stage_fact_revenue_900_add_dims.dim_subject_client_key as dim_subject_client_key,
 stage_fact_revenue_900_add_dims.client_id as client_id,
 stage_fact_revenue_900_add_dims.dim_billed_client_key as dim_billed_client_key,
 stage_fact_revenue_900_add_dims.partner_id as partner_id,
 stage_fact_revenue_900_add_dims.dim_billed_partner_key as dim_billed_partner_key,
 stage_fact_revenue_900_add_dims.price as price,
 stage_fact_revenue_900_add_dims.duration as duration,
 stage_fact_revenue_900_add_dims.amount as amount,
 stage_fact_revenue_900_add_dims.leaf as leaf,
 stage_fact_revenue_900_add_dims.dss_record_source as dss_record_source,
 current_timestamp as dss_create_time,
 current_timestamp as dss_update_time
from
 stage.stage_fact_revenue_900_add_dims stage_fact_revenue_900_add_dims
 
where
 NOT EXISTS (SELECT 1 FROM star.fact_revenue fact_revenue WHERE stage_fact_revenue_900_add_dims.date = fact_revenue.date
 and stage_fact_revenue_900_add_dims.identifier = fact_revenue.identifier
 and stage_fact_revenue_900_add_dims.revenue_type = fact_revenue.revenue_type
 and stage_fact_revenue_900_add_dims.billed_relation_id = fact_revenue.billed_relation_id
 and stage_fact_revenue_900_add_dims.subject_client_id = fact_revenue.subject_client_id)
ORDER BY date, identifier, billed_relation_id, subject_client_id, revenue_type;
answered May 16, 2023 at 11:23
1
  • Hi @andrea-b, thank you or your extensive answer. I tried all things. resulted in a relatively small gain. In the end I found out that the main time absorber was something else: the query was used to insert data into a table, and that table contained 17 indexes..... Pruning the amount of indexes was the best performance gain. Commented Jun 9, 2023 at 11:05

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.