I have encountered a scenario where the same query on a PostgreSQL database is exhibiting different index selection and join strategies between the QA and Prod environments. I'm trying to understand the possible reasons behind this behaviour.
Here are the details:
QA Environment:
- Smaller dataset compared to Prod
- Query uses nested loop join
- Query uses the idx_user_id_id_customer_id index
Prod Environment:
- Larger dataset compared to QA
- Query uses merge join
- Query uses the idx_customer_id index
- Size of the idx_user_id_customer_id index is 118GB, while the idx_customer_id index is 85GB
Both environments have the same set of indexes. The main differences lie in the size of the data and the execution plans chosen by the query optimizer.
Prod explain log: https://explain.depesz.com/s/28la
QA explain log: https://explain.depesz.com/s/zM6e
1. What could be the possible reasons for the disparity in index selection and join strategy between the two environments?
2. Are there any specific factors that influence the optimizer's decision-making process?
Here is what I think; please correct me if I am wrong and add more information:
- It's using a nested loop join instead of a merge join on QA because there might be fewer rows on one side of the join.
- The idx_user_id_customer_id index is large, which might be why the planner is ignoring it. Or the selectivity of user_id might be lower than that of customer_id, which is why it's picking customer_id.
1 Answer
- What could be the possible reasons for the disparity in index selection and join strategy between the two environments?
QA Environment: Smaller dataset compared to Prod
...
Prod Environment: Larger dataset compared to QA
That's the disparity: the data is different, mostly in the number of rows.
- Are there any specific factors that influence the optimizer's decision-making process?
Yes, the size of the data. Different operations in a query plan are more efficient depending on how much data they operate on. A nested loop is typically more efficient when a small set of rows is being joined, while a merge join tends to be better when both inputs are large.
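To see the inputs behind these choices, it can help to run the same catalog queries in both environments and compare the numbers the planner works from. This is a minimal sketch, not a full diagnosis: the index names come from the question, while the column names user_id and customer_id are an assumption inferred from those index names.

```sql
-- Size of each index as the planner sees it (estimated entries, pages)
-- alongside its on-disk size. Index names are taken from the question.
SELECT c.relname,
       c.reltuples::bigint AS estimated_rows,
       c.relpages,
       pg_size_pretty(pg_relation_size(c.oid)) AS on_disk_size
FROM pg_class AS c
WHERE c.relname IN ('idx_customer_id', 'idx_user_id_customer_id');

-- Per-column statistics that feed selectivity estimates.
-- Column names are assumed; adjust them to the real schema.
SELECT tablename, attname, n_distinct, null_frac, correlation
FROM pg_stats
WHERE attname IN ('user_id', 'customer_id');
```

If n_distinct or the estimated row counts differ a lot between QA and Prod, that alone can be enough for the planner to prefer a different index or join method.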
It's tough to keep multiple environments consistent enough to always get the same query plans for every query; to do so, you'd have to maintain a fairly similar set of data across environments, in all of your relevant tables.
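Data volume aside, planner-related settings can also drift between environments. Nothing in the plans from the question shows that this is happening here, so treat the following as a way to rule it out. It is a sketch that lists a handful of standard PostgreSQL settings affecting cost estimates; the selection of settings is my own and not exhaustive.

```sql
-- Run in both QA and Prod and compare the output side by side.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN (
    'random_page_cost',
    'seq_page_cost',
    'effective_cache_size',
    'work_mem',
    'default_statistics_target',
    'enable_nestloop',
    'enable_mergejoin'
)
ORDER BY name;
```

If the settings match, the difference in plans most likely comes down to the data and its statistics, as described above.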
- What's your take on the difference between the indexes used? (sujeet, Jul 9, 2023 at 14:19)
- @sujeet Same reason as the rest of my answer. Index selection is based on what the SQL engine thinks is most efficient for that particular query, given the exact data and its statistics. Different data can make one index more performant to scan than another, for example. (J.D., Jul 11, 2023 at 15:03)
customer
is different.