I have a rather long SQL query that relies on CASE expressions to give me the percentage of the population of a particular ZIP code that fits a particular parameter. In the code below, for example, it gives me the percentage of households with an elderly parent.
Now, the code below is actually created by a Python script that generates it from the list of selected factors and their possible values. One field has 35 possible values, which generates 35 different CASE expressions. In the example below, homeowner status breaks out into three CASE expressions, but it actually has 8 potential values...
The code runs. However, it takes a month of Sundays, and I'm wondering whether there is some way of optimizing it. My somewhat unhelpful SQL professor used to say, "Just let the query optimizer deal with it -- lots of smart people worked on that so you don't have to worry"...
SELECT
    DD_1409.ZIP_CODE,
    count(*) RECORDS,
    (SUM(CASE WHEN GENDER = 'F' THEN 1 ELSE 0 END) / count(*)) * 100 GENDER_F,
    (SUM(CASE WHEN PRESENCE_OF_ELDERLY_PARENT = 'Y' THEN 1 ELSE 0 END) / count(*)) * 100 PRESENCE_ELDERLY_PARENT_Y,
    MEDIAN(LENGTH_OF_RESIDENCE),
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'P' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_P,
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'R' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_R,
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'U' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_U,
    (SUM(CASE WHEN MOPS.KBMG_INDEX_GEN2.GEN2 IN ('T1','T2','T3','T4','T5','T6','T7') THEN 1 ELSE 0 END) / count(*)) * 100 GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;
I've tagged both Oracle and PostgreSQL, as the query currently resides in Oracle but may be migrated to a PostgreSQL instance in the next month.
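For context, the generator is roughly along these lines. This is a simplified sketch, not the actual script; the function names and the factor list are illustrative:

```python
# Simplified sketch of the generator: for each (column, values) factor it
# emits one CASE-based percentage expression per value.
def case_expr(column, value):
    return (
        f"(SUM(CASE WHEN {column} = '{value}' THEN 1 ELSE 0 END) "
        f"/ count(*)) * 100 {column}_{value}"
    )

def build_select(factors):
    cols = ["DD_1409.ZIP_CODE", "count(*) RECORDS"]
    for column, values in factors:
        cols.extend(case_expr(column, v) for v in values)
    return "SELECT\n  " + ",\n  ".join(cols)

# A factor with 8 values would emit 8 expressions; 35 values, 35 expressions.
print(build_select([("HOMEOWNER_STATUS", ["P", "R", "U"])]))
```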
- Optimizing expressions in the SELECT list won't noticeably improve query performance (unless you have subselects there, which you obviously don't). Look elsewhere; start with studying the query execution plan. (mustaccio, May 4, 2018)
- Agree with @mustaccio. These CASE statements are not what is causing the issue. The query execution plan is a good place to start. Are the tables indexed? Can the indexes be altered or improved? Specifically look for indexes, or missing indexes, on STATE, INDIVIDUAL_ID_NUMBER, MOPS.INDEX_GEN2.IID, ONE_PER_ADDRESS, and DD_1409.ZIP_CODE. (Shooter McGavin, May 4, 2018)
- Instead of count(*), can't you use an indexed column? Depending on some other factors, that might improve performance. (Bertrand Leroy, May 4, 2018)
- Please show us your execution plan(s), table definitions, and which indexes you have. (Colin 't Hart, May 4, 2018)
- What percentage of your total data has state California and "one per address"? How many rows is your query supposed to be processing? If it's always millions of rows, there's not really much way to make it faster, save to investigate things like materialized views. (Colin 't Hart, May 4, 2018)
2 Answers
In PostgreSQL, you can use the FILTER clause as follows (untested):
SELECT
    DD_1409.ZIP_CODE,
    count(*) AS RECORDS,
    -- multiply by 100.0 before dividing: count/count alone is integer division in PostgreSQL
    count(*) filter (where GENDER = 'F') * 100.0 / count(*) AS GENDER_F,
    count(*) filter (where PRESENCE_OF_ELDERLY_PARENT = 'Y') * 100.0 / count(*) AS PRESENCE_ELDERLY_PARENT_Y,
    -- PostgreSQL has no MEDIAN(); use the ordered-set aggregate instead
    percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE) AS MEDIAN_LENGTH_OF_RESIDENCE,
    count(*) filter (where HOMEOWNER_STATUS = 'P') * 100.0 / count(*) AS HOMEOWNER_STATUS_P,
    count(*) filter (where HOMEOWNER_STATUS = 'R') * 100.0 / count(*) AS HOMEOWNER_STATUS_R,
    count(*) filter (where HOMEOWNER_STATUS = 'U') * 100.0 / count(*) AS HOMEOWNER_STATUS_U,
    count(*) filter (where MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7')) * 100.0 / count(*) AS GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;
But as others have mentioned, optimising the expressions isn't going to help you; the PostgreSQL syntax is merely cleaner.
You'll need to show us your execution plan(s) and table definitions for us to be able to help you further.
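Since the query is machine-generated, switching the generator's template to the FILTER form is a small change. A hedged sketch (illustrative function name, not the asker's actual script), with the 100.0 factor placed before the division to avoid integer truncation:

```python
# Sketch: emit the FILTER form instead of a CASE expression.
# Multiplying by 100.0 before dividing keeps the arithmetic numeric,
# since count(*)/count(*) alone would truncate in PostgreSQL.
def filter_expr(column, value):
    return (
        f"count(*) filter (where {column} = '{value}') "
        f"* 100.0 / count(*) AS {column}_{value}"
    )

print(filter_expr("HOMEOWNER_STATUS", "P"))
```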
In PostgreSQL, at least, you can write those CASE expressions as casts from boolean to int (I'm not sure how boolean expressions work in Oracle), but that will probably not speed up the query much.
The slow part is probably the join, not the aggregation of the results.
SELECT
    DD_1409.ZIP_CODE,
    count(*) AS RECORDS,
    SUM(cast(GENDER = 'F' as int)) * 100.0 / count(*) AS GENDER_F,
    SUM(cast(PRESENCE_OF_ELDERLY_PARENT = 'Y' as int)) * 100.0 / count(*) AS PRESENCE_ELDERLY_PARENT_Y,
    -- PostgreSQL has no MEDIAN(); use the ordered-set aggregate instead
    percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE) AS MEDIAN_LENGTH_OF_RESIDENCE,
    SUM(cast(HOMEOWNER_STATUS = 'P' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_P,
    SUM(cast(HOMEOWNER_STATUS = 'R' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_R,
    SUM(cast(HOMEOWNER_STATUS = 'U' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_U,
    SUM(cast(MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7') as int)) * 100.0 / count(*) AS GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;