1

I have a rather long SQL query that relies on CASE expressions to give me the percentage of a zip code's population that fits a particular parameter. In the code below, for example, it gives me the percentage of households with an elderly parent.

The code below is actually created by a Python script, which generates it from the list of selected factors and their possible values. One field has 35 possible values, which generates 35 different CASE expressions. In the example below, homeowner status has three values, which break out into three CASE expressions, but in fact it has 8 potential values...
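
The generator script itself is not shown, so the following is only a minimal sketch of what such a generator might look like. The `FACTORS` mapping and the `percentage_columns` helper are hypothetical names, and the alias scheme (`FACTOR_VALUE`) is a simplification of the aliases in the query below.

```python
# Hypothetical mapping of factor -> values to break out; the real script's
# inputs are not shown in the question.
FACTORS = {
    "GENDER": ["F"],
    "HOMEOWNER_STATUS": ["P", "R", "U"],
}


def percentage_columns(factors):
    """Return one percentage-of-rows SELECT expression per (factor, value)."""
    columns = []
    for factor, values in factors.items():
        for value in values:
            # Values are interpolated directly, so this is only safe for a
            # trusted, fixed list of factors and values.
            columns.append(
                f"(SUM(CASE WHEN {factor} = '{value}' THEN 1 ELSE 0 END)"
                f" / COUNT(*)) * 100 {factor}_{value}"
            )
    return columns


select_list = ",\n ".join(percentage_columns(FACTORS))
print(select_list)
```

Each (factor, value) pair becomes one output column, which is why a field with 35 values produces 35 expressions.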

The code runs. However, it takes a month of Sundays, and I'm wondering if there is some way of optimizing it. My somewhat unhelpful SQL professor used to say, "Just let the query optimizer deal with it. Lots of smart people worked on that so you don't have to worry" ...

SELECT
 DD_1409.ZIP_CODE,
 count(*) RECORDS,
 (SUM((CASE WHEN GENDER = 'F'
 then 1
 else 0 end)) / count(*)) * 100 GENDER_F,
 (SUM((CASE WHEN PRESENCE_OF_ELDERLY_PARENT = 'Y'
 then 1
 else 0 end)) / count(*)) * 100 PRESENCE_ELDERLY_PARENT_Y,
 MEDIAN(LENGTH_OF_RESIDENCE),
 (SUM((CASE WHEN HOMEOWNER_STATUS = 'P'
 then 1
 else 0 end)) / count(*)) * 100 HOMEOWNER_STATUS_P,
 (SUM((CASE WHEN HOMEOWNER_STATUS = 'R'
 then 1
 else 0 end)) / count(*)) * 100 HOMEOWNER_STATUS_R,
 (SUM((CASE WHEN HOMEOWNER_STATUS = 'U'
 then 1
 else 0 end)) / count(*)) * 100 HOMEOWNER_STATUS_U,
 (SUM((CASE WHEN MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7')
 then 1
 else 0 end)) / count(*)) * 100 GEN2_T
FROM MOPS.DD_1409
 LEFT JOIN MOPS.INDEX_GEN2 on INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
 WHERE STATE = 'CA'
 AND ONE_PER_ADDRESS ='Y'
GROUP BY DD_1409.ZIP_CODE; 

I've tagged both Oracle and PostgreSQL because the data currently resides in Oracle but may be migrated to a PostgreSQL instance in the next month.

asked May 4, 2018 at 18:52
5
  • 5
    Optimizing expressions in the SELECT list won't noticeably improve query performance (unless you have subselects there, which you obviously don't). Look elsewhere; start with studying the query execution plan. Commented May 4, 2018 at 19:08
  • 1
    Agree with @mustaccio. These CASE statements are not what is causing the issue. Query execution plan is a good place to start. Are the tables indexed? Can the indexes be altered or improved? Specifically look for indexes or missing indexes on STATE, INDIVIDUAL_ID_NUMBER, MOPS.INDEX_GEN2.IID, ONE_PER_ADDRESS, DD_1409.ZIP_CODE. Commented May 4, 2018 at 20:31
  • 1
    Instead of count(*), can't you use an index column? Depending on some other factors, that might improve the performance. Commented May 4, 2018 at 22:36
  • 1
    Please show us your execution plan(s), table definitions and which indexes you have. Commented May 4, 2018 at 23:26
  • 1
    What percentage of your total data has state California and "one per address"? How many rows of data is your query supposed to be processing? If it's always millions of rows, there isn't going to be much you can do to make it faster, short of investigating things like materialized views. Commented May 4, 2018 at 23:28

2 Answers

4

In PostgreSQL (9.4 and later), you can use the FILTER clause as follows (untested):

SELECT
 DD_1409.ZIP_CODE,
 count(*) RECORDS,
 -- 100.0 forces numeric division; integer / integer truncates in PostgreSQL
 100.0 * count(*) filter (where GENDER = 'F') / count(*) GENDER_F,
 100.0 * count(*) filter (where PRESENCE_OF_ELDERLY_PARENT = 'Y') / count(*) PRESENCE_ELDERLY_PARENT_Y,
 -- PostgreSQL has no built-in MEDIAN aggregate
 percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE),
 100.0 * count(*) filter (where HOMEOWNER_STATUS = 'P') / count(*) HOMEOWNER_STATUS_P,
 100.0 * count(*) filter (where HOMEOWNER_STATUS = 'R') / count(*) HOMEOWNER_STATUS_R,
 100.0 * count(*) filter (where HOMEOWNER_STATUS = 'U') / count(*) HOMEOWNER_STATUS_U,
 100.0 * count(*) filter (where MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7')) / count(*) GEN2_T
FROM MOPS.DD_1409
 LEFT JOIN MOPS.INDEX_GEN2 on INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
 WHERE STATE = 'CA'
 AND ONE_PER_ADDRESS ='Y'
GROUP BY DD_1409.ZIP_CODE;

But as others have mentioned, optimising the expressions in the SELECT list isn't going to help you; the PostgreSQL syntax is merely cleaner.

You'll need to show us your execution plan(s) and table definitions for us to be able to help you further.
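
For reference, here is a small sketch of how the plan can be requested on each database. The `explain_statements` helper is a hypothetical name; the statements it returns are the standard ones (`EXPLAIN PLAN FOR` plus `DBMS_XPLAN.DISPLAY` in Oracle, `EXPLAIN (ANALYZE, BUFFERS)` in PostgreSQL).

```python
def explain_statements(query, dialect):
    """Return the SQL statements that display the execution plan for `query`."""
    query = query.strip().rstrip(";")
    if dialect == "oracle":
        # The first statement stores the plan; the second one prints it.
        return [
            f"EXPLAIN PLAN FOR {query}",
            "SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)",
        ]
    if dialect == "postgresql":
        # ANALYZE really executes the query, so expect it to run as long
        # as the query itself.
        return [f"EXPLAIN (ANALYZE, BUFFERS) {query}"]
    raise ValueError(f"unsupported dialect: {dialect}")


for statement in explain_statements("SELECT count(*) FROM MOPS.DD_1409;", "oracle"):
    print(statement)
```

Posting that output together with the table and index definitions is usually enough for others to see where the time goes.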

answered May 4, 2018 at 23:20
2

In PostgreSQL, at least, you can write those CASE expressions as casts from boolean to int. (I'm not sure how boolean expressions work in Oracle.) But that will probably not speed up the query much.

The slow part is probably the join, not the aggregation of the results.

 SELECT
 DD_1409.ZIP_CODE,
 count(*) RECORDS,
 SUM(cast(GENDER = 'F' as int)) *100.0 / count(*) as GENDER_F,
 SUM(cast(PRESENCE_OF_ELDERLY_PARENT = 'Y' as int)) *100.0 / count(*)
 as PRESENCE_ELDERLY_PARENT_Y,
 percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE), -- PostgreSQL has no built-in MEDIAN aggregate
 SUM(cast(HOMEOWNER_STATUS = 'P' as int)) *100.0 / count(*)
 as HOMEOWNER_STATUS_P,
 SUM(cast(HOMEOWNER_STATUS = 'R' as int)) *100.0 / count(*)
 as HOMEOWNER_STATUS_R,
 SUM(cast(HOMEOWNER_STATUS = 'U' as int)) *100.0 / count(*)
 as HOMEOWNER_STATUS_U,
 SUM(cast(MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7') as int)) *100.0 / count(*)
 as GEN2_T
FROM MOPS.DD_1409
 LEFT JOIN MOPS.INDEX_GEN2 on INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
 WHERE STATE = 'CA'
 AND ONE_PER_ADDRESS ='Y'
GROUP BY DD_1409.ZIP_CODE; 
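
The `*100.0` in the query above matters once the query moves to PostgreSQL: dividing one integer count by another truncates there, whereas Oracle's NUMBER division does not. The sketch below illustrates the difference using SQLite (via Python's `sqlite3`) purely as a stand-in, since it truncates integer division the same way; the `people` table and its rows are made up for the illustration.

```python
import sqlite3

# In-memory stand-in table: one 'F' out of four rows in a single zip code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (zip_code TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("90001", "F"), ("90001", "M"), ("90001", "M"), ("90001", "M")],
)

truncated, exact = conn.execute(
    """
    SELECT
      -- integer / integer is truncated to 0 before the multiplication
      SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) / COUNT(*) * 100,
      -- multiplying by 100.0 first makes the division fractional
      SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)
    FROM people
    GROUP BY zip_code
    """
).fetchone()

print(truncated, exact)
conn.close()
```

The first expression has the same shape as the original Oracle query and comes out as 0 here; the second yields the intended 25 percent.
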
answered May 4, 2018 at 21:42
