I have a rather long SQL query that relies on CASE expressions to give me the percentage of the population of a particular ZIP code that fits a particular parameter. In the code below, for example, it gives me the percentage of households with an elderly parent.
Now, the code below is actually created by a Python script that generates it from the list of selected factors and their possible values. One field has 35 possible values, which generates 35 different CASE expressions. In the example below, homeowner status breaks out into three CASE expressions, but it actually has 8 potential values...
The code runs. However, it takes a month of Sundays, and I'm wondering whether there is some way of optimizing it. My somewhat unhelpful SQL professor used to say, "Just let the query optimizer deal with it -- lots of smart people worked on that so you don't have to worry"...
SELECT
    DD_1409.ZIP_CODE,
    count(*) RECORDS,
    (SUM(CASE WHEN GENDER = 'F' THEN 1 ELSE 0 END) / count(*)) * 100 GENDER_F,
    (SUM(CASE WHEN PRESENCE_OF_ELDERLY_PARENT = 'Y' THEN 1 ELSE 0 END) / count(*)) * 100 PRESENCE_ELDERLY_PARENT_Y,
    MEDIAN(LENGTH_OF_RESIDENCE),
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'P' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_P,
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'R' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_R,
    (SUM(CASE WHEN HOMEOWNER_STATUS = 'U' THEN 1 ELSE 0 END) / count(*)) * 100 HOMEOWNER_STATUS_U,
    (SUM(CASE WHEN MOPS.KBMG_INDEX_GEN2.GEN2 IN ('T1','T2','T3','T4','T5','T6','T7') THEN 1 ELSE 0 END) / count(*)) * 100 GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;
I've tagged both Oracle and PostgreSQL, as the query currently resides in Oracle but may be migrated to a PostgreSQL instance in the next month.
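For context, the generator is roughly along these lines. This is a simplified sketch, not the actual script; the function names and the factor list are illustrative:

```python
# Simplified sketch of the generator: for each (column, values) factor it
# emits one CASE-based percentage expression per value.
def case_expr(column, value):
    return (
        f"(SUM(CASE WHEN {column} = '{value}' THEN 1 ELSE 0 END) "
        f"/ count(*)) * 100 {column}_{value}"
    )

def build_select(factors):
    cols = ["DD_1409.ZIP_CODE", "count(*) RECORDS"]
    for column, values in factors:
        cols.extend(case_expr(column, v) for v in values)
    return "SELECT\n  " + ",\n  ".join(cols)

# A factor with 8 values would emit 8 expressions; 35 values, 35 expressions.
print(build_select([("HOMEOWNER_STATUS", ["P", "R", "U"])]))
```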
- Optimizing expressions in the SELECT list won't noticeably improve query performance (unless you have subselects there, which you obviously don't). Look elsewhere; start with studying the query execution plan. (mustaccio, May 4, 2018)
- Agree with @mustaccio. These CASE statements are not what is causing the issue. The query execution plan is a good place to start. Are the tables indexed? Can the indexes be altered or improved? Specifically look for indexes, or missing indexes, on STATE, INDIVIDUAL_ID_NUMBER, MOPS.INDEX_GEN2.IID, ONE_PER_ADDRESS, and DD_1409.ZIP_CODE. (Shooter McGavin, May 4, 2018)
- Instead of count(*), can't you use an indexed column? Depending on some other factors, that might improve performance. (Bertrand Leroy, May 4, 2018)
- Please show us your execution plan(s), table definitions, and which indexes you have. (Colin 't Hart, May 4, 2018)
- What percentage of your total data has state California and "one per address"? How many rows is your query supposed to be processing? If it's always millions of rows, there's not really much way to make it faster, save to investigate things like materialized views. (Colin 't Hart, May 4, 2018)
2 Answers
In PostgreSQL, you can use the FILTER clause as follows (untested):
SELECT
    DD_1409.ZIP_CODE,
    count(*) AS RECORDS,
    -- multiply by 100.0 before dividing: count/count alone is integer division in PostgreSQL
    count(*) filter (where GENDER = 'F') * 100.0 / count(*) AS GENDER_F,
    count(*) filter (where PRESENCE_OF_ELDERLY_PARENT = 'Y') * 100.0 / count(*) AS PRESENCE_ELDERLY_PARENT_Y,
    -- PostgreSQL has no MEDIAN(); use the ordered-set aggregate instead
    percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE) AS MEDIAN_LENGTH_OF_RESIDENCE,
    count(*) filter (where HOMEOWNER_STATUS = 'P') * 100.0 / count(*) AS HOMEOWNER_STATUS_P,
    count(*) filter (where HOMEOWNER_STATUS = 'R') * 100.0 / count(*) AS HOMEOWNER_STATUS_R,
    count(*) filter (where HOMEOWNER_STATUS = 'U') * 100.0 / count(*) AS HOMEOWNER_STATUS_U,
    count(*) filter (where MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7')) * 100.0 / count(*) AS GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;
But as others have mentioned, optimising the expressions isn't going to help you; the PostgreSQL syntax is merely cleaner.
You'll need to show us your execution plan(s) and table definitions for us to be able to help you further.
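Since the query is machine-generated, switching the generator's template to the FILTER form is a small change. A hedged sketch (illustrative function name, not the asker's actual script), with the 100.0 factor placed before the division to avoid integer truncation:

```python
# Sketch: emit the FILTER form instead of a CASE expression.
# Multiplying by 100.0 before dividing keeps the arithmetic numeric,
# since count(*)/count(*) alone would truncate in PostgreSQL.
def filter_expr(column, value):
    return (
        f"count(*) filter (where {column} = '{value}') "
        f"* 100.0 / count(*) AS {column}_{value}"
    )

print(filter_expr("HOMEOWNER_STATUS", "P"))
```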
In PostgreSQL, at least, you can write those CASE expressions as casts from boolean to int (I'm not sure how boolean expressions work in Oracle), but that will probably not speed up the query much.
The slow part is probably the join, not the aggregation of the results.
SELECT
    DD_1409.ZIP_CODE,
    count(*) AS RECORDS,
    SUM(cast(GENDER = 'F' as int)) * 100.0 / count(*) AS GENDER_F,
    SUM(cast(PRESENCE_OF_ELDERLY_PARENT = 'Y' as int)) * 100.0 / count(*) AS PRESENCE_ELDERLY_PARENT_Y,
    -- PostgreSQL has no MEDIAN(); use the ordered-set aggregate instead
    percentile_cont(0.5) within group (order by LENGTH_OF_RESIDENCE) AS MEDIAN_LENGTH_OF_RESIDENCE,
    SUM(cast(HOMEOWNER_STATUS = 'P' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_P,
    SUM(cast(HOMEOWNER_STATUS = 'R' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_R,
    SUM(cast(HOMEOWNER_STATUS = 'U' as int)) * 100.0 / count(*) AS HOMEOWNER_STATUS_U,
    SUM(cast(MOPS.KBMG_INDEX_GEN2.GEN2 in ('T1','T2','T3','T4','T5','T6','T7') as int)) * 100.0 / count(*) AS GEN2_T
FROM MOPS.DD_1409
LEFT JOIN MOPS.INDEX_GEN2 ON INDIVIDUAL_ID_NUMBER = MOPS.INDEX_GEN2.IID
WHERE STATE = 'CA'
  AND ONE_PER_ADDRESS = 'Y'
GROUP BY DD_1409.ZIP_CODE;