I have data stored in different monthly databases that have the exact same name schema for the tables and columns. Each monthly database has ~10 million rows. I'm looking for a more efficient way to select the rows I want from each database based on the same variable. There are only 5 to 10 rows from each month for that variable.
I also need to JOIN this different monthly data to the same separate database. I am currently using UNION to combine the data. So:
SELECT a.col1, a.col2, a.col3, b.colX, b.colY, b.colZ
FROM db_JAN.tab1 a
INNER JOIN db_OTHER.tabX b
ON (a.DATE = b.DATE AND a.someID = b.someID)
WHERE a.col1='variable1'
UNION
SELECT a.col1, a.col2, a.col3, b.colX, b.colY, b.colZ
FROM db_FEB.tab1 a
INNER JOIN db_OTHER.tabX b
ON (a.DATE = b.DATE AND a.someID = b.someID)
WHERE a.col1='variable1'
UNION
SELECT a.col1, a.col2, a.col3, b.colX, b.colY, b.colZ
FROM db_MARCH.tab1 a
INNER JOIN db_OTHER.tabX b
ON (a.DATE = b.DATE AND a.someID = b.someID)
WHERE a.col1='variable1'
.
.
.
If the SELECT, JOIN and WHERE clauses are the same across different databases with the same naming schema, is there a more efficient way of writing this query? Something like:
SELECT a.col1, a.col2, a.col3, b.colX, b.colY, b.colZ
FROM (db_JAN,db_FEB,db_MARCH...).tab1 a
INNER JOIN db_OTHER.tabX b
ON (a.DATE = b.DATE AND a.someID = b.someID)
WHERE a.col1='variable1'
This question might be applicable but I'm not sure as I'm relatively new to SQL and am not sure about sequential table scans or indexing:
"How to make a union view execute more efficiently?"
I don't believe it is a hierarchical query and this question doesn't help me either.
NOTE: I am using SAS to run the query connecting to Teradata SQL database and have used a MACRO VARIABLE for 'variable1' in my PROC SQL statement. I am not sure if I can use a MACRO VARIABLE for the databases but that could be an option if I can't do it in SQL itself.
2 Answers 2
Teradata supports predicates pushdown
select a.col1, a.col2, a.col3, b.colx, b.coly, b.colz
from ( select col1, col2, col3, date, someid from db_jan.tab1
union all select col1, col2, col3, date, someid from db_feb.tab1
union all ...
) a
join db_other.tabx b
on a.date = b.date
and a.someid = b.someid
where a.col1 = 'variable1'
-
It's actually shorter, only one
WHERE
. Mine repeats for each one of the 12 tablesypercubeᵀᴹ– ypercubeᵀᴹ2016年12月30日 19:57:58 +00:00Commented Dec 30, 2016 at 19:57 -
yep. sorry. just saw the
WHERE
at the end.xxCDxx– xxCDxx2016年12月30日 19:59:28 +00:00Commented Dec 30, 2016 at 19:59 -
Ran my original SQL, Dudu and TypoCube's multiple times with
UNION
&UNION ALL
. Times from SAS log and queries ran on server. Times for same query varied depending on server activity. I know it was a very amateur test but was good enough for me: - My original query was consistently slower for all. - Using theWHERE
clause inside the derived table didn't make noticeable difference so I left it outside for a shorter query per Dudu. -UNION ALL
was noticeable faster on all tests thanUNION
. Unfortunately I found duplicates that need to be deleted so have to useUNION
.xxCDxx– xxCDxx2016年12月30日 22:28:09 +00:00Commented Dec 30, 2016 at 22:28
There are various ways to write the query in shorter form. I'm not sure if they could improve performance (or make it worse!)
The best lines of attack if the query isn't efficient is to change it into UNION ALL
(provided that the results are equivalent) and examine/optimize the parts separately, make sure indexes are used, etc.
One way to rewrite is to move the UNION
or UNION ALL
inside a derived table and then join:
SELECT a.col1, a.col2, a.col3, b.colX, b.colY, b.colZ
FROM
( SELECT col1, col2, col3, DATE, someID
FROM db_JAN.tab1 WHERE col1 = 'variable1'
UNION ALL
SELECT col1, col2, col3, DATE, someID
FROM db_FEB.tab1 WHERE col1 = 'variable1'
UNION ALL
---
) a
INNER JOIN db_OTHER.tabX b
ON (a.DATE = b.DATE AND a.someID = b.someID) ;
-
Thanks @TypoCube. I'll amend and see if it improves or hinders performance. At least using the derived table will make the query a lot shorter since the initial
SELECT
is 60 lines of code which is what I wanted to avoid repeating.xxCDxx– xxCDxx2016年12月30日 19:54:53 +00:00Commented Dec 30, 2016 at 19:54 -
My advice is still to test. If the original query with UNION ALL preforms well, I don texpect my version to perform better. It might perform a bit better, it might perform worse. test them (and Dudu's query as well, it's a bit shorter than mine)ypercubeᵀᴹ– ypercubeᵀᴹ2016年12月30日 19:57:05 +00:00Commented Dec 30, 2016 at 19:57
-
see above for test results/comments.xxCDxx– xxCDxx2016年12月30日 22:11:55 +00:00Commented Dec 30, 2016 at 22:11
UNION
orUNION ALL
would be ok? (it's usually more efficient as it doesn't try to remove any duplicates)