What I want to achieve:
I have several tables of the following schema:
+------+-------+
| date | val |
+------+-------+
| DATE | INT64 |
+------+-------+
I want to create the following:
+--------+---------+----------+----------------+--------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+--------+---------+----------+----------------+--------------+
| STRING | INT64 | INT64 | INT64 | INT64 |
+--------+---------+----------+----------------+--------------+
where table
is the current table, val_cur
is the sum of values of the current timeframe (e.g. '2021-12-01' - '2021-12-09'), val_prev
the sum of values of the previous timeframe (e.g. '2021-11-01' - '2021-11-09'), val_last_month
the sum of values of last month (relative to the current timeframe) and val_expected
the sum of the average values per day of the current timeframe times the number of days left of the current month. I'm using the Standard SQL dialect of Google BigQuery.
What I'm trying to optimize
I would like to make the query less verbose and less error-prone to changes. Furthermore, it would be nice to make the query as efficient as possible concerning memory and execution speed.
My Solution
This is what I have - first I create some useful date variables:
DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
DECLARE PREV_START_DATE DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE PREV_END_DATE DATE DEFAULT DATE_SUB(END_DATE, INTERVAL 1 MONTH);
DECLARE FIRST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE LAST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY);
DECLARE LAST_DAY_CUR_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY);
DECLARE DAYS_PASSED INT64 DEFAULT DATE_DIFF(END_DATE, START_DATE, DAY)+1;
DECLARE DAYS_LEFT INT64 DEFAULT (EXTRACT(DAY FROM LAST_DAY_LAST_MONTH) - DAYS_PASSED) + 1;
SELECT DISTINCT START_DATE, END_DATE, PREV_START_DATE, PREV_END_DATE, DAYS_PASSED, DAYS_LEFT, FIRST_DAY_LAST_MONTH, LAST_DAY_LAST_MONTH, LAST_DAY_CUR_MONTH
FROM (SELECT 0);
Output:
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| Row | START_DATE | END_DATE | PREV_START_DATE | PREV_END_DATE | DAYS_PASSED | DAYS_LEFT | FIRST_DAY_LAST_MONTH | LAST_DAY_LAST_MONTH |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| 1 | 2021年12月01日 | 2021年12月09日 | 2021年11月01日 | 2021年11月09日 | 9 | 22 | 2021年11月01日 | 2021年11月30日 |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
Then I create two example tables and execute my code:
WITH table_1 AS
(SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date),
table_2 AS
(SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date)
(SELECT
'table_1' AS table,
(SELECT SUM(val)
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
(SELECT SUM(val)
FROM table_1
WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
(SELECT SUM(val)
FROM table_1
WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
(SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)
UNION ALL
(SELECT
'table_2' AS table,
(SELECT SUM(val)
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
(SELECT SUM(val)
FROM table_1
WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
(SELECT SUM(val)
FROM table_1
WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
(SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)
Output:
-- Replicating the example on your machine will change values because of RAND()
+---------+---------+----------+----------------+--------------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+---------+---------+----------+----------------+--------------------+
| table_1 | 64 | 50 | 168 | 107.55555555555557 |
| table_2 | 32 | 48 | 169 | 114.88888888888889 |
+---------+---------+----------+----------------+--------------------+
Thanks for any feedback!
1 Answer 1
To reduce verbosity and the many subqueries, consider conditional aggregation where you move WHERE
conditions into CASE
statements nested within aggregate functions.
Additionally, run your dates query as another CTE to CROSS JOIN
with main tables. Try also the BETWEEN
operator for range conditions.
DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
WITH dates AS (
SELECT START_DATE,
END_DATE,
DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS PREV_START_DATE,
DATE_SUB(END_DATE, INTERVAL 1 MONTH) AS PREV_END_DATE,
DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS FIRST_DAY_LAST_MONTH,
DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY) AS LAST_DAY_LAST_MONTH,
DATE_SUB(DATE_TRUNC(
DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH),
MONTH), INTERVAL 1 DAY) AS LAST_DAY_CUR_MONTH,
DATE_DIFF(END_DATE, START_DATE, DAY) + 1 AS DAYS_PASSED,
(EXTRACT(DAY FROM DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY)) -
DATE_DIFF(END_DATE, START_DATE, DAY) + 1) + 1 AS DAYS_LEFT
FROM (SELECT 0)
), table_1 AS (
SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date
), table_2 AS (
SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date
)
SELECT
'table_1' AS table,
SUM(CASE
WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
THEN t.val
END) AS val_cur,
SUM(CASE
WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
THEN t.val
END) AS val_prev,
SUM(CASE
WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
THEN t.val
END) AS val_last_month,
(SUM(CASE
WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
THEN t.val
END) /
d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected
FROM table_1 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT
UNION ALL
SELECT
'table_2' AS table,
SUM(CASE
WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
THEN t.val
END) AS val_cur,
SUM(CASE
WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
THEN t.val
END) AS val_prev,
SUM(CASE
WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
THEN t.val
END) AS val_last_month,
(SUM(CASE
WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
THEN t.val
END) /
d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected
FROM table_2 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT
ORDER BY 1
Output
Row | table | val_cur | val_prev | val_last_month | val_expected |
---|---|---|---|---|---|
1 | table_1 | 35 | 38 | 143 | 93.333333333333329 |
2 | table_2 | 64 | 62 | 165 | 170.66666666666666 |
FROM table_2
in subqueries of secondSELECT
inUNION
query? \$\endgroup\$