1
\$\begingroup\$

What I want to achieve:

I have several tables of the following schema:

+------+-------+
| date | val |
+------+-------+
| DATE | INT64 |
+------+-------+

I want to create the following:

+--------+---------+----------+----------------+--------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+--------+---------+----------+----------------+--------------+
| STRING | INT64 | INT64 | INT64 | INT64 |
+--------+---------+----------+----------------+--------------+

where table is the current table, val_cur is the sum of values of the current timeframe (e.g. '2021-12-01' - '2021-12-09'), val_prev the sum of values of the previous timeframe (e.g. '2021-11-01' - '2021-11-09'), val_last_month the sum of values of last month (relative to the current timeframe) and val_expected the sum of the average values per day of the current timeframe times the number of days left of the current month. I'm using the Standard SQL dialect of Google BigQuery.

What I'm trying to optimize

I would like to make the query less verbose and less error-prone to changes. Furthermore, it would be nice to make the query as efficient as possible concerning memory and execution speed.

My Solution

This is what I have - first I create some useful date variables:

DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
DECLARE PREV_START_DATE DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE PREV_END_DATE DATE DEFAULT DATE_SUB(END_DATE, INTERVAL 1 MONTH);
DECLARE FIRST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE LAST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY);
DECLARE LAST_DAY_CUR_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY);
DECLARE DAYS_PASSED INT64 DEFAULT DATE_DIFF(END_DATE, START_DATE, DAY)+1;
DECLARE DAYS_LEFT INT64 DEFAULT (EXTRACT(DAY FROM LAST_DAY_LAST_MONTH) - DAYS_PASSED) + 1;
SELECT DISTINCT START_DATE, END_DATE, PREV_START_DATE, PREV_END_DATE, DAYS_PASSED, DAYS_LEFT, FIRST_DAY_LAST_MONTH, LAST_DAY_LAST_MONTH, LAST_DAY_CUR_MONTH
FROM (SELECT 0);

Output:

+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| Row | START_DATE | END_DATE | PREV_START_DATE | PREV_END_DATE | DAYS_PASSED | DAYS_LEFT | FIRST_DAY_LAST_MONTH | LAST_DAY_LAST_MONTH |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| 1 | 2021年12月01日 | 2021年12月09日 | 2021年11月01日 | 2021年11月09日 | 9 | 22 | 2021年11月01日 | 2021年11月30日 |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+

Then I create two example tables and execute my code:

WITH table_1 AS
 (SELECT date, CAST(RAND()*10 AS INT64) AS val
 FROM UNNEST(
 GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
 ) AS date),
 table_2 AS
 (SELECT date, CAST(RAND()*10 AS INT64) AS val
 FROM UNNEST(
 GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
 ) AS date)
(SELECT 
 'table_1' AS table,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
 (SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
 FROM table_1
 WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)
UNION ALL 
(SELECT
 'table_2' AS table,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
 (SELECT SUM(val)
 FROM table_1
 WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
 (SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
 FROM table_1
 WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)

Output:

-- Replicating the example on your machine will change values because of RAND()
+---------+---------+----------+----------------+--------------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+---------+---------+----------+----------------+--------------------+
| table_1 | 64 | 50 | 168 | 107.55555555555557 |
| table_2 | 32 | 48 | 169 | 114.88888888888889 |
+---------+---------+----------+----------------+--------------------+

Thanks for any feedback!

asked Feb 2, 2022 at 13:22
\$\endgroup\$
1
  • \$\begingroup\$ Did you mean FROM table_2 in subqueries of second SELECT in UNION query? \$\endgroup\$ Commented May 15, 2022 at 16:50

1 Answer 1

2
\$\begingroup\$

To reduce verbosity and the many subqueries, consider conditional aggregation where you move WHERE conditions into CASE statements nested within aggregate functions.

Additionally, run your dates query as another CTE to CROSS JOIN with main tables. Try also the BETWEEN operator for range conditions.

DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
WITH dates AS (
 SELECT START_DATE, 
 END_DATE,
 DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS PREV_START_DATE,
 DATE_SUB(END_DATE, INTERVAL 1 MONTH) AS PREV_END_DATE,
 DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS FIRST_DAY_LAST_MONTH,
 DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY) AS LAST_DAY_LAST_MONTH,
 DATE_SUB(DATE_TRUNC(
 DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), 
 MONTH), INTERVAL 1 DAY) AS LAST_DAY_CUR_MONTH,
 DATE_DIFF(END_DATE, START_DATE, DAY) + 1 AS DAYS_PASSED,
 (EXTRACT(DAY FROM DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY)) - 
 DATE_DIFF(END_DATE, START_DATE, DAY) + 1) + 1 AS DAYS_LEFT
 FROM (SELECT 0)
), table_1 AS (
 SELECT date, CAST(RAND()*10 AS INT64) AS val
 FROM UNNEST(
 GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
 ) AS date
), table_2 AS (
 SELECT date, CAST(RAND()*10 AS INT64) AS val
 FROM UNNEST(
 GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
 ) AS date
)
SELECT 
 'table_1' AS table,
 SUM(CASE 
 WHEN t.date BETWEEN d.START_DATE AND d.END_DATE 
 THEN t.val
 END) AS val_cur,
 SUM(CASE
 WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
 THEN t.val
 END) AS val_prev,
 SUM(CASE
 WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
 THEN t.val
 END) AS val_last_month,
 (SUM(CASE
 WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
 THEN t.val
 END) / 
 d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected
FROM table_1 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT
UNION ALL
SELECT 
 'table_2' AS table,
 SUM(CASE 
 WHEN t.date BETWEEN d.START_DATE AND d.END_DATE 
 THEN t.val
 END) AS val_cur,
 SUM(CASE
 WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
 THEN t.val
 END) AS val_prev,
 SUM(CASE
 WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
 THEN t.val
 END) AS val_last_month,
 (SUM(CASE
 WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
 THEN t.val
 END) / 
 d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected
FROM table_2 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT
ORDER BY 1

Output

Row table val_cur val_prev val_last_month val_expected
1 table_1 35 38 143 93.333333333333329
2 table_2 64 62 165 170.66666666666666
answered May 15, 2022 at 17:15
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.