
The following script changes the start and end dates of several queries saved as files in the folder /app/queries. It is used by running the script with two date strings as arguments, for example:

./config_query.sh 2022年03月28日 2022年03月30日

This replaces all start dates of all queries in the folder /app/queries with the first argument and all end dates with the second argument.

I would like some feedback on the readability of the script. The sed commands in particular are very hard to read, which makes the script hard to maintain and extend. Is there a better way to change the start and end dates of the queries?

This is the script:

#!/usr/bin/env bash
#: Title : config_query.sh
#: Date : 2022年04月06日
#: Version : 1.0
#: Description : Change start and end dates of queries to values of parameters 1ドル and 2ドル.
#: Options : None
# Note: The script expects both date strings in the format YYYY-mm-dd.
# Date format validation is performed by the script.
# Date value validation is not performed, i.e. valid dates have to be passed to generate consistent queries.
ROOT="/app"
QUERIES="${ROOT}/queries"
arr=( "${QUERIES}/leads.flux" "${QUERIES}/subs.flux" "${QUERIES}/gads.sql" "${QUERIES}/ms.sql" "${QUERIES}/li.sql" )
# Check number of arguments passed to script:
if [ "$#" -ne 2 ]; then
 echo "Illegal number of arguments. Program expects two date strings (YYYY-mm-dd)."
 exit $?
fi
START_DATE=1ドル
END_DATE=2ドル
# Check if start and end are formatted correctly:
for date in $START_DATE $END_DATE
do
 format="^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
 if ! [[ $date =~ $format ]]
 then
 echo "Date $date is incorrectly formatted (not YYYY-mm-dd)."
 exit $?
 fi
done
START_DATE_DECR=$(date '+%Y-%m-%d' -d "$START_DATE - 1 day")
START_DATE_INCR=$(date '+%Y-%m-%d' -d "$START_DATE + 1 day")
END_DATE_INCR=$(date '+%Y-%m-%d' -d "$END_DATE + 1 day")
# Replaces 'start' and 'stop' in Influx leads query:
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}/start: ${START_DATE}T00:00:00.000Z, stop: ${END_DATE}/g" "${arr[0]}"
# Replaces 'start' and 'stop' in Influx subs query:
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}/start: ${START_DATE_DECR}T00:00:00.000Z, stop: ${END_DATE_INCR}/g" "${arr[1]}"
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}.*delete.first.row/start: ${START_DATE_INCR}T00:00:00.000Z, stop: ${END_DATE_INCR}T00:00:00.000Z) \/\/ delete first row/g" "${arr[1]}"
# Replaces 'START_DATE' and 'END_DATE' in Bigquery queries:
for ((i=2; i<${#arr[@]}; i++))
do
 sed -i -E "s/\('.{4}-.{2}-.{2}', DATE_SUB\('.{4}-.{2}-.{2}',/\('$START_DATE', DATE_SUB\('$END_DATE',/g" "${arr[$i]}";
 sed -i -E "s/>= '.{4}-.{2}-.{2}'/>= '$START_DATE'/g" "${arr[$i]}";
 sed -i -E "s/< '.{4}-.{2}-.{2}'/< '$END_DATE'/g" "${arr[$i]}";
done

These are the queries, for reference (I'm not seeking feedback on the queries; I just included them for completeness):

/app/queries/gads.sql:

WITH dates AS (
 SELECT
 *
 FROM
 UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), gads_data AS (
 SELECT DATE(__hevo_report_date) AS date,
 campaign,
 ad_group,
 sum(impressions) AS impressions,
 sum(clicks) AS clicks,
 ROUND(sum(cost)/1000000, 2) AS spend
 FROM `table1`
 WHERE DATE(__hevo_report_date) >= '2022-03-11' AND DATE(__hevo_report_date) < '2022-03-15'
 GROUP BY date, campaign, ad_group
)
SELECT dates.date, campaign, ad_group, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN gads_data
ON dates.date = gads_data.date
ORDER BY dates.date

/app/queries/li.sql:

WITH dates AS (
 SELECT date
 FROM
 UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), gads_data AS (
 SELECT DATE(TIMESTAMP_MILLIS(date)) AS date,
 campaign_id,
 sum(impressions) AS impressions,
 sum(clicks) AS clicks,
 sum(cost_in_local_currency) AS spend
 FROM `table2`
 WHERE DATE(TIMESTAMP_MILLIS(date)) >= '2022-03-11' AND DATE(TIMESTAMP_MILLIS(date)) < '2022-03-15'
 GROUP BY date, campaign_id
), names AS (
 SELECT id, name AS campaign
 FROM `ten-x-studio-data.threema.li_campaign`
)
SELECT dates.date, campaign, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN gads_data
ON dates.date = gads_data.date
LEFT OUTER JOIN names
ON gads_data.campaign_id = names.id
ORDER BY dates.date

/app/queries/ms.sql:

WITH dates AS (
 SELECT
 *
 FROM
 UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), ms_data AS (
 SELECT DATE(date) AS date,
 campaign_name AS campaign,
 ad_group_name AS ad_group,
 sum(impressions) AS impressions,
 sum(clicks) AS clicks,
 sum(spend) AS spend
 FROM `table3`
 WHERE DATE(date) >= '2022-03-11' AND DATE(date) < '2022-03-15'
 GROUP BY date, campaign_name, ad_group_name
)
SELECT dates.date, campaign, ad_group, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN ms_data
ON dates.date = ms_data.date
ORDER BY dates.date

/app/queries/leads.flux:

from(bucket: "bucket1")
 |> range(start: 2022年03月11日T00:00:00.000Z, stop: 2022年03月15日T00:00:00.000Z)
 |> filter(fn: (r) => r["env"] == "productive")
 |> duplicate(as: "val", column: "_value")
 |> difference(columns: ["val"], nonNegative: true, keepFirst: true)
 |> fill(column: "val", value: 0.0)
 |> group(columns: ["_field", "path"])
 |> aggregateWindow(every: 1d, fn: sum, column: "val")
 |> yield(name: "mean")

app/queries/subs.flux:

// Get added values
a = from(bucket: "bucket2")
 |> range(start: 2022年03月10日T00:00:00.000Z, stop: 2022年03月16日T00:00:00.000Z)
 |> filter(fn: (r) => r["env"] == "productive")
 |> duplicate(column: "_value", as: "added_values")
 |> difference(columns: ["added_values"], nonNegative: true, keepFirst: true)
 |> fill(column: "added_values", value: 0.0)
 |> aggregateWindow(every: 1d, fn: sum, column: "added_values")
 |> map(fn:(r) => ({r with _time: string(v: r._time)}))
 |> drop(columns:["_measurement", "_start", "_stop", "env"])
// Get total values
b = from(bucket: "bucket2")
 |> range(start: 2022年03月10日T00:00:00.000Z, stop: 2022年03月16日T00:00:00.000Z)
 |> filter(fn: (r) => r["env"] == "productive")
 |> aggregateWindow(every: 1d, fn: last, column: "_value")
 |> map(fn:(r) => ({r with _time: string(v: r._time)}))
 |> drop(columns:["_measurement", "_start", "_stop", "env"])
// Get subtracted values
c = from(bucket: "bucket2")
 |> range(start: 2022年03月10日T00:00:00.000Z, stop: 2022年03月16日T00:00:00.000Z)
 |> filter(fn: (r) => r["env"] == "productive")
 |> duplicate(column: "_value", as: "sub_values")
 |> difference(columns: ["sub_values"], nonNegative: false, keepFirst: true)
 |> fill(column: "sub_values", value: 0.0)
 |> map(fn: (r) => ({r with sub_values: if r.sub_values < 0.0 then r.sub_values else 0.0}))
 |> aggregateWindow(every: 1d, fn: sum, column: "sub_values")
 |> map(fn:(r) => ({r with _time: string(v: r._time)}))
 |> drop(columns:["_measurement", "_start", "_stop", "env"])
temp = join(tables: {a:a, b:b},
 on: ["_field", "_time", "isTrial", "offer", "segment"]
)
res = join(tables: {temp:temp, c:c},
 on: ["_field", "_time", "isTrial", "offer", "segment"]
)
 |> map(fn: (r) => ({ r with _time: time(v: r._time) }))
 |> range(start: 2022年03月12日T00:00:00.000Z, stop: 2022年03月16日T00:00:00.000Z) // delete first row
 |> group()
 |> sort(columns: ["_time", "_field", "isTrial", "offer", "segment"], desc: false)
 |> yield()
asked Apr 6, 2022 at 14:37

2 Answers


Generating files from template files

Replacing dates in files correctly can be tricky. It requires carefully crafting the pattern so that:

  • It correctly matches the values to replace.
  • It does not match anything else unintended.

I think a more robust alternative would be to define template files with clear placeholders that should be replaced with computed values. For example you could have placeholders in the form of {{NAME}}. This would make the pattern replacement simpler, generic (usable not only with date strings but with anything), and also reduce the risk of unintended matches.
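For example (a minimal sketch; the render_template helper, the .tpl naming, and the exact placeholder names are illustrative assumptions, not from the original script):

```shell
# Sketch only: render a template file by substituting {{NAME}} placeholders.
# The helper name and placeholder names are illustrative, not from the post.
render_template() {
 local template=1ドル start_date=2ドル end_date=3ドル
 sed -E \
 -e "s/\{\{START_DATE\}\}/${start_date}/g" \
 -e "s/\{\{END_DATE\}\}/${end_date}/g" \
 "$template"
}
```

Each query would then live as, say, leads.flux.tpl containing `{{START_DATE}}` tokens, and the script would render it with `render_template leads.flux.tpl "$START_DATE" "$END_DATE" > leads.flux`. Because the templates themselves are never modified, the generation is repeatable, unlike in-place sed edits, which only work while the previously written dates still match the pattern.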

Exiting with error

I'm wondering what you expected the exit code to be here:

if [ "$#" -ne 2 ]; then
 echo "Illegal number of arguments. Program expects two date strings (YYYY-mm-dd)."
 exit $?

$? stores the exit code of the last executed command, in this case the echo, which practically always succeeds. I suggest replacing this with exit 1 instead. Same thing at one more place in the code.
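To illustrate (a standalone snippet, not part of the script):

```shell
# $? always refers to the most recently executed command. Since echo
# practically always succeeds, any earlier failure status is already gone:
echo "Illegal number of arguments." >&2
status=$? # 0, because the echo succeeded
# `exit $?` at this point exits with 0, i.e. reports success despite the error.
```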

Always enclose variables in double-quotes used on the command line

This loop aims to validate the date variables:

for date in $START_DATE $END_DATE
do
 format="^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
 if ! [[ $date =~ $format ]]
 then
 echo "Date $date is incorrectly formatted (not YYYY-mm-dd)."
 exit $?
 fi
done

If one of $START_DATE and $END_DATE is empty, word splitting removes it from the list, so the loop body will not run for it and its validation will be silently skipped.

If both are empty, the loop body never runs at all and neither date is validated.
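The word-splitting behaviour is easy to demonstrate (illustrative snippet):

```shell
# An empty, unquoted variable disappears from the word list entirely,
# so the loop below runs only once, for the non-empty END_DATE.
START_DATE=""
END_DATE="not-a-date"
iterations=0
for date in $START_DATE $END_DATE; do
 iterations=$((iterations + 1))
done
# iterations ends up at 1, not 2: the empty START_DATE was never inspected.
```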

Use a validator function

Instead of looping over $START_DATE and $END_DATE, a validator function would be better:

validate_date() {
 # ...
}
validate_date "$START_DATE"
validate_date "$END_DATE"

Not only does this give a name to the validation logic, it would catch an invalid date even if you missed correctly double-quoting the variables, unlike when using a loop.
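The body could be a direct lift of the existing regex check (a sketch; returning non-zero rather than exiting is one possible design choice):

```shell
# Possible validate_date body; returns non-zero so the caller decides whether
# to exit. An unset or empty argument also fails the regex, which closes the
# empty-variable hole described above.
validate_date() {
 local date_format="^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
 if ! [[ 1ドル =~ $date_format ]]; then
 echo "Date '1ドル' is incorrectly formatted (not YYYY-mm-dd)." >&2
 return 1
 fi
}
```

The caller would then use `validate_date "$START_DATE" || exit 1`.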

Reduce repeated overwriting of files

The other answer already recommended using the -e flag to pass multiple expressions to the same sed command. Another advantage of doing that is that it reduces the repeated overwriting of the files. This might not make much difference in practice in this example, but it's good practice anyway.

And to make it really easy to read, I would format like this:

sed -i -E \
 -e "s/\('.{4}-.{2}-.{2}', DATE_SUB\('.{4}-.{2}-.{2}',/\('$START_DATE', DATE_SUB\('$END_DATE',/g" \
 -e "s/>= '.{4}-.{2}-.{2}'/>= '$START_DATE'/g" \
 -e "s/< '.{4}-.{2}-.{2}'/< '$END_DATE'/g" \
 "${arr[$i]}"

That is, each line does a single, easily understood thing, and the file being modified is easy to spot because it stands alone on the last line.

Use descriptive variable names

My main objection is about the first two elements of the array. ${arr[0]} and ${arr[1]} are not used in a loop, and I don't see a good reason to have these in the array. It would be better to give them descriptive names.

Also note that if arr contains only the names to use in the loop, then you can use a simpler classic iterating loop instead of a counting loop:

for name in "${arr[@]}"; do

And of course, it would be good to use a descriptive name instead of arr, for example sql_files or even just sqls.

Use brace expansion

Instead of repeating the ${QUERIES} prefix multiple times:

arr=( "${QUERIES}/leads.flux" "${QUERIES}/subs.flux" "${QUERIES}/gads.sql" "${QUERIES}/ms.sql" "${QUERIES}/li.sql" )

You could use brace expansion to reduce duplication:

arr=("${QUERIES}"/{leads.flux,subs.flux,gads.sql,ms.sql,li.sql})

Braces can also be nested, but be careful not to overdo it and hurt readability:

arr=("${QUERIES}"/{{leads,subs}.flux,{gads,ms,li}.sql})
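If in doubt about what an expansion produces, printf makes it easy to check (illustrative):

```shell
QUERIES="/app/queries"
arr=("${QUERIES}"/{{leads,subs}.flux,{gads,ms,li}.sql})
printf '%s\n' "${arr[@]}"
# Prints, in order:
# /app/queries/leads.flux
# /app/queries/subs.flux
# /app/queries/gads.sql
# /app/queries/ms.sql
# /app/queries/li.sql
```

Note that brace expansion happens before parameter expansion, so the `"${QUERIES}"` prefix is duplicated onto every generated word. It is purely textual: no file needs to exist for the expansion to occur.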
Toby Speight
answered Aug 13, 2022 at 19:27

I have certainly seen worse shell scripts and calls to sed.

I'd split the sed scripts up with multiple -e flags instead of one long expression. This also lets you use line continuations more easily.

Ex:

sed -i -E -e "s/start: .{4}-.{2}-.{2}/start: ${START_DATE}/" \
 -e "s/stop: .{4}-.{2}-.{2}/stop: ${END_DATE}/" "${arr[0]}"

You also repeat the date regex often enough that it's worth factoring it out into its own variable. I'd take that format var on line 28, rename it to something like DATE_FORMAT, make it readonly, and use it in the sed patterns.

Ex:

readonly DATE_FORMAT="[0-9]{4}-[0-9]{2}-[0-9]{2}"
sed -i -E -e "s/start: ${DATE_FORMAT}/start: ${START_DATE}/" \
 -e "s/stop: ${DATE_FORMAT}/stop: ${END_DATE}/" "${arr[0]}"
Reinderien
answered Apr 6, 2022 at 20:45