I’ve written this bash backup script. It uses the --link-dest option of rsync; that way, the user has access to the backed-up data at any backup time, with relatively affordable data overhead. Any duplicated data should be hard linked; the overhead mostly comes from the directory structure.
It’s mostly based on this very nice guide by Mike Rubel and various other contributors, as well as a few answers from Unix SE and other web references.
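To make the hard-link claim concrete, here is a minimal sketch, assuming rsync is installed and using a throwaway directory from mktemp (all paths below are hypothetical, not the ones from the script). It takes two snapshots of the same unchanged source and shows that the second one shares inodes with the first.

```bash
#!/bin/bash
# Hypothetical scratch area; nothing here touches real backups
work=$(mktemp -d)
mkdir "${work}/src"
echo "unchanged content" > "${work}/src/file.txt"

# First snapshot: a plain copy
rsync --archive "${work}/src/" "${work}/snap1/"

# Second snapshot: files identical to those in snap1 are hard linked, not copied
rsync --archive --link-dest="${work}/snap1" "${work}/src/" "${work}/snap2/"

# Both paths should report the same inode number
ls -i "${work}/snap1/file.txt" "${work}/snap2/file.txt"
```

Since both names point at the same inode, the second snapshot only costs the space of its directory entries.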
The script is meant to be run at regular (typically, hourly) intervals with cron, and other scripts are in charge of safely keeping daily/weekly backups.
Of course, I want to minimize the size of the backups. I also want to delete older backups before newer ones. To do so, I build ${backups}, an array of every backup¹ sorted by modification date from newest to oldest. Thus ${backups[0]} (if it exists) is the latest complete backup, and ${backups[@]:$n} for some integer $n lists every backup but the $n newest (those at indices 0 to $n - 1).
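The slicing described above can be sketched as follows. The array contents are made-up placeholders; only the ${array[@]:offset} expansion itself comes from the script.

```bash
#!/bin/bash
# Placeholder names, ordered from newest to oldest
backups=("newest" "second" "third" "oldest")
n=2

# The $n newest entries (indices 0 to n - 1)
kept=("${backups[@]:0:n}")
# Everything after the $n newest
dropped=("${backups[@]:n}")

printf 'kept: %s\n' "${kept[@]}"
printf 'dropped: %s\n' "${dropped[@]}"
```

Quoting the expansion keeps each element intact even if a backup name contains spaces.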
As always with bash scripts, I’m especially afraid of quoting issues, but any remark is welcome.
In particular, I quite dislike how I use find with both -mindepth and -maxdepth, but I couldn’t find any way around it.
Most of the "standard" commands, such as cut, sort or grep, are provided by BusyBox 1.16.1 and may not have every option available on most recent Linux distributions. cut, in particular, does not understand the -z option, hence the ugly tr trick.
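For readers unfamiliar with the tr trick, here is a small sketch of the idea: swap newlines and NUL bytes so that a line-oriented cut can process NUL-terminated records, then swap them back. The records are made-up values in the same "timestamp:path" shape as the script's find output, and emit_records is a hypothetical helper. This sketch uses -f2- rather than -f2 so that a path containing a colon is not truncated.

```bash
#!/bin/bash
# Hypothetical helper emitting two NUL-terminated "timestamp:path" records
emit_records() {
    printf '%s\0' '1462267740:/backups/hourly b' '1462267560:/backups/hourly a'
}

paths=()
while IFS= read -r -d '' path; do
    paths+=("${path}")
done < <(
    emit_records |
    tr '\n\0' '\0\n' |   # NUL-terminated records become lines
    cut -d: -f2- |       # line-based cut drops the timestamp field
    tr '\n\0' '\0\n'     # lines become NUL-terminated records again
)

printf '%s\n' "${paths[@]}"
```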
#!/bin/bash
# Check if we are root (no one else should run this)
# ==================================================
if (( $(id -u) != 0 )); then
    echo "/ ! \ Only root can run this script. Backup cancelled." >&2
    exit 3
fi
# Functions
# =========
# Given a date (or a placeholder), returns the corresponding hourly backup name
function scheme {
    local token="$@"
    echo "hourly ${token}"
}
# Parameters
# ==========
# password_file=/etc/backup/passwd # Network yet untested
backup_directory=/path/to/backups # ABSOLUTE PATH required
source_directory=/path/to/data
backup_count=24 # number of backups to keep
# Name new daily backup
# =====================
new_backup="${backup_directory}"/$(scheme $(date +"%-d-%m-%Y a %Hh%M") )
# Check that we can run
# =====================
# Check that we don’t overwrite anything
if [[ -e "${new_backup}" ]]; then
    echo "/ ! \ ${new_backup} already exists! We don’t want to overwrite it; backup cancelled." >&2
    exit 4
fi
# Create the directory which contains all the backups if it doesn’t exist yet
if [[ ! -e "${backup_directory}" ]]; then
    echo "Creating directory ${backup_directory}"
    mkdir -p "${backup_directory}"
elif [[ ! -d "${backup_directory}" ]]; then
    echo "/ ! \ Destination ${backup_directory} already exists but is not a directory! Backup cancelled." >&2
    exit 4
fi
# Create a temporary working directory
# ====================================
temp_backup=$(mktemp -d -p "${backup_directory}")
# Manage previous backups
# =======================
# List every previous backup and put it into an array
backups=()
while read -r -d ''; do
    backups+=("${REPLY}")
done < <( find "${backup_directory}" -mindepth 1 -maxdepth 1 -name "$(scheme \*)" -printf "%A@:%p\0" | \
    sort -z -t: -n -r | \
    tr '\n\0' '\0\n' | cut -d: -f2 - | tr '\n\0' '\0\n' \
)
# If it exists, select the latest backup as a reference for rsync --link-dest
if (( ${#backups[@]} > 0 )); then
    latest_backup="${backups[0]}"
else
    latest_backup=""
fi
# Compute the backups to remove
# We add one backup before cleaning up
# Thus, we keep $backup_count - 1 from the ${backups[@]}
old_backups=("${backups[@]:${backup_count} - 1}")
# Cleanup function
# ================
# We now have everything we need to define a cleanup function
# It will be called only if the backup succeeds
function cleanup {
    echo
    echo "Cleaning up"
    echo "==========="
    echo
    if (( ${#old_backups[@]} > 0 )); then
        echo "Deleting ${#old_backups[@]} backup(s)!"
        echo
        # echo rm -rf "${old_backups[@]}"
        (set -x; rm -rf "${old_backups[@]}")
    else
        echo "There is nothing to delete."
    fi
}
# User feedback
# =============
echo "Backing up ${source_directory}"
echo "Backing up ${source_directory}" | sed "s/./=/g"
echo
echo "New backup: ${new_backup}"
# Setting up rsync options
# ========================
RSYNC_FLAGS=("--archive" "--stats")
# Set rsync --password-file if the matching variable is defined and
# we are using rsync (::) **YET UNTESTED**
if [[ "${password_file}" != "" && "${source_directory}" =~ "::" ]]; then
    RSYNC_FLAGS+=("--password-file=${password_file}")
fi
# Use rsync to backup. If a previous backup exists,
# uses --link-dest to hard link to it.
if [[ "${latest_backup}" != "" ]]; then
    echo "Previous backup: ${latest_backup}"
    RSYNC_FLAGS+=("--link-dest=${latest_backup}")
else
    echo "This is the first backup ever, it might take a while."
fi
echo
# Backing-up
# ==========
# TODO Check if something was actually written before creating a new backup
# TODO Add an exclusion file
(set -x; rsync "${RSYNC_FLAGS[@]}" "${source_directory}" "${temp_backup}") && \
    (set -x; mv "${temp_backup}" "${new_backup}") && cleanup
echo
¹ Actually the "name of the repository which contains the backup", of course.
2 Answers
The script is nicely written. I only have minor suggestions that are barely more than nitpicks.
Function declaration style
Instead of this:
function scheme {
The generally preferred style for declaring functions is this:
scheme() {
Redundant local variable
The local variable token is redundant here:
function scheme {
    local token="$@"
    echo "hourly ${token}"
}
You could simplify to:
echo "hourly $@"
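Putting the two suggestions together, a minimal sketch of the simplified function might look like this. One deliberate change on my part: it uses "$*" instead of "$@", because "$*" joins all arguments into a single word, which is what a string destined for echo needs. The sample arguments are made up.

```bash
#!/bin/bash
# POSIX-style declaration, no intermediate local variable
scheme() {
    # "$*" expands to all arguments joined by spaces, as one word
    echo "hourly $*"
}

name=$(scheme 3-05-2016 a 05h26)
echo "${name}"
```

With echo, "$@" happens to produce the same text, so this is a matter of stating intent rather than a behaviour change.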
Simplify condition
This condition can be simplified:
if (( ${#backups[@]} > 0 )); then
    latest_backup="${backups[0]}"
else
    latest_backup=""
fi
To just this:
latest_backup="${backups[0]}"
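A quick sketch of why the shortcut works: indexing an empty array simply expands to the empty string. One caveat, which is my addition and not part of the script: if the script ever enables set -u (nounset), the plain expansion becomes an error, and a default value such as ${backups[0]:-} avoids that.

```bash
#!/bin/bash
# No previous backup exists yet
backups=()

# Expands to the empty string when the array has no element 0
latest_backup="${backups[0]}"
if [[ -z "${latest_backup}" ]]; then
    echo "No previous backup"
fi

# Under nounset, supply an explicit empty default instead
(
    set -u
    latest_backup="${backups[0]:-}"
    echo "latest_backup is '${latest_backup}'"
)
```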
Instead of this:
if [[ "${password_file}" != "" ]]; then
You can omit the != "":
if [[ "${password_file}" ]]; then
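As a small illustration, here is a sketch of the shortened test combined with the check for an rsync daemon source. The function name needs_password_file and the sample values are hypothetical, and the glob match == *::* is my substitution for the original =~ "::".

```bash
#!/bin/bash
# Hypothetical helper: prints "yes" when a password file is set
# and the source looks like an rsync daemon path (host::module)
needs_password_file() {
    local password_file="$1" source_directory="$2"
    # A bare string inside [[ ]] is true when it is non-empty
    if [[ "${password_file}" && "${source_directory}" == *::* ]]; then
        echo "yes"
    else
        echo "no"
    fi
}

needs_password_file "/etc/backup/passwd" "host::module"
needs_password_file "" "host::module"
needs_password_file "/etc/backup/passwd" "/path/to/data"
```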
Don't repeat yourself
The echo statement is duplicated for the sake of underlining:
echo "Backing up ${source_directory}"
echo "Backing up ${source_directory}" | sed "s/./=/g"
It would be good to create a helper function for this purpose:
print_heading() {
    echo "$@"
    echo "$@" | sed "s/./=/g"
}
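If you would rather not spawn sed at all, a variant using bash parameter expansion is possible. This is only a sketch of an alternative, not something the script requires: ${title//?/=} replaces every character of the title with an equals sign.

```bash
#!/bin/bash
print_heading() {
    local title="$*"
    # Second argument: the title with every character replaced by '='
    printf '%s\n' "${title}" "${title//?/=}"
}

print_heading "Backing up /path/to/data"
```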
This looks exceptionally good. But per your request, I see a few improvement possibilities...
Backup file name
The scheme() function is not necessary unless you need it to do several other operations not shown.
The 'Command Substitution' used to build the string should also be within the quotes to avoid unexpected interpretation by the shell.
Spaces in filenames require total accuracy in quoting to keep straight, which is the most confusing part of bash scripting, so your life will be a lot easier if you can avoid them.
Note too that each Command Substitution $(...) is a new context, so we can use double quotes within them without escaping. Don't be confused by the IDE reversing the colors at each level. That's just the way they work.
So these lines...
backup_directory="/path/to/backups"
:
new_backup="${backup_directory}"/$(scheme $(date +"%-d-%m-%Y a %Hh%M") )
Would be more reliable like this...
backup_directory="/path/to/backups"
:
new_backup="${backup_directory}/hourly_$(date +"%Y-%m-%d_a_%Hh%M")"
Running this snippet and echoing $new_backup gives me...
path/to/backups/hourly_2016-05-03_a_05h26
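A side benefit of putting the year first is that plain text sorting then matches chronological order. The sketch below uses made-up backup names in that format, and forces the C locale so the comparison is byte-wise.

```bash
#!/bin/bash
# Made-up names in the year-month-day format, in arbitrary order
names=(
    "hourly_2016-05-03_a_05h26"
    "hourly_2016-04-30_a_14h17"
    "hourly_2016-05-03_a_05h29"
)

# Reverse lexicographic order is newest first with this format
mapfile -t newest_first < <(printf '%s\n' "${names[@]}" | LC_ALL=C sort -r)

printf '%s\n' "${newest_first[@]}"
```

With the original day-first format, 30-04-2016 would sort after 3-05-2016, which is why the name alone could not be used for ordering.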
Alternative to find
A better solution here relies on two features of bash that are not well understood...
Pathname Expansion - Pattern Matching
Wildcard expansion is done by bash before the command it belongs to ever runs. We thus don't need find or ls or anything else to get a list of the files in a directory. If we need the full path though, we do need to prefix it on each one with something like printf.
printf applies format to all arguments
printf has an odd feature that's just the thing we need here. From the manpage: "The format is reused as necessary to consume all of the arguments."
printf will reuse the format string on each filename returned by Pathname Expansion.
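Here is a small sketch of both features working together, using a throwaway directory and made-up file names. The glob is expanded by the shell into one argument per file, and printf applies the format once per argument.

```bash
#!/bin/bash
# Scratch directory with two hypothetical entries, one containing a space
dir=$(mktemp -d)
touch "${dir}/first" "${dir}/second file"

# One format string, reused for every path the glob expands to
listing=$(printf '[%s]\n' "${dir}"/*)
echo "${listing}"
```

Note that if the directory is empty, the unexpanded pattern itself is passed to printf unless nullglob is set.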
Thus this code...
backups=()
while read -r -d ''; do
    backups+=("${REPLY}")
done < <( find "${backup_directory}" -mindepth 1 -maxdepth 1 -name "$(scheme \*)" -printf "%A@:%p\0" | \
    sort -z -t: -n -r | \
    tr '\n\0' '\0\n' | cut -d: -f2 - | tr '\n\0' '\0\n' \
)
Could be replaced with...
backups=()
while read -r; do
    backups+=("${REPLY}")
done < <( printf "%s\n" "${backup_directory}"/* | sort -r )
The input to the while loop should look something like this...
> printf "%s\n" "${backup_directory}"/* | sort -r
path/to/backups/hourly_2016-05-03_a_05h29
path/to/backups/hourly_2016-05-03_a_05h26
path/to/backups/hourly_2016-05-03_a_05h25
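Going one step further, the loop and the pipe can be dropped entirely by expanding the glob straight into an array. This is a sketch under two assumptions of mine: backup names sort chronologically (as with the date format above), and nullglob is enabled so an empty directory yields an empty array. The directory and names are throwaway test data.

```bash
#!/bin/bash
# Without nullglob, a pattern with no match would stay as a literal string
shopt -s nullglob

# Throwaway directory with three made-up backups
backup_directory=$(mktemp -d)
mkdir "${backup_directory}/hourly_2016-05-03_a_05h25" \
      "${backup_directory}/hourly_2016-05-03_a_05h26" \
      "${backup_directory}/hourly_2016-05-03_a_05h29"

# Pathname expansion returns the matches sorted in ascending order
candidates=("${backup_directory}"/hourly_*)

# Walk the array backwards to get newest first
backups=()
for (( i = ${#candidates[@]} - 1; i >= 0; i-- )); do
    backups+=("${candidates[i]}")
done

printf '%s\n' "${backups[@]}"
```

This version is also safe for names containing spaces or newlines, since nothing is ever parsed back from text.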
This wasn’t clear in the original question so I edited it in: I want ${backup} to contain the names of the backups sorted from newest to oldest. I could use the file names to do that, by reversing ${backup}, if I used a date such that alphabetical order is chronological order, but I want to allow more user-friendly names. I tried using stat --printf kind of like your printf, but its behaviour with no argument is not to do nothing. – Édouard, Apr 30, 2016 at 14:17
I've modified the date format and added sort -r to present old backups in reverse order as requested. – DocSalvager, May 3, 2016 at 9:33
[...] find when there is no other way. The syntax is just f***ing bizarre to me and thus highly error-prone. Good job on thorough use of double-quotes. A | alone will also do line continuation in bash.
[...] find is very welcome
[...] ${backup} is meaningful at that wasn’t clear in the previous version.