I would like to ask for a review of the following Bash script, which cleans up and datestamps file names, with an optional feature to rename the file.
The main use case is renaming files downloaded from Overleaf, so that Angebot_produkt_kunde-2.pdf (resulting from the second download from the project without moving the file Angebot_produkt_kunde.pdf out of the way first) gets renamed to Angebot_produkt_kunde-2-2024-06-10_1248.pdf.
If the file name already contains a datestamp, it is updated if the string to append is date.
Another feature is appending custom suffixes, as in
angebot_produkt_Müller-Maßnahmen-5 "Neuer Entwurf" -> Angebot-produkt-mueller-massnahmen-5-neuer-entwurf
In this example, the file name has no extension.
I want to make sure that the file remains in the directory it is in and to skip files whose names start with a dot. Files without extension or with more than one extension are handled correctly.
The file names are also cleaned up so they will not cause problems when archived or moved between file systems. Umlauts are transliterated, spaces are replaced by hyphens, and leading, trailing, and excessive internal dots or hyphens are removed. Finally, the new name is converted to lowercase, except for a capitalized first letter.
Find test cases at the end of this question.
Specific review questions:
Can this script be written more concisely? I am particularly unhappy about
sed -E -e 's/(^-|([\.-]*)$)//g' \
    -e 's/-\././g'
which is needed to get to the new name Txt.pdf instead of Txt-.pdf when called as follows:
~/bin/newname.sh --printonly '/tmp/@#$.txt.pdf' '!'
Is there an accepted name for the base name of a file, but without any extension? That is, in the name /tmp/data.csv, the string /tmp is the directory name, the string data.csv is the base name, the string .csv is the extension, but what is a good name for the string data on its own?
Should I rename the option --printonly to -d (abbreviating "dry-run")? Is anything off with my handling of options?
Have I missed any obvious edge cases for testing? (A non-obvious edge case would be a file name consisting entirely of Hanzi characters, for example. Such a case will not occur in the environment the script is expected to run in.)
What about locales? I am a first-time Cygwin user and am mildly surprised to see echo "$LC_ALL" not produce any output. Should I replace A-Za-z0-9 with [[:alnum:]], or are there any disadvantages to that?
Test cases:
#!/bin/bash
{
~/bin/newname.sh --printonly Angebot_produkt_kunde_Maßnahmen-6_2024-06-27_1620.pdf date
~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-4__2024-06-27_1806.txt.pdf date
~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-5 "Neuer Entwurf"
~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-3.pdf "Neuer Entwurf"
~/bin/newname.sh --printonly "Angebot 5.pdf" " Neuer Entwurf "
~/bin/newname.sh --printonly Report_kunde_17_06_2024____23_06_2024___LuaLaTeX.pdf date
~/bin/newname.sh --printonly ~/angebot_produkt_Müller-Maßnahmen-3.pdf "Neuer Entwurf"
~/bin/newname.sh --printonly /tmp/files/"Angebot 5.pdf" " Neuer Entwurf "
~/bin/newname.sh --printonly /tmp/files/"Angebot 5%% Rabatt.pdf" " Neuer Entwurf "
~/bin/newname.sh --printonly .configfile date
~/bin/newname.sh --printonly /tmp/.configfile date
~/bin/newname.sh --printonly Angebot_produkt_kunde-2.pdf date
~/bin/newname.sh --printonly '/tmp/@#$%.txt' '!'
~/bin/newname.sh --printonly '/tmp/@#$.txt.pdf' '!'
~/bin/newname.sh --printonly '/tmp/@#$*' '!'
} | sed -e "s/$USER/user/g" | column -s';' -t
Output:
New filename without renaming: Angebot-produkt-kunde-massnahmen-6-2024-06-28-1644.pdf dir=.
New filename without renaming: Angebot-produkt-mueller-massnahmen-4-2024-06-28-1644.txt.pdf dir=.
New filename without renaming: Angebot-produkt-mueller-massnahmen-5-neuer-entwurf dir=.
New filename without renaming: Angebot-produkt-mueller-massnahmen-3-neuer-entwurf.pdf dir=.
New filename without renaming: Angebot-5-neuer-entwurf.pdf dir=.
New filename without renaming: Report-kunde-17-06-2024-23-06-2024-lualatex-2024-06-28-1644.pdf dir=.
New filename without renaming: Angebot-produkt-mueller-massnahmen-3-neuer-entwurf.pdf dir=/cygdrive/c/Users/user
New filename without renaming: Angebot-5-neuer-entwurf.pdf dir=/tmp/files
New filename without renaming: Angebot-5-prozent-rabatt-neuer-entwurf.pdf dir=/tmp/files
Will not rename dotfile .configfile
Will not rename dotfile /tmp/.configfile
New filename without renaming: Angebot-produkt-kunde-2-2024-06-28-1644.pdf dir=.
New filename without renaming: Prozent.txt dir=/tmp
New filename without renaming: Txt.pdf dir=/tmp
New filename without renaming: Wtf dir=/tmp
The script:
#!/bin/bash
# Clean up and datestamp file names; optionally rename the file
# Thure Dührsen, 2024-06-28
# Environment: Cygwin
# Function to modify the base name of the file
clean_name() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: clean_name <filename>"
        exit 1
    fi
    echo "$1" | sed -E -e 's/(Ä|ä)/AE/g' \
        -e 's/(Ö|ö)/OE/g' \
        -e 's/(Ü|ü)/UE/g' \
        -e 's/ß/SS/g' \
        -e 's/%+/-Prozent-/g' \
        -e 's/[[:space:]]/-/g' |
        tr -dc 'A-Za-z .0-9_-' |
        tr '_' '-' |
        tr -s '-' |
        sed -E -e 's/(^-|([\.-]*)$)//g' \
            -e 's/-\././g' |
        tr '[:upper:]' '[:lower:]'
}
# Function to append string before the final dot or at the end if no dot exists
append_string_to_filename() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: append_string_to_filename <filename> <string>"
        exit 1
    fi
    local filename="$(clean_name "$1")"
    local append_string="$(clean_name "$2")"
    local extension=""
    local base_name=""
    if [[ "$filename" =~ ([^.]+)(\..*)$ ]]; then
        base_name="${BASH_REMATCH[1]}"
        extension="${BASH_REMATCH[2]}"
    else
        base_name="$filename"
    fi
    local new_base_name="${base_name%.*}"'_'"${append_string}""${extension}"
    local clean_base_name="$(clean_name "$new_base_name")"
    # Clean_base_name might no longer contain alphanumeric characters
    if [[ ! "$clean_base_name" =~ ^[A-Za-z0-9]+([-_A-Za-z.0-9]*[A-Za-z0-9])?$ ]]; then
        clean_base_name='wtf'"$extension"
    fi
    echo "${clean_base_name^}"
}
if [[ "$1" == "--printonly" ]]; then
    printonly=true
    shift
else
    printonly=false
fi
if [[ "$#" -lt 2 || "$#" -gt 3 ]]; then
    echo "Usage: $0 [--printonly] <file> string_to_append"
    exit 1
fi
if [ "$printonly" == "false" ]
then
    if [ ! -e "$1" ]
    then
        echo "file $1 does not exist"
        exit 2
    fi
fi
fullpath="$1"
dir="$(dirname "$fullpath")"
file_name="$(basename -- "$fullpath")"
string_to_append="$2"
if [[ "$file_name" =~ ^\. ]]
then
    echo "Will not rename dotfile $fullpath"
    exit 2
fi
if [[ "$string_to_append" == "date" ]]; then
    string_to_append="$(date '+%F_%H%M')"
    file_name="$(echo "$file_name" | sed -E -e 's/_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}_[[:digit:]]{4}//')"
fi
new_file_name="$(append_string_to_filename "$file_name" "$string_to_append")"
if [ "$printonly" == "false" ]
then
    if ! mv -n -v "$dir"/"$file_name" "$dir"/"$new_file_name"
    then
        echo "Failed to rename file $fullpath"
        exit 2
    fi
else
    echo "New filename without renaming: $new_file_name ; dir=$dir"
fi
2 Answers
use case
The two main features you describe are
- uniquifying "same" name filespecs
- case smashing and otherwise regularizing the character set
I'm glad to see the second one broken out as a nice clean_name helper function.
There is maybe a lost opportunity when we rename Angebot_produkt_kunde-2.pdf to Angebot_produkt_kunde-2-2024-06-10_1248.pdf. It seems desirable to strip that now-meaningless -2- in the middle.
It makes me sad that Müller-Maßnahmen is harder to deal with in this day and age than Mueller-Massnahmen. Apparently there is yet more infrastructure work to be done.
current time vs mtime
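Consider using stat -c %y $FILE when obtaining a timestamp that will become part of the filename, so that the stamp reflects the file's modification time rather than the moment the script happens to run.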
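A minimal sketch of that idea, assuming GNU coreutils (which Cygwin ships). It swaps in `date -r FILE`, which formats a file's modification time directly, instead of parsing the output of `stat -c %y`. The helper name `mtime_stamp` is made up for illustration and is not part of the reviewed script.

```shell
#!/bin/bash
# Hypothetical helper: print a file's modification time in the same
# format the script uses for the current time (%F_%H%M).
# Assumes GNU date, whose -r option reads the mtime of the given file.
mtime_stamp() {
    date -r "$1" '+%F_%H%M'
}

# Usage: stamp a scratch file whose mtime is set explicitly (GNU touch -d).
tmpfile="$(mktemp)"
touch -d '2024-06-10 12:48:00' "$tmpfile"
mtime_stamp "$tmpfile"   # prints 2024-06-10_1248
rm -f "$tmpfile"
```

In the script, the call `date '+%F_%H%M'` could then become `mtime_stamp "$fullpath"`, at the cost of requiring the file to exist even in print-only mode.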
character class
more concisely?
sed -E -e 's/(^-|([\.-]*)$)//g'
Well, within a character class there's no need to \ backwhack the . dot:
sed -E -e 's/(^-|([.-]*)$)//g'
And it's unclear why you'd want that inner ( ... ) capturing group.
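Both points can be checked side by side. The sketch below compares the expression from the question with a version that drops the backslash and both groups; the function names are illustrative. One caveat: inside a bracket expression the backslash is itself a literal member, so the original also strips a trailing backslash, which does not matter in the script because `tr -dc` has already deleted backslashes by that stage.

```shell
#!/bin/bash
# Expression as written in the question.
trim_original() { sed -E -e 's/(^-|([\.-]*)$)//g'; }

# Same substitution without the backslash and without the groups:
# remove one leading hyphen, and any run of dots/hyphens at the end.
trim_simplified() { sed -E -e 's/^-|[.-]*$//g'; }

# Print input, original result and simplified result for a few names.
for s in '-txt-.' 'angebot-5-' 'a.b.pdf' '-'; do
    printf '%s -> %s | %s\n' "$s" \
        "$(printf '%s\n' "$s" | trim_original)" \
        "$(printf '%s\n' "$s" | trim_simplified)"
done
```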
name minus extension
in data.csv, a good name for data?
I recommend stem for this line:
base_name="${BASH_REMATCH[1]}"
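For what it's worth, "stem" is also the term Python's pathlib uses. As a sketch of how the split could look with that naming, the following uses parameter expansion instead of a regex. It splits at the first dot, which matches the regex in the question for names that do not start with a dot; `split_name` and its output format are assumptions made for illustration.

```shell
#!/bin/bash
# Split a file name at its first dot into stem and extension.
# Prints "stem|extension" so both parts are visible.
split_name() {
    local name="$1"
    local stem="${name%%.*}"            # everything before the first dot
    local extension="${name#"$stem"}"   # the rest, including the dot
    printf '%s|%s\n' "$stem" "$extension"
}

split_name 'data.csv'         # data|.csv
split_name 'report.txt.pdf'   # report|.txt.pdf
split_name 'README'           # README|
```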
option synonyms
Should I rename the option --printonly to -d
Sure, go for it. Other utilities offer a familiar dry-run option, though often spelled -n. A -d at first glance looks like --debug. Do be sure to offer --dry-run, as well.
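One way to accept several spellings is a small `case` loop. This is only a sketch: `parse_options`, the global `dry_run` and the global array `args` are hypothetical names, and keeping `--printonly` as a synonym is an assumption made for backward compatibility.

```shell
#!/bin/bash
# Accept -n, --dry-run and the existing --printonly as synonyms.
# Sets the global "dry_run" and collects the remaining arguments in the
# global array "args". A literal "--" ends option processing, so file
# names that start with a hyphen can still be passed.
parse_options() {
    dry_run=false
    args=()
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -n|--dry-run|--printonly) dry_run=true ;;
            --) shift; args+=("$@"); break ;;
            -*) echo "Unknown option: $1" >&2; return 1 ;;
            *)  args+=("$1") ;;
        esac
        shift
    done
}

parse_options --dry-run 'Angebot 5.pdf' date
echo "dry_run=$dry_run file=${args[0]} suffix=${args[1]}"
```

Unlike the check on `"$1"` in the question, this form does not care where on the command line the option appears.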
Also, consider setting LC_ALL yourself. I assume that [[:alnum:]] will sometimes match ß and the umlaut vowels, depending on the locale in effect.
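To illustrate why pinning the locale makes the result predictable: under `LC_ALL=C` the class `[:alnum:]` covers only ASCII letters and digits. The sketch below assumes GNU `tr`; the function name is hypothetical. What other locales match is exactly the part that may vary between systems.

```shell
#!/bin/bash
# With LC_ALL=C, [:alnum:] matches only ASCII letters and digits, so the
# bytes that make up "ü" and "ß" are deleted no matter which locale the
# calling shell runs in. The trailing "-" in the set keeps hyphens.
keep_ascii_alnum() {
    LC_ALL=C tr -dc '[:alnum:]-'
}

printf '%s' 'Müller-Maßnahmen' | keep_ascii_alnum   # prints Mller-Manahmen
echo
```

This also shows why the transliteration has to run before the deletion step: once the locale is pinned, anything non-ASCII that reaches `tr -dc` is simply dropped.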
extra character match
I don't understand this "delete, complement" expression:
tr -dc 'A-Za-z .0-9_-' |
There's a SPACE in the middle, after "z". But we already nuked all [[:space:]] characters when we turned them into - dash. I'm reading this as enumerating "characters we wish to preserve", so the SPACE seems misleading.
Consider moving this tr into the "umlaut" sed invocation:
tr '_' '-' |
The tr --squeeze is very nice.
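A sketch of how those two suggestions might combine: the last expression of the first sed call also turns underscores into hyphens, so the keep-list for `tr -dc` needs neither the space nor the underscore, and the separate `tr '_' '-'` stage disappears. The function name `clean_chars` is illustrative, and the umlaut expressions are left out to keep the example short.

```shell
#!/bin/bash
# Whitespace and underscores become hyphens in one sed expression;
# tr -dc then keeps only letters, digits, dots and hyphens, and
# tr -s squeezes runs of hyphens into a single one.
clean_chars() {
    sed -E -e 's/[[:space:]_]/-/g' |
        tr -dc 'A-Za-z.0-9-' |
        tr -s '-'
}

printf '%s' 'Angebot 5__neu.pdf' | clean_chars   # prints Angebot-5-neu.pdf
echo
```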
anonymous
clean_base_name='wtf'"$extension"
Consider adopting the name prefix anon instead.
single responsibility
The mainline code
- cracks argv,
- computes a new name, and
- either prints new name or interacts with filesystem
Consider breaking out helpers, which the mainline calls.
Also, we have a perfectly nice ! -e "$1" check for whether file exists, which appears too early in the flow, and therefore needs an extra $printonly conditional.
argument order
Accepting "filespec date" seems less convenient than accepting "date filespec".
The script seems to really want to have xargs invoke it:
find /some/where -type f -name '*.pdf' | xargs newname.sh date
And then filename(s) get tacked on the end of argv.
As written, the OP code would need xargs -n 1, but bolting on a for loop is easy once you have a few helpers.
Given some of the crazy whitespace characters you need to strip, find -print0 / xargs -0 might also be needed.
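A sketch of the "suffix first, then any number of files" calling convention. Here `rename_one` is only a stand-in for the script's real renaming logic and merely reports its arguments; both function names are hypothetical.

```shell
#!/bin/bash
# Stand-in for the real work: report which file would be renamed.
rename_one() {
    local suffix="$1" file="$2"
    printf 'would rename <%s> using suffix <%s>\n' "$file" "$suffix"
}

# First argument is the suffix (or "date"); every further argument is a
# file. Quoting "$@" keeps names with spaces intact.
rename_all() {
    local suffix="$1"
    shift
    local file
    for file in "$@"; do
        rename_one "$suffix" "$file"
    done
}

rename_all date 'Angebot 5.pdf' 'Report.pdf'
```

With the mainline ending in `rename_all "$@"`, the null-delimited form of the pipeline above would be `find /some/where -type f -name '*.pdf' -print0 | xargs -0 newname.sh date`.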
-
Thank you very much for the thorough review. As for the removal of the version numbers,
Angebot_produkt_kunde-2-2024-06-10_1248.pdf -> Angebot_produkt_kunde-2024-06-10_1248.pdf
I find that GNU sed does not support positive lookahead, so I might settle for something like
echo Angebot_produkt_kunde-2-2024-06-10_1248.pdf | sed -E 's/(.*)(-[[:digit:]]+)(-[0-9]{4}-[0-9]{2}-[0-9]{2}[_-][0-9]{4})/\1\3/'
I will definitely take your other suggestions into account as well and, in a few days, probably ask a new question for the rewritten version. – Thure Dührsen, Jun 29, 2024 at 14:27
I had in mind something simpler, less about regexes and more about filesystem tests. Match a "-2" or a "-\d" suffix, whatever. If we find foo.pdf and foo-2.pdf, then assume the "multiple download versions" scenario you mentioned, strip "-2", and replace it with date. I suppose that's one level above the OP code, up at the level of iterating across all entries in a directory. – J_H, Jun 29, 2024 at 14:58
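A rough sketch of such a filesystem test, under the assumption that the counter always sits directly before a single extension. `strip_download_counter` is a hypothetical helper, and the heuristic is only as reliable as the assumption that STEM-N.EXT next to STEM.EXT really means "another download".

```shell
#!/bin/bash
# If NAME looks like STEM-N.EXT and STEM.EXT exists in DIR, print
# STEM.EXT; otherwise print NAME unchanged.
strip_download_counter() {
    local dir="$1" name="$2"
    if [[ "$name" =~ ^(.+)-[0-9]+(\.[^.]+)$ ]]; then
        local original="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
        if [[ -e "$dir/$original" ]]; then
            printf '%s\n' "$original"
            return
        fi
    fi
    printf '%s\n' "$name"
}

# Usage with a scratch directory.
demo="$(mktemp -d)"
touch "$demo/foo.pdf" "$demo/foo-2.pdf" "$demo/bar-2.pdf"
strip_download_counter "$demo" foo-2.pdf   # prints foo.pdf
strip_download_counter "$demo" bar-2.pdf   # prints bar-2.pdf (no bar.pdf)
rm -rf "$demo"
```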
-
Yes, though I am realising this is far from foolproof. Considering Angebot-5.pdf, it is unclear whether this is the fifth offer or the fifth revision of Angebot.pdf. Meanwhile, the requirements are changing in other ways I will have to consult my pillow on. – Thure Dührsen, Jul 2, 2024 at 7:40
Regarding transliteration of German characters, you could use iconv which is quite probably installed on your system. It is a convenient tool to normalize Unicode strings and convert from one charset to another.
So:
echo 'Müller-Maßnahmen' | iconv -f UTF-8 -t ASCII//TRANSLIT
Muller-Massnahmen
This is a simplistic example but it does the trick. There are many more options and iconv should be locale-aware as well (or even locale-dependent actually). As always, check the docs.
In fact the matter (Unicode transliteration) is quite difficult and the most reasonable approach is to use dedicated tools like iconv for the job. This will also make the code more flexible (not limited to one language subset).
In fact, this works in French too:
echo 'œuf de Pâques' | iconv -f UTF-8 -t ASCII//TRANSLIT
oeuf de Paques
with the usual disclaimer that "it works on my machine/locale".
-
Kate, "use iconv!" is valuable advice so I gave it +1. But it's not quite right for ä, ö, ü. On a 1970's American typewriter the standard was to hit two keystrokes for those: ae, oe, ue. (And not mess around with backspace + " double-quote.) So when you pick up a box of Mueller's pasta, it's perfectly clear the surname is Müller. Iconv is doing reasonable things with the example characters you mention. But for those three German vowels, the OP sed transformation is what makes the most sense for the presented use case. – J_H, Jun 29, 2024 at 20:07
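The two approaches do not exclude each other. A sketch, with a hypothetical function name: handle the German letters explicitly first, so that whatever non-ASCII characters remain could afterwards be passed through iconv. The sed part below uses plain literal substitutions, so it does not depend on the locale as long as script and input share an encoding; the script in the question maps to uppercase AE/OE/UE/SS instead and lowercases everything later.

```shell
#!/bin/bash
# Transliterate German umlauts and ß the way the question's sed call
# intends (ae/oe/ue/ss), preserving the case of the first letter.
german_translit() {
    sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g' \
        -e 's/Ä/Ae/g' -e 's/Ö/Oe/g' -e 's/Ü/Ue/g' \
        -e 's/ß/ss/g'
}

printf '%s\n' 'Müller-Maßnahmen' | german_translit   # prints Mueller-Massnahmen
```

Appending `| iconv -f UTF-8 -t ASCII//TRANSLIT` would then cover other accented characters, with the caveat already given in the answer that its output depends on the machine and locale.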