Clean up and datestamp file names

Question 1

I would like to ask for a review of the following Bash script which cleans up and datestamps file names, with an optional feature to rename the file.

The main use case is renaming files downloaded from Overleaf, so that Angebot_produkt_kunde-2.pdf (resulting from the second download from the project without moving the file Angebot_produkt_kunde.pdf out of the way first) gets renamed to Angebot_produkt_kunde-2-2024-06-10_1248.pdf.

If the file name already contains a datestamp, that datestamp is replaced with the current one when the string to append is the literal word date.

Another feature is appending custom suffixes, as in

angebot_produkt_Müller-Maßnahmen-5 "Neuer Entwurf" -> Angebot-produkt-mueller-massnahmen-5-neuer-entwurf

In this example, the file name has no extension.

I want to make sure that the file remains in the directory it is in and to skip files whose names start with a dot. Files without extension or with more than one extension are handled correctly.

The file names are also cleaned up so they will not cause problems when archived or moved between file systems. Umlauts are transliterated and spaces are replaced with hyphens; leading, trailing, and excessive internal dots or hyphens are removed. Finally, the new name is converted to lowercase, except that its first letter is capitalized.

Find test cases at the end of this question.

Specific review questions:

Can this script be written more concisely? I am particularly unhappy about
```
 sed -E -e 's/(^-|([\.-]*)$)//g' \
 -e 's/-\././g'
```
which is needed to get to the new name Txt.pdf instead of Txt-.pdf when called as follows:
```
 ~/bin/newname.sh --printonly '/tmp/@#$.txt.pdf' '!'
```
Is there an accepted name for the base name of a file, but without any extension? That is, in the name /tmp/data.csv, the string /tmp is the directory name, the string data.csv is the base name, the string .csv is the extension, but what is a good name for the string data on its own?
Should I rename the option --printonly to -d (abbreviating "dry-run")? Is anything off with my handling of options?
Have I missed any obvious edge cases for testing? (A non-obvious edge case would be a file name consisting entirely of Hanzi characters, for example. Such a case will not occur in the environment the script is expected to run in.)
What about locales? I am a first-time Cygwin user and am mildly surprised to see echo "$LC_ALL" not produce any output. Should I replace A-Za-z0-9 with [[:alnum:]], or are there any disadvantages to that?

Test cases:

 #!/bin/bash
 
 { 
 ~/bin/newname.sh --printonly Angebot_produkt_kunde_Maßnahmen-6_2024-06-27_1620.pdf date
 ~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-4__2024-06-27_1806.txt.pdf date
 
 ~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-5 "Neuer Entwurf"
 ~/bin/newname.sh --printonly angebot_produkt_Müller-Maßnahmen-3.pdf "Neuer Entwurf"
 ~/bin/newname.sh --printonly "Angebot 5.pdf" " Neuer Entwurf "
 ~/bin/newname.sh --printonly Report_kunde_17_06_2024____23_06_2024___LuaLaTeX.pdf date
 
 ~/bin/newname.sh --printonly ~/angebot_produkt_Müller-Maßnahmen-3.pdf "Neuer Entwurf"
 ~/bin/newname.sh --printonly /tmp/files/"Angebot 5.pdf" " Neuer Entwurf "
 ~/bin/newname.sh --printonly /tmp/files/"Angebot 5%% Rabatt.pdf" " Neuer Entwurf "
 
 ~/bin/newname.sh --printonly .configfile date
 ~/bin/newname.sh --printonly /tmp/.configfile date
 
 ~/bin/newname.sh --printonly Angebot_produkt_kunde-2.pdf date
 
 ~/bin/newname.sh --printonly '/tmp/@#$%.txt' '!'
 ~/bin/newname.sh --printonly '/tmp/@#$.txt.pdf' '!'
 ~/bin/newname.sh --printonly '/tmp/@#$*' '!'
 
 } | sed -e "s/$USER/user/g" | column -s';' -t

Output:

 New filename without renaming: Angebot-produkt-kunde-massnahmen-6-2024-06-28-1644.pdf dir=.
 New filename without renaming: Angebot-produkt-mueller-massnahmen-4-2024-06-28-1644.txt.pdf dir=.
 New filename without renaming: Angebot-produkt-mueller-massnahmen-5-neuer-entwurf dir=.
 New filename without renaming: Angebot-produkt-mueller-massnahmen-3-neuer-entwurf.pdf dir=.
 New filename without renaming: Angebot-5-neuer-entwurf.pdf dir=.
 New filename without renaming: Report-kunde-17-06-2024-23-06-2024-lualatex-2024-06-28-1644.pdf dir=.
 New filename without renaming: Angebot-produkt-mueller-massnahmen-3-neuer-entwurf.pdf dir=/cygdrive/c/Users/user
 New filename without renaming: Angebot-5-neuer-entwurf.pdf dir=/tmp/files
 New filename without renaming: Angebot-5-prozent-rabatt-neuer-entwurf.pdf dir=/tmp/files
 Will not rename dotfile .configfile
 Will not rename dotfile /tmp/.configfile
 New filename without renaming: Angebot-produkt-kunde-2-2024-06-28-1644.pdf dir=.
 New filename without renaming: Prozent.txt dir=/tmp
 New filename without renaming: Txt.pdf dir=/tmp
 New filename without renaming: Wtf dir=/tmp

The script:

 #!/bin/bash
 
 # Clean up and datestamp file names; optionally rename the file
 
 # Thure Dührsen, 2024-06-28
 
 # Environment: Cygwin
 
 # Function to modify the base name of the file
 clean_name() {
     if [ "$#" -ne 1 ]; then
         echo "Usage: clean_name <filename>"
         exit 1
     fi
 
     echo "$1" | sed -E -e 's/(Ä|ä)/AE/g' \
         -e 's/(Ö|ö)/OE/g' \
         -e 's/(Ü|ü)/UE/g' \
         -e 's/ß/SS/g' \
         -e 's/%+/-Prozent-/g' \
         -e 's/[[:space:]]/-/g' |
         tr -dc 'A-Za-z .0-9_-' |
         tr '_' '-' |
         tr -s '-' |
         sed -E -e 's/(^-|([\.-]*)$)//g' \
             -e 's/-\././g' |
         tr '[:upper:]' '[:lower:]'
 }
 
 # Function to append string before the final dot or at the end if no dot exists
 append_string_to_filename() {
     if [ "$#" -ne 2 ]; then
         echo "Usage: append_string_to_filename <filename> <string>"
         exit 1
     fi
 
     local filename="$(clean_name "$1")"
     local append_string="$(clean_name "$2")"
 
     local extension=""
     local base_name=""
 
     if [[ "$filename" =~ ([^.]+)(\..*)$ ]]; then
         base_name="${BASH_REMATCH[1]}"
         extension="${BASH_REMATCH[2]}"
     else
         base_name="$filename"
     fi
 
     local new_base_name="${base_name%.*}"'_'"${append_string}""${extension}"
     local clean_base_name="$(clean_name "$new_base_name")"
 
     # Clean_base_name might no longer contain alphanumeric characters
     if [[ ! "$clean_base_name" =~ ^[A-Za-z0-9]+([-_A-Za-z.0-9]*[A-Za-z0-9])?$ ]]; then
         clean_base_name='wtf'"$extension"
     fi
 
     echo "${clean_base_name^}"
 }
 
 if [[ "$1" == "--printonly" ]]; then
     printonly=true
     shift
 else
     printonly=false
 fi
 
 if [[ "$#" -lt 2 || "$#" -gt 3 ]]; then
     echo "Usage: $0 [--printonly] <file> string_to_append"
     exit 1
 fi
 
 if [ "$printonly" == "false" ]
 then
     if [ ! -e "$1" ]
     then
         echo "file $1 does not exist"
         exit 2
     fi
 fi
 
 fullpath="$1"
 dir="$(dirname "$fullpath")"
 file_name="$(basename -- "$fullpath")"
 string_to_append="$2"
 
 if [[ "$file_name" =~ ^\. ]]
 then
     echo "Will not rename dotfile $fullpath"
     exit 2
 fi
 
 if [[ "$string_to_append" == "date" ]]; then
     string_to_append="$(date '+%F_%H%M')"
     file_name="$(echo "$file_name" | sed -E -e 's/_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}_[[:digit:]]{4}//')"
 fi
 
 new_file_name="$(append_string_to_filename "$file_name" "$string_to_append")"
 
 if [ "$printonly" == "false" ]
 then
     if ! mv -n -v "$dir"/"$file_name" "$dir"/"$new_file_name"
     then
         echo "Failed to rename file $fullpath"
         exit 2
     fi
 else
     echo "New filename without renaming: $new_file_name ; dir=$dir"
 fi

Question 2

use case

The two main features you describe are

uniquifying "same" name filespecs
case smashing and otherwise regularizing the character set

I'm glad to see the second one broken out as a nice clean_name helper function.

There is maybe a lost opportunity when we rename

Angebot_produkt_kunde-2.pdf to
Angebot_produkt_kunde-2-2024-06-10_1248.pdf

It seems desirable to strip that now-meaningless -2- in the middle.

It makes me sad that Müller-Maßnahmen is harder to deal with in this day and age than Mueller-Massnahmen. Apparently there is yet more infrastructure work to be done.

current time vs mtime

Consider using stat -c %y $FILE when obtaining a timestamp that will become part of the filename.
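A minimal sketch of the mtime idea, assuming GNU coreutils (which is what Cygwin provides): GNU date accepts -r FILE and formats that file's modification time, so the stamp keeps the %F_%H%M shape the script already uses. The helper name mtime_stamp is hypothetical and not part of the reviewed script.

```shell
#!/bin/bash
# Hypothetical helper: derive the datestamp from the file's modification
# time rather than from the current time. Assumes GNU date (-r FILE).
mtime_stamp() {
    date -r "$1" '+%F_%H%M'
}
```

In the script this would stand in for the plain date '+%F_%H%M' call, for example string_to_append="$(mtime_stamp "$fullpath")". It only makes sense when the file exists, which a dry run on a made-up name does not guarantee.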

character class

more concisely?

 sed -E -e 's/(^-|([\.-]*)$)//g'

Well, within a character class there's no need to \ backwhack the . dot:

 sed -E -e 's/(^-|([.-]*)$)//g'

And it's unclear why you'd want that inner ( ... ) capturing group.
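As a sketch of what the expression looks like with both groups dropped: alternation has the lowest precedence in an extended regex, so no parentheses are needed at all. The function name trim_edges is hypothetical; the second expression is the OP's own "-." fix, unchanged.

```shell
#!/bin/bash
# Hypothetical helper: strip one leading hyphen and any trailing run of
# dots/hyphens, then turn "-." into "." (the OP's second expression).
trim_edges() {
    sed -E -e 's/^-|[.-]*$//g' -e 's/-\././g'
}

printf '%s\n' '-txt-.pdf..' | trim_edges   # prints: txt.pdf
```

For the names the script produces this should behave like the original; the only difference I am aware of is that the original bracket expression also matched a literal backslash, which tr -dc has already removed by that point.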

name minus extension

in data.csv, a good name for data?

I recommend stem for this line:

 base_name="${BASH_REMATCH[1]}"
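If the name stem is adopted, the split itself can also be written with parameter expansion instead of a regex. This is only a sketch; split_name is a hypothetical helper, and it follows the OP's convention that everything from the first dot onward counts as the extension.

```shell
#!/bin/bash
# Hypothetical helper: split a file name into stem and extension.
# Sets the global variables "stem" and "extension".
split_name() {
    local name="$1"
    stem="${name%%.*}"           # everything before the first dot
    extension="${name#"$stem"}"  # the remainder, leading dot included (may be empty)
}
```

With this, report.txt.pdf yields stem=report and extension=.txt.pdf, and a name without any dot yields an empty extension. Dotfiles would give an empty stem, but the script rejects those earlier.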

option synonyms

Should I rename the option --printonly to -d

Sure, go for it. Other utilities use that familiar option name, though often as -n. A -d at first glance looks like --debug.

Do be sure to offer --dry-run, as well.
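One way to offer the synonyms is a small case loop. The sketch below is an assumption about how it could look, not the OP's code: parse_options, dry_run and remaining are hypothetical names, and --printonly is kept as an alias so existing calls keep working.

```shell
#!/bin/bash
# Hypothetical option loop: accepts -n, --dry-run and the old --printonly.
# Sets "dry_run" and leaves the positional arguments in the array "remaining".
parse_options() {
    dry_run=false
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -n|--dry-run|--printonly) dry_run=true; shift ;;
            --) shift; break ;;                      # end of options
            -*) echo "Unknown option: $1" >&2; return 1 ;;
            *)  break ;;                             # first non-option argument
        esac
    done
    remaining=("$@")
}
```

The mainline would call parse_options "$@" and then work from "${remaining[@]}". The -- case allows file names that start with a hyphen.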

Also, consider setting LC_ALL yourself. I assume that [[:alnum:]] will sometimes match ß + umlaut vowels.
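An empty LC_ALL is normal: the variable is an override and is usually unset, with the effective locale coming from LANG or the individual LC_* variables (the locale command prints the effective values). The sketch below shows one way to pin the locale so that [[:alnum:]] means ASCII letters and digits only. The function name is hypothetical, and it relies on bash applying an assignment to LC_ALL immediately.

```shell
#!/bin/bash
# Hypothetical check: succeed only if the argument consists solely of
# ASCII letters and digits. The body runs in a subshell, so the locale
# change does not leak into the caller.
is_ascii_alnum() (
    LC_ALL=C
    [[ "$1" =~ ^[[:alnum:]]+$ ]]
)
```

With the locale pinned to C, a name such as Müller fails the check because the bytes of ü are not alphanumeric there; under a UTF-8 locale the same pattern may accept it.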

extra character match

I don't understand this "delete, complement" expression:

 tr -dc 'A-Za-z .0-9_-' |

There's a SPACE in the middle, after "z". But we already nuked all [[:space:]] characters when we turned them into - dash.

I'm reading this as enumerating "characters we wish to preserve", so the SPACE seems misleading.

Consider moving this tr into the "umlaut" sed invocation:

 tr '_' '-' |

The tr --squeeze is very nice.
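Putting the points of this section together, a condensed pipeline might look like the sketch below. It is an illustration under assumptions rather than a drop-in replacement: clean_name_sketch is a hypothetical name, the umlauts are mapped straight to lowercase, the SPACE is dropped from the keep-list, and the underscore handling is folded into the sed call.

```shell
#!/bin/bash
# Hypothetical condensed variant of clean_name.
clean_name_sketch() {
    printf '%s\n' "$1" |
        sed -E -e 's/Ä|ä/ae/g' -e 's/Ö|ö/oe/g' -e 's/Ü|ü/ue/g' -e 's/ß/ss/g' \
               -e 's/%+/-prozent-/g' \
               -e 's/[[:space:]_]+/-/g' |
        tr -dc 'A-Za-z.0-9-' |     # keep-list: letters, digits, dot, hyphen
        tr -s '-' |                # squeeze runs of hyphens
        sed -E -e 's/^-|[.-]*$//g' -e 's/-\././g' |
        tr '[:upper:]' '[:lower:]'
}
```

For example, clean_name_sketch 'Angebot 5%% Rabatt.pdf' gives angebot-5-prozent-rabatt.pdf. Capitalizing the first letter stays with the caller, as in the original.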

anonymous

 clean_base_name='wtf'"$extension"

Consider adopting the name prefix anon instead.

single responsibility

The mainline code

cracks argv,
computes a new name, and
either prints new name or interacts with filesystem

Consider breaking out helpers, which the mainline calls.

Also, we have a perfectly nice ! -e "$1" check for whether file exists, which appears too early in the flow, and therefore needs an extra $printonly conditional.

argument order

Accepting "filespec date" seems less convenient than accepting "date filespec".

The script seems to really want to have xargs invoke it:

find /some/where -type f -name '*.pdf' | xargs newname.sh date

And then filename(s) get tacked on the end of argv. As written, the OP code would need xargs -n 1, but bolting on a for loop is easy once you have a few helpers. Given some of the crazy whitespace characters you need to strip, find -print0 / xargs -0 might also be needed.
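A sketch of the "suffix first, files last" shape, with the for loop bolted on. Both function names are hypothetical, and process_one merely stands in for the real clean, append and rename logic.

```shell
#!/bin/bash
# Hypothetical stand-in for the per-file work (clean, append, rename).
process_one() {
    local suffix="$1" path="$2"
    printf 'would rename %s with suffix %s\n' "$path" "$suffix"
}

# Hypothetical mainline: first argument is the suffix, the rest are files.
rename_each() {
    if [ "$#" -lt 2 ]; then
        echo "Usage: rename_each string_to_append <file>..." >&2
        return 1
    fi
    local suffix="$1"
    shift
    local path
    for path in "$@"; do
        process_one "$suffix" "$path"
    done
}
```

If the script ended with rename_each "$@", the NUL-separated form would be find /some/where -type f -name '*.pdf' -print0 | xargs -0 newname.sh date, which keeps names containing spaces or newlines intact.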

Question 3

Thank you very much for the thorough review. As for the removal of the version numbers Angebot_produkt_kunde-2-2024-06-10_1248.pdf -> Angebot_produkt_kunde-2024-06-10_1248.pdf I find that GNU sed does not support positive lookahead, so I might settle for something like

echo Angebot_produkt_kunde-2-2024-06-10_1248.pdf | sed -E 's/(.*)(-[[:digit:]]+)(-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{4})/\1\3/'

I will definitely take your other suggestions into account as well and, in a few days, probably ask a new question for the rewritten version.
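Note that the pattern in this comment expects a hyphen between date and time, while the sample name has an underscore there, so the command as quoted leaves that input unchanged. A sketch that accepts either separator and needs only one capture group follows; strip_version is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: drop a "-N" version counter that sits directly
# in front of a YYYY-MM-DD datestamp followed by an HHMM time.
strip_version() {
    sed -E 's/-[0-9]+([-_][0-9]{4}-[0-9]{2}-[0-9]{2}[-_][0-9]{4})/\1/'
}

printf '%s\n' 'Angebot_produkt_kunde-2-2024-06-10_1248.pdf' | strip_version
# prints: Angebot_produkt_kunde-2024-06-10_1248.pdf
```

No lookahead is needed because the datestamp is captured and written back by the replacement.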

Question 4

I had in mind something simpler, less about regexes and more about filesystem tests. Match a "-2" or a "-\d" suffix, whatever. If we find foo.pdf and foo-2.pdf, then assume the "multiple download versions" scenario you mentioned, strip "-2", and replace it with date. I suppose that’s one level above the OP code, up at the level of iterating across all entries in a directory.
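The filesystem test described here could be sketched as follows. The function name is hypothetical, and the sketch only handles names with at most one extension.

```shell
#!/bin/bash
# Hypothetical check: treat "stem-N.ext" as a repeat download only when
# the unnumbered sibling "stem.ext" exists in the same directory.
is_repeat_download() {
    local path="$1" dir name
    dir="$(dirname -- "$path")"
    name="$(basename -- "$path")"
    if [[ "$name" =~ ^(.+)-[0-9]+(\.[^.]*)?$ ]]; then
        # BASH_REMATCH[1] is the stem without "-N", [2] the extension (may be empty)
        [ -e "$dir/${BASH_REMATCH[1]}${BASH_REMATCH[2]}" ]
    else
        return 1
    fi
}
```

A caller iterating over a directory could strip the counter only when this check succeeds and leave names such as Angebot-5.pdf alone when no Angebot.pdf is present.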

Question 5

Yes, though I am realising this is far from foolproof. Considering Angebot-5.pdf, it is unclear whether this is the fifth offer or the fifth revision of Angebot.pdf. Meanwhile, the requirements are changing in other ways I will have to consult my pillow on.

Question 6

Regarding transliteration of German characters, you could use iconv which is quite probably installed on your system. It is a convenient tool to normalize Unicode strings and convert from one charset to another.

So:

echo 'Müller-Maßnahmen' | iconv -f UTF-8 -t ASCII//TRANSLIT
Muller-Massnahmen

This is a simplistic example but it does the trick. There are many more options and iconv should be locale-aware as well (or even locale-dependent actually). As always, check the docs.

In fact the matter (Unicode transliteration) is quite difficult and the most reasonable approach is to use dedicated tools like iconv for the job. This will also make the code more flexible (not limited to one language subset).

In fact, this works in French too:

echo 'œuf de Pâques' | iconv -f UTF-8 -t ASCII//TRANSLIT
oeuf de Paques

with the usual disclaimer that "it works on my machine/locale".

Question 7

Kate, "use iconv!" is valuable advice so I gave it +1. But it's not quite right for ä, ö, ü. On a 1970's American typewriter the standard was to hit two keystrokes for those: ae, oe, ue. (And not mess around with backspace + " double-quote.) So when you pick up a box of Mueller's pasta, it's perfectly clear the surname is Müller. Iconv is doing reasonable things with the example characters you mention. But for those three German vowels, the OP sed transformation is what makes the most sense for the presented use case.
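The two suggestions combine naturally: apply the German-specific mapping first and let iconv handle whatever is left. This is a sketch under assumptions: transliterate is a hypothetical name, the input is taken to be UTF-8, and what iconv does with the remaining non-ASCII characters depends on the platform and locale.

```shell
#!/bin/bash
# Hypothetical filter: German umlauts and eszett by explicit rule,
# everything else via iconv's transliteration.
transliterate() {
    sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g' \
        -e 's/Ä/Ae/g' -e 's/Ö/Oe/g' -e 's/Ü/Ue/g' \
        -e 's/ß/ss/g' |
        iconv -f UTF-8 -t ASCII//TRANSLIT
}

printf '%s\n' 'Müller-Maßnahmen' | transliterate   # prints: Mueller-Massnahmen
```

Because the sed rules run first, the umlauts never reach iconv, so the German result does not depend on the locale.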

J_H · Accepted Answer · 2024-06-28 19:49:15Z
