\$\begingroup\$

If you have files with Cyrillic filenames (e.g. день) and pack them as a ZIP archive on Windows, and then unpack on Mac using the standard archive utility, the filenames are often in the wrong encoding. For example: бвгѓ•≠м

Here is a bash script that renames them to the correct ones:

function rename() {
    tr '†°Ґ£§•с¶І®©™Ђђ≠а-р' 'а-еёж-нр-яЁ' <<< "$1" | sed $'s/Г\xcc\x81/о/g;s/у\xcc\x81/п/g;s/ш\xcc\x86/щ/g'
}
function renamefile() {
    local new="$(rename "$2")"
    if [[ "$2" != "$new" ]]; then
        mv "$1/$2" "$1/$new"
        echo "$new"
    fi
}
function scan() {
    ls -1 "$1" | while read file; do
        if [ -d "$1/$file" ]; then
            scan "$1/$file"
        fi
        renamefile "$1" "$file"
    done
}
scan "${1-.}"

Usage:

<script> <dir_with_files_with_wrong_filenames>

However, some users complained:

You can't run it twice - the names will be corrupted again.

I threw the script into the Downloads directory, launched it, but for some reason it started renaming from the root directory instead - and corrupted filenames EVERYWHERE.

I then replaced

scan "${1-.}"

with

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" &> /dev/null && pwd)
scan "${1-${SCRIPT_DIR}}"

But I'm not sure whether this really fixes the second issue, or whether the script is safe enough in general. Could someone give it a thorough safety review?

asked Mar 1 at 14:47
\$\endgroup\$
  • \$\begingroup\$ Yikes. Can you show some more example filenames before and after, and also describe what encoding the filenames are assumed to be in every step of the way? Apparently OSX uses UTF-8. \$\endgroup\$ Commented Mar 1 at 15:14
  • \$\begingroup\$ ...and NTFS uses UTF-16. \$\endgroup\$ Commented Mar 1 at 15:16
  • \$\begingroup\$ Please specify what kind of archive it is. If it's a zip, it also seems to be UTF-8. \$\endgroup\$ Commented Mar 1 at 15:18
  • \$\begingroup\$ @Reinderien ZIP. macOS uses UTF-8, as far as I know. \$\endgroup\$ Commented Mar 1 at 15:45
  • \$\begingroup\$ You need more examples. As a wild guess informed by an exhaustive search of Python encodings, this looks potentially like the 'good string' encoded as CP866 and then mis-decoded as mac_cyrillic. \$\endgroup\$ Commented Mar 1 at 16:10

1 Answer

\$\begingroup\$

If you have files with Cyrillic filenames (e.g. день)

Specifically, filenames with Cyrillic characters; but that isn't all - your tr invocation shows that Latin characters are also affected.

and pack them as a ZIP archive on Windows, and then unpack on Mac

What version of Windows? Sold for what region? What archiving utility - WinZip itself? (It doesn't list explicit support for Cyrillic.) Mac sold for what region? The bug may be sensitive to these factors and you should understand more about what they are. You should also try zipping from another system (Linux?) to your Mac and from your Windows system to another system (Linux?) to see who's at fault for the mis-encoding.

You ask about general safety. No, I don't think that this script is safe enough. The problem is a classic instance of mojibake. First, you have an incomplete translation table in your call to tr; I would include the entire character set defined for the first byte. If this is an interaction between CP866 and Mac Cyrillic, which seems to be at least partially true, then both are single-byte character sets and the translation table won't have to be large. You should construct this table exhaustively by archiving long filenames with every printable character.
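If the CP866 / Mac Cyrillic guess does hold, the table need not be typed by hand. A minimal sketch that derives it from Python's `cp866` and `mac_cyrillic` codecs; the choice of those two codecs is an assumption based on the comments, and `build_table` is a hypothetical helper name:

```python
def build_table() -> dict[int, str]:
    """Map each mis-decoded character to the character that was intended.

    Assumption: the original bytes were CP866 and were wrongly decoded as
    Mac Cyrillic. Only the upper half (0x80-0xFF) can differ.
    """
    table = {}
    for byte in range(0x80, 0x100):
        raw = bytes([byte])
        wrong = raw.decode("mac_cyrillic")  # what ends up on the Mac
        right = raw.decode("cp866")         # what the archive meant
        if wrong != right:
            table[ord(wrong)] = right
    return table


TABLE = build_table()
print("§•≠м".translate(TABLE))  # prints: день
```

Applied to the sample from the question, `"бвгѓ•≠м".translate(TABLE)` gives `ступень`, which supports the guess for that one name but does not replace testing with every printable character. Decomposed names would have to be recomposed (NFC) before the lookup.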

I don't trust Bash to do this job. Python is better-equipped; it has extensive character set support including in codecs and path manipulation in pathlib. It can still run as a script with a shebang, and I consider it much more readable and flexible.
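As a rough illustration of what that could look like, here is a sketch built on `unicodedata` and `pathlib`. It assumes the CP866-read-as-Mac-Cyrillic theory from the comments; `fix_name`, `plan_renames` and `apply_renames` are hypothetical names, and the NFC step is inferred from the combining-accent handling in your sed call:

```python
import unicodedata
from pathlib import Path


def fix_name(name: str) -> str:
    """Undo CP866 bytes that were mis-decoded as Mac Cyrillic (an assumption)."""
    # The sed step in the question suggests names arrive decomposed (NFD),
    # so recompose first: "и" + breve must become the single character "й".
    composed = unicodedata.normalize("NFC", name)
    try:
        return composed.encode("mac_cyrillic").decode("cp866")
    except UnicodeEncodeError:
        # A character outside Mac Cyrillic: this name was not produced by
        # the assumed mis-decoding, so leave it alone.
        return name


def plan_renames(root: Path) -> list[tuple[Path, Path]]:
    """List (old, new) pairs, deepest paths first, without touching the disk."""
    paths = sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True)
    return [(p, p.with_name(fix_name(p.name))) for p in paths
            if fix_name(p.name) != p.name]


def apply_renames(root: Path, dry_run: bool = True) -> None:
    """Print the plan; rename only when dry_run is False and the target is free."""
    for old, new in plan_renames(root):
        print(f"{old} -> {new}")
        if not dry_run and not new.exists():
            old.rename(new)
```

The dry run is the default so that nothing is renamed until the printed plan has been read, and the directory is a required argument rather than something guessed from the working directory or the script location. On its own this does nothing about the run-twice problem: a name that is already correct Cyrillic still encodes cleanly and would be mangled.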

From

You can't run it twice - the names will be corrupted again.

you also have idempotence problems. About the best you can do here is to write a heuristic that checks every character of every filename in the directory; if any is in a Unicode category that seems like a symbol and is outside of the CP866 printable range (0x20-0xAF, 0xE0-0xF7), then assume translation is needed; otherwise do not perform translation.
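A sketch of that heuristic, with the caveat that the details are my own guesses: `looks_mangled` and `in_cp866_printable` are hypothetical names, and I count punctuation categories as well as symbol categories, because characters such as § and • are classed as punctuation in Unicode:

```python
import unicodedata


def in_cp866_printable(ch: str) -> bool:
    """True if ch encodes to a CP866 byte in 0x20-0xAF or 0xE0-0xF7."""
    try:
        byte = ch.encode("cp866")[0]
    except UnicodeEncodeError:
        return False
    return 0x20 <= byte <= 0xAF or 0xE0 <= byte <= 0xF7


def looks_mangled(names) -> bool:
    """Guess whether a directory listing still needs translation.

    Assumption: any symbol or punctuation character (Unicode category
    S* or P*) outside the CP866 printable range marks mis-decoded names.
    """
    for name in names:
        for ch in unicodedata.normalize("NFC", name):
            category = unicodedata.category(ch)
            if category[0] in "SP" and not in_cp866_printable(ch):
                return True
    return False


print(looks_mangled(["§•≠м.txt"]))               # prints: True
print(looks_mangled(["день.txt", "readme.md"]))  # prints: False
```

A mangled name made only of letters (`бвг.txt`, for instance) is not flagged by itself, which is why the check runs over the whole directory rather than per file. It will also misfire on legitimate names containing typographic quotes or dashes, so treat it as a guard, not a proof.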

You may also want to attempt reading the zip yourself with zipfile to see what the filenames look like prior to hitting the filesystem.
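Something along these lines would show what is stored in the archive. It relies on two things I believe are true of `zipfile`: bit 11 (0x800) of `ZipInfo.flag_bits` marks UTF-8 names, and names without that flag are decoded as CP437 by default, so encoding them back to CP437 recovers the raw bytes. `list_names` and the `cp866` default are assumptions for illustration:

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the stored name is UTF-8


def list_names(archive, legacy_encoding: str = "cp866"):
    """Yield (name, encoding_used) for every entry in the archive.

    archive may be a path or a binary file object. legacy_encoding is a
    guess at what the Windows archiver wrote when the UTF-8 flag is absent.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                yield info.filename, "utf-8"
            else:
                # zipfile decoded the raw bytes as CP437; reverse that
                # and decode again with the encoding we suspect.
                raw = info.filename.encode("cp437")
                yield raw.decode(legacy_encoding), legacy_encoding
```

If the names come out correct with `cp866` here, the archive itself is fine and the mis-decoding happens in the Mac's unpacker; in that case extracting with Python and the right encoding would avoid the rename pass entirely.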

answered Mar 1 at 17:40
\$\endgroup\$
