Re: bash 5.2.21-1: a bug in [0-9] expansion

2025-09-01 15:57:27 -0700

On 2025-09-01 15:23, Sam Edge via Cygwin wrote:
On 01/09/2025 18:19, Brian Inglis via Cygwin wrote:
 > On 2025-08-31 13:06, Mariusz Wodzicki via Cygwin wrote:
 >> Description of the problem.
 >> [0-9] also picks up certain Unicode superscript characters (namely ⁰ ⁴ ⁵ ⁶
 >> ⁷ ⁸ ⁹), and every Unicode subscript character.
 >>
 >> Example: the directory has the following files:
 >> $ /bin/ls
 >> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
 >> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
 >>
 >> $ /bin/ls [0-9].txt
 >> 0.txt 1.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
 >> 0.txt 2.txt 4.txt 5.txt 6.txt 7.txt 8.txt
 >>
 >> $ locale
 >> LANG=en_US.UTF-8
 >> LC_CTYPE="en_US.UTF-8"
 >> LC_NUMERIC="en_US.UTF-8"
 >> LC_TIME="en_US.UTF-8"
 >> LC_COLLATE="en_US.UTF-8"
 >> LC_MONETARY="en_US.UTF-8"
 >> LC_MESSAGES="en_US.UTF-8"
 >> LC_ALL=
 >>
 >> System.
 >> Fully up to date Windows 11
 >> cygwin 3.6.4-1
 >> bash  5.2.21-1
 >
 > For reproducible results, prefix commands with LC_ALL=C ... (or possibly just LC_COLLATE=C or LC_CTYPE=C, or =POSIX in place of =C) to standardize the locale. Otherwise many commands respect the current locale, and some handle certain Unicode characters regardless of locale, e.g. from `info wc`:
 >
 > "Unless the environment variable ‘POSIXLY_CORRECT’ is set, GNU ‘wc’ treats the following Unicode characters as white space even if the current locale does not: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK SPACE, and U+2060 WORD JOINER."
 >
 > For GNU packages where info pages are the preferred documentation, such as coreutils*, compilers and language processors, and tools packages, many details do not appear in the man pages, for example:
 >
 > "Full documentation <https://www.gnu.org/software/coreutils/wc> or available locally via: info '(coreutils) wc invocation'"
 >
 > although `info wc` shows the same page.
 >
 > -----
 > * [ arch b2sum base32 base64 basename cat chcon chgrp chmod chown chroot cksum comm cp csplit cut date dd df dir dircolors dirname du echo env expand expr factor false fmt fold gkill groups head hostid id install join link ln logname ls md5sum mkdir mkfifo mknod mktemp mv nice nl nohup nproc numfmt od paste pathchk pinky pr printenv printf ptx pwd readlink realpath rm rmdir runcon seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf sleep sort split stat stdbuf stty sum sync tac tail tee test timeout touch tr true truncate tsort tty uname unexpand uniq unlink users vdir wc who whoami yes
 >
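A minimal sketch of the LC_ALL=C advice applied to the glob itself, in case it helps anyone reproduce this. One caveat: in a command such as `LC_ALL=C /bin/ls [0-9].txt` the calling shell expands the pattern before ls starts, so the prefix only changes the environment ls runs in. The sketch therefore expands the pattern inside a child bash that starts in the C locale. The scratch directory and the variable name c_matches are made up for illustration, and only two of the non-ASCII names from the report are recreated, written as raw UTF-8 bytes so the script does not depend on the caller's locale.

```bash
#!/usr/bin/env bash
# Sketch: rebuild a small version of the test directory and glob it under LC_ALL=C.
set -eu

scratch=$(mktemp -d)              # hypothetical scratch directory
trap 'rm -rf "$scratch"' EXIT
cd "$scratch"

touch 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
touch $'\xe2\x81\xb4.txt'         # U+2074 SUPERSCRIPT FOUR
touch $'\xe2\x82\x80.txt'         # U+2080 SUBSCRIPT ZERO

# The child shell starts in the C locale and expands the pattern itself,
# comparing single bytes, so multibyte names cannot match [0-9].
c_matches=$(LC_ALL=C bash -c 'echo [0-9].txt')
echo "$c_matches"
```

Run this way, the child shell should list only the ten ASCII-named files. Replacing LC_ALL=C with the locale from the report is a quick way to compare the two expansions side by side.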
Bash is GNU but isn't part of coreutils as far as I know. Type 'man bash' and then read the 'Pattern Matching' section for its globbing behaviour.
Good point - must have needed brain food! ;^>
TL;DR For bash 5.2, using 'export LC_ALL=C.UTF-8' as Brian suggests or 'export LC_COLLATE=C.UTF-8' or 'shopt -s globasciiranges' should revert to simple ASCII ranges for '[0-9]', '[a-z]' etc. I'm seeing the correct behaviour with up-to-date Cygwin bash/coreutils etc. by the way. 'echo [0-9]*' only expands out sub/super-digits if I use 'LC_COLLATE=en_GB.UTF-8' or similar with 'shopt -u globasciiranges'.
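Since the results in this thread differ between machines, it may help to record which settings a given shell is actually using before comparing output. The following is only a sketch: the function name report_glob_settings is hypothetical, and it reports the variables and the shopt state without claiming anything about how a particular bash build treats code points outside ASCII.

```bash
#!/usr/bin/env bash
# Sketch: report the two settings discussed above, then enable globasciiranges.

report_glob_settings() {
  # LC_ALL overrides LC_COLLATE, which overrides LANG.
  local collate=${LC_ALL:-${LC_COLLATE:-${LANG:-unset}}}
  local ranges=off
  # shopt -q is silent and returns success only when the option is enabled.
  shopt -q globasciiranges && ranges=on
  echo "collation locale: $collate, globasciiranges: $ranges"
}

report_glob_settings       # state the shell started with
shopt -s globasciiranges   # request C-locale ordering for bracket ranges
report_glob_settings       # state after the change
```

Pasting the two report lines alongside the output of `echo [0-9].txt` makes it easier to tell whether two people are really running the same configuration.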
What I find interesting is that the low-code superscripts ¹ (\ub9), ² (\ub2), and ³ (\ub3) are not matched, nor is ⁹ (\u2079), except by the higher ranges; that the wider range excludes more values; and that the class [:digit:] and the equivalence classes [=0=] etc. match nothing:
$ echo ?.txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2070'-$'\u2079'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2080'-$'\u2089'].txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [$'\u2070'-$'\u2089'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [0-9].txt
0.txt 0.txt 1.txt 2.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 8.txt 8.txt
$ echo [$'\u00b2'-$'\u00b9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'-$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'$'\ub3'$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [[=0=][=1=][=2=][=3=]].txt
[[=0=][=1=][=2=][=3=]].txt
$ echo [[:digit:]].txt
[[:digit:]].txt
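Because the ASCII, superscript, and subscript names above can be hard to tell apart once a terminal or mail archive has rendered them, a small helper that prints each matched name next to its bytes may be useful. This is a sketch only; the helper name hex_of is hypothetical, and it relies on nothing beyond printf, od, and tr.

```bash
#!/usr/bin/env bash
# Sketch: print each name next to its bytes in hex so look-alike digits can be told apart.

hex_of() {
  # od prints the bytes as space-separated hex; tr strips the spaces and newlines.
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Each argument is printed as "<hex bytes>  <name>".
for name in "$@"; do
  printf '%s  %s\n' "$(hex_of "$name")" "$name"
done
```

Saved as a script (the file name is up to you) and run with a pattern such as `[0-9].txt` as its argument, it shows a one-byte prefix 30 to 39 for ASCII digits, c2 b9, c2 b2, or c2 b3 for the Latin-1 superscripts, and three-byte prefixes beginning e2 81 or e2 82 for the U+207x superscripts and U+208x subscripts.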
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte                    Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter   not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher   but when there is no more to cut
 -- Antoine de Saint-Exupéry
--
Problem reports: https://cygwin.com/problems.html
FAQ: https://cygwin.com/faq/
Documentation: https://cygwin.com/docs.html
Unsubscribe info: https://cygwin.com/ml/#unsubscribe-simple
