Re: bash 5.2.21-1: a bug in [0-9] expansion


On 2025-09-01 15:23, Sam Edge via Cygwin wrote:
On 01/09/2025 18:19, Brian Inglis via Cygwin wrote:
 > On 2025-08-31 13:06, Mariusz Wodzicki via Cygwin wrote:
 >> Description of the problem.
 >> [0-9] picks also certain Unicode superscript characters ( namely, 0 4 5 6
 >> 7 8 9 ), and every Unicode subscript character.
 >>
 >> Example: the directory has the following files:
 >> $ /bin/ls
 >> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
 >> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
 >>
 >> $ /bin/ls [0-9].txt
 >> 0.txt 1.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
 >> 0.txt 2.txt 4.txt 5.txt 6.txt 7.txt 8.txt
 >>
 >> $ locale
 >> LANG=en_US.UTF-8
 >> LC_CTYPE="en_US.UTF-8"
 >> LC_NUMERIC="en_US.UTF-8"
 >> LC_TIME="en_US.UTF-8"
 >> LC_COLLATE="en_US.UTF-8"
 >> LC_MONETARY="en_US.UTF-8"
 >> LC_MESSAGES="en_US.UTF-8"
 >> LC_ALL=
 >>
 >> System.
 >> Fully up to date Windows 11
 >> cygwin 3.6.4-1
 >> bash  5.2.21-1
 >
 > For reproducible results, prefix commands with LC_ALL=C ... (or possibly just
 > LC_COLLATE=C or LC_CTYPE=C, or =POSIX) to standardize the locale. Otherwise many
 > commands will respect the current locale, and some respect Unicode regardless
 > of locale, e.g. `info wc`:
 >
 > "Unless the environment variable ‘POSIXLY_CORRECT’ is set, GNU ‘wc’ treats
 > the following Unicode characters as white space even if the current locale does
 > not: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK SPACE,
 > and U+2060 WORD JOINER."
 >
 > For GNU utilities, where info pages are preferred, such as coreutils*,
 > compiler and language processors, and tools packages, many details do not appear
 > in the man pages, for example:
 >
 > "Full documentation <https://www.gnu.org/software/coreutils/wc> or available
 > locally via: info '(coreutils) wc invocation'"
 >
 > although `info wc` shows the same page.
 >
 > —————
 > * [ arch b2sum base32 base64 basename cat chcon chgrp chmod chown chroot
 > cksum comm cp csplit cut date dd df dir dircolors dirname du echo env expand
 > expr factor false fmt fold gkill groups head hostid id install join link ln
 > logname ls md5sum mkdir mkfifo mknod mktemp mv nice nl nohup nproc numfmt od
 > paste pathchk pinky pr printenv printf ptx pwd readlink realpath rm rmdir runcon
 > seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf sleep sort split
 > stat stdbuf stty sum sync tac tail tee test timeout touch tr true truncate tsort
 > tty uname unexpand uniq unlink users vdir wc who whoami yes
 >
Bash is GNU but isn't part of coreutils as far as I know. Type 'man bash' and 
then read the 'Pattern Matching' section for its globbing behaviour.
Good point - must have needed brain food! ;^>
TL;DR For bash 5.2, using 'export LC_ALL=C.UTF-8' as Brian suggests or 'export 
LC_COLLATE=C.UTF-8' or 'shopt -s globasciiranges' should revert to simple ASCII 
ranges for '[0-9]', '[a-z]' etc.
I'm seeing the correct behaviour with up-to-date Cygwin bash/coreutils etc. by 
the way. 'echo [0-9]*' only expands out sub/super-digits if I use 
'LC_COLLATE=en_GB.UTF-8' or similar with 'shopt -u globasciiranges'.
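A minimal sketch of the two settings mentioned above (not taken from the thread): it creates a few test files in a scratch directory and expands '[0-9].txt' once in a child bash started with LC_ALL=C and once in a child bash started with '-O globasciiranges'. The scratch directory, the sample file names and the variable names are made up for illustration, and a bash that has the globasciiranges option (such as the 5.2 discussed here) is assumed. Child shells are used on the assumption that the glob is expanded by the shell itself, so the setting has to be in effect in the shell doing the expansion and not only in the environment of the command being run.

```shell
#!/usr/bin/env bash
# Scratch directory with ASCII, superscript and subscript digit file names.
demo_dir=$(mktemp -d)

# Byte escapes keep the names independent of the current locale:
# \xe2\x81\xb5 is U+2075 (superscript five) in UTF-8,
# \xe2\x82\x85 is U+2085 (subscript five) in UTF-8.
touch "$demo_dir/0.txt" "$demo_dir/5.txt" "$demo_dir/9.txt" \
      "$demo_dir/"$'\xe2\x81\xb5.txt' "$demo_dir/"$'\xe2\x82\x85.txt'

# Expand the pattern in a child bash that starts in the C locale.
c_locale_matches=$(cd "$demo_dir" && LC_ALL=C bash -c 'echo [0-9].txt')

# Expand it in a child bash that keeps the inherited locale but has
# globasciiranges switched on at startup.
ascii_range_matches=$(cd "$demo_dir" && bash -O globasciiranges -c 'echo [0-9].txt')

echo "LC_ALL=C:        $c_locale_matches"
echo "globasciiranges: $ascii_range_matches"

rm -rf "$demo_dir"
```

If either line lists more than the three ASCII-named files, the range is still being interpreted through the locale's collation order.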
What I find interesting is that the superscript low codes ¹ \ub9, ² \ub2, ³ \ub3
are not matched, nor is ⁹ \u2079, except by higher ranges, while the wider range
excludes more values, and the classes [:digit:] and equivalences [=0=] do nothing:
$ echo ?.txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 
6.txt 7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2070'-$'\u2079'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 
7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2080'-$'\u2089'].txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 
6.txt 7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [$'\u2070'-$'\u2089'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 
7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [0-9].txt
0.txt 0.txt 1.txt 2.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt 
8.txt 8.txt
$ echo [$'\u00b2'-$'\u00b9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'-$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'$'\ub3'$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [[=0=][=1=][=2=][=3=]].txt
[[=0=][=1=][=2=][=3=]].txt
$ echo [[:digit:]].txt
[[:digit:]].txt
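For comparison, a small sketch (again with made-up file names and variable names) of what '[[:digit:]]' does when ASCII digit names sit alongside superscript and subscript ones. It assumes that the 'digit' class covers only the ASCII digits 0 to 9, which is how POSIX defines it; under that assumption, a pattern echoed back literally would be consistent with a directory that contains no ASCII digit names at all.

```shell
#!/usr/bin/env bash
# Scratch directory with ASCII, superscript and subscript digit file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/0.txt" "$demo_dir/5.txt" "$demo_dir/9.txt" \
      "$demo_dir/"$'\xe2\x81\xb5.txt' "$demo_dir/"$'\xe2\x82\x85.txt'

# A character class instead of a range: membership is decided by the
# character's type, not by where it falls in the collation order.
digit_class_matches=$(cd "$demo_dir" && echo [[:digit:]].txt)

# All one-character names, for reference. This result depends on the locale,
# because the non-ASCII names are single characters only in a UTF-8 locale.
any_char_matches=$(cd "$demo_dir" && echo ?.txt)

echo "[[:digit:]].txt -> $digit_class_matches"
echo "?.txt           -> $any_char_matches"

rm -rf "$demo_dir"
```

The class form may be a reasonable way to get "ASCII digits only" without touching the locale or shell options.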
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher but when there is no more to cut
 -- Antoine de Saint-Exupéry
