On 2025-09-01 15:23, Sam Edge via Cygwin wrote:
On 01/09/2025 18:19, Brian Inglis via Cygwin wrote:
> On 2025-08-31 13:06, Mariusz Wodzicki via Cygwin wrote:
>> Description of the problem.
>> [0-9] also matches certain Unicode superscript characters (namely, 0 4 5 6
>> 7 8 9) and every Unicode subscript character.
>>
>> Example: the directory has the following files:
>> $ /bin/ls
>> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
>> 0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
>>
>> $ /bin/ls [0-9].txt
>> 0.txt 1.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt
>> 0.txt 2.txt 4.txt 5.txt 6.txt 7.txt 8.txt
>>
>> $ locale
>> LANG=en_US.UTF-8
>> LC_CTYPE="en_US.UTF-8"
>> LC_NUMERIC="en_US.UTF-8"
>> LC_TIME="en_US.UTF-8"
>> LC_COLLATE="en_US.UTF-8"
>> LC_MONETARY="en_US.UTF-8"
>> LC_MESSAGES="en_US.UTF-8"
>> LC_ALL=
>>
>> System.
>> Fully up to date Windows 11
>> cygwin 3.6.4-1
>> bash 5.2.21-1
>
> For reproducible results, prefix commands with LC_ALL=C ... or possibly just
> LC_COLLATE=C or LC_CTYPE=C (or =POSIX) to standardize the locale; otherwise
> many commands will respect the current locale, and some respect Unicode
> regardless of locale, e.g. `info wc`:
>
> "Unless the environment variable ‘POSIXLY_CORRECT’ is set, GNU ‘wc’ treats
> the following Unicode characters as white space even if the current locale
> does not: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK
> SPACE, and U+2060 WORD JOINER."
>
> For GNU utilities where info pages are preferred, such as coreutils*,
> compiler and language processors, and tools packages, many details do not
> appear in the man pages, for example:
>
> "Full documentation <https://www.gnu.org/software/coreutils/wc> or available
> locally via: info '(coreutils) wc invocation'"
>
> although `info wc` shows the same page.
>
> —————
> * [ arch b2sum base32 base64 basename cat chcon chgrp chmod chown chroot
> cksum comm cp csplit cut date dd df dir dircolors dirname du echo env expand
> expr factor false fmt fold gkill groups head hostid id install join link ln
> logname ls md5sum mkdir mkfifo mknod mktemp mv nice nl nohup nproc numfmt od
> paste pathchk pinky pr printenv printf ptx pwd readlink realpath rm rmdir runcon
> seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf sleep sort split
> stat stdbuf stty sum sync tac tail tee test timeout touch tr true truncate tsort
> tty uname unexpand uniq unlink users vdir wc who whoami yes
>
Bash is GNU but isn't part of coreutils as far as I know. Type 'man bash' and
then read the 'Pattern Matching' section for its globbing behaviour.

Good point - must have needed brain food! ;^>

TL;DR For bash 5.2, using 'export LC_ALL=C.UTF-8' as Brian suggests or 'export
LC_COLLATE=C.UTF-8' or 'shopt -s globasciiranges' should revert to simple ASCII
ranges for '[0-9]', '[a-z]' etc.

I'm seeing the correct behaviour with up-to-date Cygwin bash/coreutils etc. by
the way. 'echo [0-9]*' only expands out sub/super-digits if I use
'LC_COLLATE=en_GB.UTF-8' or similar with 'shopt -u globasciiranges'.

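
A minimal sketch of the locale suggestion above, assuming a scratch directory
from mktemp and two illustrative file names (superscript zero and subscript
zero) that are not taken from the listings in this thread. It expands
[0-9].txt in a child bash with LC_ALL=C and counts the matches:

```shell
#!/usr/bin/env bash
# Sketch only: the scratch directory and file set are illustrative assumptions.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Ten files named after the ASCII digits 0-9.
for d in 0 1 2 3 4 5 6 7 8 9; do : > "$d.txt"; done

# One superscript and one subscript digit, written as UTF-8 bytes so the
# file names do not depend on the locale of the calling shell.
sup0=$(printf '\342\201\260')   # U+2070 SUPERSCRIPT ZERO
sub0=$(printf '\342\202\200')   # U+2080 SUBSCRIPT ZERO
: > "$sup0.txt"
: > "$sub0.txt"

# Expand the glob in a child bash whose locale is forced to C.
c_matches=$(LC_ALL=C bash -c 'printf "%s\n" [0-9].txt')
c_count=$(( $(printf '%s\n' "$c_matches" | wc -l) ))
echo "LC_ALL=C: $c_count names match [0-9].txt"

# Remove the scratch directory again.
cd / && rm -rf "$dir"
```

With LC_ALL=C every character is a single byte, so only the ten ASCII names
should match. Repeating the glob under a UTF-8 locale with
'shopt -u globasciiranges' is where the extra names would be expected to show
up, depending on that locale's collation order.
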
What I find interesting is that the superscripts with low code points, 1 (\ub9),
2 (\ub2), and 3 (\ub3), are not matched, nor is 9 (\u2079), except by higher
ranges, while the wider range excludes more values, and the class [:digit:] and
equivalence classes such as [=0=] match nothing:
$ echo ?.txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt
6.txt 7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2070'-$'\u2079'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt
7.txt 7.txt 8.txt 8.txt 9.txt 9.txt
$ echo [$'\u2080'-$'\u2089'].txt
0.txt 0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt
6.txt 7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [$'\u2070'-$'\u2089'].txt
0.txt 1.txt 1.txt 2.txt 2.txt 3.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt
7.txt 7.txt 8.txt 8.txt 9.txt
$ echo [0-9].txt
0.txt 0.txt 1.txt 2.txt 3.txt 4.txt 4.txt 5.txt 5.txt 6.txt 6.txt 7.txt 7.txt
8.txt 8.txt
$ echo [$'\u00b2'-$'\u00b9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'-$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [$'\ub2'$'\ub3'$'\ub9'].txt
1.txt 2.txt 3.txt
$ echo [[=0=][=1=][=2=][=3=]].txt
[[=0=][=1=][=2=][=3=]].txt
$ echo [[:digit:]].txt
[[:digit:]].txt
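
As a rough cross-check of the [[:digit:]] result, the sketch below tests single
characters against the class with a case pattern. The function name
matches_digit_class is made up for illustration, and the script forces
LC_ALL=C. As far as I know POSIX limits the digit class to 0-9 in every
locale, so the superscript should not match elsewhere either, but that is not
exercised here.

```shell
#!/usr/bin/env bash
# Sketch only: matches_digit_class is a hypothetical helper, not from the thread.
LC_ALL=C
export LC_ALL

# Return 0 if the argument is exactly one character in [[:digit:]].
matches_digit_class() {
  case $1 in
    [[:digit:]]) return 0 ;;
    *) return 1 ;;
  esac
}

sup0=$(printf '\342\201\260')   # U+2070 SUPERSCRIPT ZERO as UTF-8 bytes

for c in 5 "$sup0"; do
  if matches_digit_class "$c"; then
    echo "'$c' matches [[:digit:]]"
  else
    echo "'$c' does not match [[:digit:]]"
  fi
done
```

If the class only ever accepts ASCII digits, a directory holding nothing but
superscript and subscript names would leave [[:digit:]].txt unexpanded, which
is what bash does with a glob that matches no file.
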
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
-- Antoine de Saint-Exupéry