Package: grep;
To reply to this bug, email your comments to 79702 AT debbugs.gnu.org.
bug-grep <at> gnu.org:bug#79702; Package grep.
(2025-10-26 13:54:02 GMT) Message #5 received at submit <at> debbugs.gnu.org:
From: Dave <dj.2dixx <at> googlemail.com>
To: bug-grep <at> gnu.org
Subject: request: flag for visually identical but different unicode characters
Date: 2025-10-26 11:00:28 +0330
Today, I realized that there are characters which are visually identical yet have different Unicode code points, so they can't be matched in grep.

Example #1: احمدی

Example #2: احمدى

The final letter looks exactly the same in both examples, yet the first one is U+06CC and the second one is U+0649.

From the user's perspective, it's impossible to tell which code point the word is using. In fact, these two words, even though they come from different languages/keyboards, match perfectly on the other letters, and only the ی/ى escapes the match.

While not as important, this letter has other variants like ي (notice the two dots below it, think of an umlaut), corresponding to U+064A. If you press Ctrl + F in your browser, you'd notice that you can match U+064A against U+0649, but this is not the default behavior in grep either.

I understand there's no straightforward solution for this, so I'm thinking of an extra flag which converts all visually similar characters to the same code point and then looks for matches. Thoughts?
(2025-10-26 18:42:02 GMT) Message #8 received at submit <at> debbugs.gnu.org:
From: Collin Funk <collin.funk1 <at> gmail.com>
To: Dave via Bug reports for GNU grep <bug-grep <at> gnu.org>
Cc: Dave <dj.2dixx <at> googlemail.com>, 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: request: flag for visually identical but different unicode characters
Date: 2025-10-26 11:40:55 -0700
Hi Dave,

Dave via Bug reports for GNU grep <bug-grep <at> gnu.org> writes:

> Today, I realized that there are characters which are visually
> identical, yet have different unicodes, thus they can't be matched in
> grep.

A bit different from your example, but in some cases you can encode the same character in multiple ways. The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:

* Normalized (NFC): U+00E1
* Unnormalized (decomposed): U+0061 U+0301

> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples are exactly the same, yet the first one is
> U+06CC, and second one U+0649.
>
> From the user's perspective, it's impossible to realize which unicode
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and only it's ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with U+0649 one. but this is not the default behavior in grep either.

What browser does that? Firefox and Chrome on my machine don't match the other character.

Collin
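The two encodings of á listed above can be checked with Python's standard unicodedata module. This is only a minimal sketch: the helper name same_after_nfc is made up for illustration, and the last line shows that normalization alone does not unify the two look-alike letters from the original report.

```python
import unicodedata

composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

def same_after_nfc(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

print(composed == decomposed)                # False: different code point sequences
print(same_after_nfc(composed, decomposed))  # True: canonically equivalent
print(same_after_nfc("\u06cc", "\u0649"))    # False: distinct letters, not equivalent
```

So canonical normalization handles the á case but would not by itself satisfy the request in this report.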
(2025-10-26 19:47:02 GMT) Message #14 received at 79702 <at> debbugs.gnu.org:
From: arnold <at> skeeve.com
To: dj.2dixx <at> googlemail.com, 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: request: flag for visually identical but different unicode characters
Date: 2025-10-26 13:46:48 -0600
Isn't this what equivalence classes (like [[=e=]]) are supposed to solve? Can grep even use them?

Arnold

Dave via Bug reports for GNU grep <bug-grep <at> gnu.org> wrote:

> Today, I realized that there are characters which are visually
> identical, yet have different unicodes, thus they can't be matched in
> grep.
>
> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples are exactly the same, yet the first one is
> U+06CC, and second one U+0649.
>
> From the user's perspective, it's impossible to realize which unicode
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and only it's ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with U+0649 one. but this is not the default behavior in grep either.
>
> I understand there's no straightforward solution for this, so I'm
> thinking of having an extra flag which converts all visually similar
> characters to the same unicode and then looks for matches. Thoughts?
(2025-10-26 22:09:02 GMT) Message #17 received at submit <at> debbugs.gnu.org:
From: "David G. Pickett" <dgpickett <at> aol.com>
To: "bug-grep <at> gnu.org" <bug-grep <at> gnu.org>
Subject: Fw: bug#79702: request: flag for visually identical but different unicode characters
Date: 2025-10-26 22:08:03 +0000 (UTC)
----- Forwarded Message -----
From: David G. Pickett <dgpickett <at> aol.com>
To: Dave <dj.2dixx <at> googlemail.com>
Sent: Sunday, October 26, 2025 at 06:07:02 PM EDT
Subject: Re: bug#79702: request: flag for visually identical but different unicode characters

Even before hackers were using Cyrillic/Roman lookalikes for fake URLs (e.g., chase.com with a Cyrillic a), I recall Sybase offering string indexes that were insensitive both to case and to Nordic diacritics in ISO-8859-1, such as 'A' with an umlaut ('Ä'), so this is not a new idea! I am not sure of the utility in practical terms. Who gets to identify the look-alikes?

On Sunday, October 26, 2025 at 09:54:42 AM EDT, Dave via Bug reports for GNU grep <bug-grep <at> gnu.org> wrote:

Today, I realized that there are characters which are visually identical, yet have different unicodes, thus they can't be matched in grep.

Example #1: احمدی

Example #2: احمدى

The ى in both examples are exactly the same, yet the first one is U+06CC, and second one U+0649.

From the user's perspective, it's impossible to realize which unicode the word is using. In fact, these two words, even though they are from different languages/keyboards, match perfectly on the other letters, and only it's ی/ى that escapes the match.

While not as important, this letter has other variants like ي (notice two dots below it, think an umlaut) corresponding to U+064A. If you press Ctrl + F on your browser, you'd notice that you can match U+064A with U+0649 one. but this is not the default behavior in grep either.

I understand there's no straightforward solution for this, so I'm thinking of having an extra flag which converts all visually similar characters to the same unicode and then looks for matches. Thoughts?
(2025-10-26 22:37:02 GMT) Message #20 received at 79702 <at> debbugs.gnu.org:
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "David G. Pickett" <dgpickett <at> aol.com>
Cc: 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: Fw: bug#79702: request: flag for visually identical but different unicode characters
Date: 2025-10-26 15:36:42 -0700
On 2025-10-26 15:08, David G. Pickett via Bug reports for GNU grep wrote:

> Who gets to identify the look-alikes?

The Unicode Consortium has done this, and as is usual with characters, it's complicated. See:

https://www.unicode.org/reports/tr39/#Confusable_Detection
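As a rough illustration of the confusable-detection idea in that report, here is a simplified sketch of a "skeleton" comparison: decompose, map each character to a prototype, decompose again, then compare. The one-entry table SAMPLE_PROTOTYPES and the function names are hypothetical and hand-written for this example; the authoritative mappings live in the Consortium's confusables data, and the real algorithm has more steps than shown here.

```python
import unicodedata

# Hypothetical, hand-written sample table. Real mappings come from the
# confusables data published with the Unicode report linked above.
SAMPLE_PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
}

def skeleton(text: str, prototypes=SAMPLE_PROTOTYPES) -> str:
    """Simplified skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(prototypes.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable(left: str, right: str) -> bool:
    """Two strings are treated as confusable if their skeletons match."""
    return skeleton(left) == skeleton(right)

spoofed = "ch\u0430se.com"  # the example from this thread, with a Cyrillic a
print(spoofed == "chase.com")            # False
print(confusable(spoofed, "chase.com"))  # True
```

The table is the hard part: which characters count as look-alikes is a data question, which is why the thread points to the Consortium rather than to grep.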
(2025-11-06 17:15:03 GMT) Message #23 received at 79702 <at> debbugs.gnu.org:
From: "Dale R. Worley" <Dale.Worley <at> comcast.net>
To: 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: Fw: bug#79702: request: flag for visually identical but different unicode characters
Date: 2025-11-06 12:14:32 -0500
Paul Eggert <eggert <at> cs.ucla.edu> writes:

>> Who gets to identify the look-alikes?
>
> The Unicode Consortium has done this, and as is usual with characters,
> it's complicated. See:
>
> https://www.unicode.org/reports/tr39/#Confusable_Detection

ISTM that trying to incorporate this functionality into grep would be an endless maintenance chore. Probably better is to have a separate utility (project) that "canonicalizes" each confusable character into one standard form. Then you can use grep to do the search. If I've got all the shell constructions right, the one-line form would be:

    $ grep "$( canonicalize -opts <<<'pattern' )" \
        <(canonicalize -opts file) <(canonicalize -opts file) ...

Since surely there are variations on canonicalization, I've shown "-opts". Also the "<<<" construction adds a newline at the end, so you need an option to canonicalize to remove the final newline.

Dale
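For what it's worth, the canonicalize-then-search idea can be sketched in a few lines of Python. No such canonicalize utility exists as far as this thread shows; the function names and the folding table below are assumptions for illustration, covering only the three yeh variants named in the original report (U+0649 and U+064A folded to U+06CC).

```python
# Assumed folding table: only the yeh variants mentioned in the report.
# A real tool would need a much larger, carefully curated mapping.
FOLD = str.maketrans({
    "\u0649": "\u06cc",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
})

def canonicalize(text: str) -> str:
    """Fold each listed look-alike to one standard code point."""
    return text.translate(FOLD)

def grep_folded(pattern: str, lines):
    """Return the original lines whose folded form contains the folded pattern."""
    needle = canonicalize(pattern)
    return [line for line in lines if needle in canonicalize(line)]

persian = "\u0627\u062d\u0645\u062f\u06cc"  # Example #1, ends in U+06CC
arabic = "\u0627\u062d\u0645\u062f\u0649"   # Example #2, ends in U+0649

print(arabic in persian)                                  # False: plain match fails
print(grep_folded(arabic, [persian, "unrelated line"]))   # one hit: the Persian line
```

Unlike the shell one-liner, this returns the matching lines in their original form rather than the canonicalized text, which may or may not be what a user of such a flag expects.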