GNU bug report logs - #37754
wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)


Package: grep;

Reported by: "Trent W. Buck" <trentbuck <at> gmail.com>

Date: 15 Oct 2019 01:49:01 UTC

Severity: wishlist

Found in version 3.3-1

To reply to this bug, email your comments to 37754 AT debbugs.gnu.org.



Report forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (15 Oct 2019 01:49:01 GMT)

Acknowledgement sent to "Trent W. Buck" <trentbuck <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (15 Oct 2019 01:49:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: submit <at> debbugs.gnu.org
Subject: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 15 Oct 2019 12:48:17 +1100
Package: grep
Version: 3.3-1
Severity: wishlist
This bug was originally reported as
https://bugs.debian.org/940464
Trent W. Buck wrote:
> (Surely someone has already asked for this, but I can't see where.
> I may have already reported this myself, and forgotten.
> If so, sorry!)
>
> Right now if you do
>
> grep -eX -eY -eZ
>
> You'll get lines that match *any of* X, Y, or Z.
> Quite often I want to search for lines that match *all of* X, Y, and Z — but in any order.
> For example,
>
> # all 2TB 2.5-inch SATA products
> grep -Fwi -eSATA -e2TB -e2.5in products.csv
>
> Below is a short discussion of the workarounds I know about.
>
> Is "grep --and" something that has already been discussed and rejected?
> I looked through debbugs.gnu.org and the source tarball, but
> I couldn't find anything about this.
>
>
> PS: grep -v --and would intuitively mean "not all",
> i.e. "grep -v --and -eX -eY" would return lines matching X *or* Y, but
> omit lines matching *both* X and Y.
>
> PS: I can't decide if "--and" or "--intersection" is a better name.
> I put both in the bug subject so people searching for either will find this ticket.
> I think "--all" is probably too confusing.
>
>
>
> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F. For example,
>
> grep --and -eX -eY -eZ input*.txt # becomes
>
> grep -eZ.*Y.*X \
> -eZ.*X.*Y \
> -eY.*Z.*X \
> -eY.*X.*Z \
> -eX.*Z.*Y \
> -eX.*Y.*Z \
> input*.txt
>
>
> Workaround #2
> =============
> I can pipe greps together. This is what I currently do.
> This is more convenient and feels faster than workaround #1, but
> I suspect the inter-process overhead is significant.
>
> If grep implemented this internally, it could zero-copy.
> Being able to "grep -rnH --and" &c would also be convenient.
>
> For example,
>
> grep --and -F -eX -eY -eZ input*.txt # becomes
>
> cat input*.txt |
> grep -F -eX |
> grep -F -eY |
> grep -F -eZ
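
Workaround #2 generalises to any number of patterns. The following is a minimal sketch only, assuming a hypothetical shell function named grep_all; it is not a grep option and nothing here is taken from the grep sources. It chains one "grep -F" per argument, so a line is printed only if it contains every string.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): print the lines of stdin that
# contain every one of the given fixed strings, in any order.
grep_all() {
    if [ "$#" -eq 0 ]; then
        cat                      # no patterns left: pass everything through
    else
        pattern=$1
        shift
        # Filter on the first pattern, then recurse on the remaining ones.
        grep -F -e "$pattern" | grep_all "$@"
    fi
}

# Demo: only the first two lines contain X, Y and Z.
printf '%s\n' 'X Y Z' 'Z then Y then X' 'only X and Y' |
    grep_all X Y Z
```

Each stage is still a separate process, so the inter-process cost discussed above applies unchanged; the function only saves typing.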

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (16 Oct 2019 12:27:02 GMT)

Message #8 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 16 Oct 2019 21:26:21 +0900
On 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:
> Package: grep
> Version: 3.3-1
> Severity: wishlist
> 
> This bug was originally reported as
> https://bugs.debian.org/940464
> 
> Trent W. Buck wrote:
> > (Surely someone has already asked for this, but I can't see where.
> > I may have already reported this myself, and forgotten.
> > If so, sorry!)
> >
> > Right now if you do
> >
> > grep -eX -eY -eZ
> >
> > You'll get lines that match *any of* X, Y, or Z.
> > Quite often I want to search for lines that match *all of* X, Y, and Z - but in any order.
> > For example,
> >
> > # all 2TB 2.5-inch SATA products
> > grep -Fwi -eSATA -e2TB -e2.5in products.csv
> >
> > Below is a short discussion of the workarounds I know about.
> >
> > Is "grep --and" something that has already been discussed and rejected?
> > I looked through debbugs.gnu.org and the source tarball, but
> > I couldn't find anything about this.
> >
> >
> > PS: grep -v --and would intuitively mean "not all",
> > i.e. "grep -v --and -eX -eY" would return lines matching X *or* Y, but
> > omit lines matching *both* X and Y.
> >
> > PS: I can't decide if "--and" or "--intersection" is a better name.
> > I put both in the bug subject so people searching for either will find this ticket.
> > I think "--all" is probably too confusing.
> >
> >
> >
> > Workaround #1
> > =============
> > I can work around this by listing every possible order, but 1) this
> > scales poorly with the number of patterns; and 2) it can't be used
> > with -F. For example,
> >
> > grep --and -eX -eY -eZ input*.txt # becomes
> >
> > grep -eZ.*Y.*X \
> > -eZ.*X.*Y \
> > -eY.*Z.*X \
> > -eY.*X.*Z \
> > -eX.*Z.*Y \
> > -eX.*Y.*Z \
> > input*.txt
> >
> >
> > Workaround #2
> > =============
> > I can pipe greps together. This is what I currently do.
> > This is more convenient and feels faster than workaround #1, but
> > I suspect the inter-process overhead is significant.
> >
> > If grep implemented this internally, it could zero-copy.
> > Being able to "grep -rnH --and" &c would also be convenient.
> >
> > For example,
> >
> > grep --and -F -eX -eY -eZ input*.txt # becomes
> >
> > cat input*.txt |
> > grep -F -eX |
> > grep -F -eY |
> > grep -F -eZ
> 
Although I do not know whether it has been discussed and/or rejected, adding
this function to grep would mean implementing an internal conversion like
workaround #1. However, it scales poorly as you say, and it would
be slower than workaround #2 in many cases.
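
To put a number on "scales poorly": the expansion in workaround #1 needs one -e pattern per ordering of the sub-patterns, which is n factorial for n sub-patterns (six for X, Y and Z, as in the original message). A small sketch of that arithmetic follows; the function name orderings is only an illustration.

```shell
#!/bin/sh
# Number of -e patterns the workaround #1 expansion needs for n
# sub-patterns: one per ordering, i.e. n factorial.
orderings() {
    n=$1
    count=1
    while [ "$n" -gt 1 ]; do
        count=$((count * n))
        n=$((n - 1))
    done
    echo "$count"
}

for n in 3 5 10; do
    echo "$n patterns -> $(orderings "$n") orderings"
done
```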

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (16 Oct 2019 18:58:01 GMT)

Message #11 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 16 Oct 2019 11:57:31 -0700
Wouldn't it be more useful to have an intersection operator in regular 
expressions? That is, the pattern 'A\&B' would match anything that is 
matched by both A and B. If A and B have parenthesized subexpressions, 
both sets of parentheses would match and would count.
Assuming concatenation has higher precedence than \&, the requested 
behavior could be achieved via:
 grep '.*X.*\&.*Y.*\&.*Z.*'
This approach would allow intersection to be nested inside other 
operations. Also, it would clarify how other features work. For example, 
grep -o has clear semantics with this approach, whereas the semantics of 
grep -o are not so clear with the proposed --and option.
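
The \& operator proposed here does not exist in GNU grep. Where grep has been built with PCRE support, zero-width lookaheads under -P give a comparable single-regex intersection. This is a different technique from the proposal, sketched under that assumption; the function name grep_both_pcre is hypothetical.

```shell
#!/bin/sh
# Print lines of stdin that match both patterns, in any order.
# Assumes GNU grep built with PCRE support (-P); the arguments are PCRE
# patterns, wrapped in (?:...) so that alternation inside them stays grouped.
grep_both_pcre() {
    grep -P -e "^(?=.*(?:$1))(?=.*(?:$2))"
}

# Demo: the first two lines contain both X and Y; the third does not.
printf '%s\n' 'X then Y' 'Y then X' 'only X' | grep_both_pcre X Y
```

Because lookaheads consume no text, the match itself is empty, so this does not settle what -o should print either.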

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (17 Oct 2019 00:21:01 GMT)

Message #14 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 17 Oct 2019 11:19:54 +1100
Paul Eggert wrote:
> Wouldn't it be more useful to have an intersection operator in regular
> expressions? That is, the pattern 'A\&B' would match anything that is
> matched by both A and B. If A and B have parenthesized subexpressions, both
> sets of parentheses would match and would count.
Not for me personally, because I almost always want to use it with -Fwi :-)
(-F is a lot faster - about as fast as LC_COLLATE=C - and it also means
I don't have to think about escaping special characters.)
> [...]
>
> This approach would allow intersection to be nested inside other operations.
> Also, it would clarify how other features work. For example, grep -o has
> clear semantics with this approach, whereas the semantics of grep -o are not
> so clear with the proposed --and option.
I hadn't thought about -o, and I agree that is not very obvious.
Given an input file like
 30$	Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset
 30$	Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset
 30$	GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM
 30$	GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM
 30$	GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram
Where currently "grep -Fw -e 4GB -e DDR4 -o" prints
 4GB
 4GB
 DDR4
 4GB
 4GB
I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
 grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
i.e.
 4GB
 DDR4
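
The comparison above can be reproduced with the sample lines from this message. This is a sketch only: sample_parts is a hypothetical helper that prints the five lines with a tab after the price, and the second command is the pipeline emulation, not an actual --and option.

```shell
#!/bin/sh
# Sample data copied from the message above (price, tab, description).
sample_parts() {
    printf '%s\t%s\n' \
        '30$' 'Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset' \
        '30$' 'Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset' \
        '30$' 'GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM' \
        '30$' 'GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM' \
        '30$' 'GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram'
}

echo '--- current behaviour: -e patterns are ORed'
sample_parts | grep -Fw -o -e 4GB -e DDR4

echo '--- pipeline emulation of the wished-for --and -o'
sample_parts | grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -o -e 4GB -e DDR4
```

Only the "GeIL Pristine" line contains both words, so the second command prints matches from that line alone.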

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (17 Oct 2019 08:28:02 GMT)

Message #17 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 17 Oct 2019 01:27:35 -0700
On 10/16/19 5:19 PM, Trent W. Buck wrote:
> I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> 
> grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
You're right, it's not obvious. :-)
It may be better to just pipe greps together, as you do now. That's simple and 
fast enough for this relatively-uncommon case, and it's portable to all greps.

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (18 Oct 2019 11:50:02 GMT)

Message #20 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 18 Oct 2019 22:49:23 +1100
Paul Eggert wrote:
> On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> >
> > grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
>
> You're right, it's not obvious. :-)
>
> It may be better to just pipe greps together, as you do now. That's simple
> and fast enough for this relatively-uncommon case, and it's portable to all
> greps.
I admit that most of the time, I want "grep --and" for a small dataset
(<1MB computer_parts.txt), so it's merely a convenience.
Sometimes I grep audit logs (~1TB uncompressed), which takes anywhere
from 15 minutes to 3 days, depending on how I tweak my grep calls.
In that case, each grep in the pipeline has to pay the costs to
de-serialize input from the previous grep, and re-serialize output to
the next grep. If the first grep matches (say) 200GB of the 1TB,
that can be a lot of overhead (I assume).
I was basically hoping that if it was all in a single grep process,
the de/serialization steps could be skipped completely.
I think the buzzword for that is "zero-copy"?
I've noticed plain "grep" is about 30% slower than either "grep -F" or
"LC_COLLATE=C grep", because (I think) the latter two avoid the costs of
decoding from UTF-8 to Unicode and back. So I was basically expecting a
similar saving from --and.
I'm only speaking as an end user - I haven't dug through the grep
source, so those expectations might be unrealistic, and implementing
it might be painful/impossible. I figured I should at least ask :-)
If your expert opinion is that it's a pain to implement (and
maintain!) and there's not enough demand, then I'm OK with that.
This is NOT something that's burning me every day.
Regardless, I appreciate you taking the time to discuss it. :-)
PS: Regarding portability, I'm personally not worried because when I
need a GNUism badly enough (e.g. du --threshold), I can usually get
permission to install the relevant GNU software, even if it's only
into %APPDATA% or $HOME.
PS: I noticed on bugs.gnu.org something about grep being
single-threaded, which might mean "grep --and" would end up being
SLOWER than the existing pipelines, since the kernel can distribute
a pipeline's elements across multiple CPUs/cores.

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (18 Oct 2019 17:52:03 GMT)

Message #23 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 18 Oct 2019 10:51:29 -0700
On 10/18/19 4:49 AM, Trent W. Buck wrote:
> In that case, each grep in the pipeline has to pay the costs to
> de-serialize input from the previous grep
Sure, but grep is designed to be a simple tool and we need to draw the 
line somewhere. For something more complicated there are already sed and 
awk (if you want to write to POSIX) or Perl or Python or whatever.
I mildly prefer the A\&B notation because it could be used
everywhere, not just in grep. (But of course someone would have to 
implement it. :-)
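
As an illustration of the awk route mentioned above, a single awk process can test every string against each line, with no pipeline between stages. This is a sketch under assumptions: awk_all is a hypothetical helper name, and matching is by fixed string only, with no equivalent of grep's -w or -i.

```shell
#!/bin/sh
# Hypothetical helper: print lines of stdin containing every argument as a
# fixed string (awk's index() does no regex interpretation).
awk_all() {
    awk '
        BEGIN {
            # Copy the patterns out of ARGV, then set ARGC to 1 so that awk
            # reads standard input instead of treating them as file names.
            n = ARGC - 1
            for (i = 1; i <= n; i++) pat[i] = ARGV[i]
            ARGC = 1
        }
        {
            # Skip the line as soon as one pattern is missing.
            for (i = 1; i <= n; i++)
                if (index($0, pat[i]) == 0) next
            print
        }
    ' "$@"
}

# Demo: the first and third lines contain all three strings.
printf '%s\n' 'WD SATA 2TB 2.5in' 'WD SATA 4TB 3.5in' '2.5in 2TB SATA SSD' |
    awk_all SATA 2TB 2.5in
```

Whether this is faster than chained greps on large inputs is not something the thread measures; it only removes the copying between processes.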

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (18 Oct 2019 22:37:01 GMT)

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Paul Jackson" <pj <at> usa.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 18 Oct 2019 17:35:38 -0500
I'm currently working on rewriting and packaging up a tool that
I use to handle such high volume [and/or/not] filters on long lists
of file pathnames and of log file entries. It's a tool I've had in
my private toolbox for decades. I call it "ftest". It has a rich set
of "test" like flags for testing stat(2) attributes of files, but is
optimized for working in pipelines (as a filter, hence the "f").
Trent - do you need regular expression matching, or is glob matching
easily sufficient, or would even just fixed string matching be useful?
For [and/or/not] logical combinations of full regular expressions, 
I'll probably continue to use awk, as Paul Eggert suggested, though
that might be because I've long been an awk user, since teaching
an awk class to other engineers inside Bell Labs, some 40 years ago.
Perhaps sometime, months into the future, I'll follow up with an
update pointing to my "ftest" command on github.
-- 
 Paul Jackson
 pj <at> usa.net

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (19 Oct 2019 06:58:01 GMT)

Message #29 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 19 Oct 2019 15:57:39 +0900
On 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:
> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F. For example,
>
> grep --and -eX -eY -eZ input*.txt # becomes
>
> grep -eZ.*Y.*X \
> -eZ.*X.*Y \
> -eY.*Z.*X \
> -eY.*X.*Z \
> -eX.*Z.*Y \
> -eX.*Y.*Z \
> input*.txt
I have noticed that the two commands above do not necessarily produce the same results.
 grep --and -e123 -e234 input*.txt
 grep -e '123.*234' -e '234.*123' input*.txt
"1234" matches the first, but it does not match the second, because the two matches overlap.
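
The difference can be checked with existing grep by using piped greps (workaround #2) in place of the proposed --and. A minimal sketch follows; the function names piped and ordered are only illustrative.

```shell
#!/bin/sh
# Workaround #2: each pattern is tested independently, so overlapping
# matches are fine.
piped() { grep -e 123 | grep -e 234; }

# Workaround #1: each alternative needs the two matches one after the
# other, which takes at least six characters.
ordered() { grep -e '123.*234' -e '234.*123'; }

printf '1234\n' | piped                        # prints 1234
printf '1234\n' | ordered || echo 'no match'   # grep finds nothing
```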

Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (18 Jan 2023 01:59:01 GMT)

Message #32 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: 18 Jan 2023 12:57:53 +1100
On Fri 18 Oct 2019 22:49:23 +1100, Trent W. Buck wrote:
> Paul Eggert wrote:
> > On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> > >
> > > grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
> >
> > You're right, it's not obvious. :-)
> >
> > It may be better to just pipe greps together, as you do now. That's simple
> > and fast enough for this relatively-uncommon case, and it's portable to all
> > greps.
> 
> I admit that most of the time, I want "grep --and" for a small dataset
> (<1MB computer_parts.txt), so it's merely a convenience.
I noticed I forgot to attach a helper script I've been using for decades.
Here it is.
[foldr.sh (application/x-sh, attachment)]

This bug report was last modified 2 years and 357 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
