Package: grep;
To reply to this bug, email your comments to 32704 AT debbugs.gnu.org.
bug-grep <at> gnu.org:bug#32704; Package grep.
Message #5 received at submit <at> debbugs.gnu.org (11 Sep 2018 16:27:01 GMT):
From: 21naown <at> gmail.com
To: bug-grep <at> gnu.org
Subject: Can grep search for a line feed and a null character at the same time?
Date: 11 Sep 2018 18:25:20 +0200
Hello,

I found someone who asked the same question on "Stack Overflow", still unanswered, but this person did not ask it on the mailing list.

Here are the details of the question which are nearly similar to my case:
https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time

Thank you for your understanding.

Best regards.
Message #8 received at 32704 <at> debbugs.gnu.org (11 Sep 2018 17:04:01 GMT):
From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 11 Sep 2018 12:03:17 -0500
On 9/11/18 11:25 AM, 21naown <at> gmail.com wrote:
> Hello,
>
> I found someone who asked the same question on "Stack Overflow", still
> unanswered, but this person did not ask it on the mailing list.
>
> Here are the details of the question which are nearly similar to my case:
> https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time

Per 'info grep':

  15. How can I match across lines?

      Standard grep cannot do this, as it is fundamentally line-based.
      Therefore, merely using the ‘[:space:]’ character class does not
      match newlines in the way you might expect.

      With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input and
      output "line" is null-terminated; *note Other Options::. Thus, you
      can match newlines in the input, but typically if there is a match
      the entire input is output, so this usage is often combined with
      output-suppressing options like ‘-q’, e.g.:

          printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'

      If this does not suffice, you can transform the input before
      giving it to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many
      other utilities that are designed to operate across lines.

Grep does not have the ability to match hex or octal backslash sequences, and a literal newline in the pattern is taken as a separation of patterns. Use of [:space:] to include newline alongside other things sort of works.

But maybe we really do have a bug - when -z is in effect, I'd expect NUL, rather than newline, to be the byte that separates separate patterns in the pattern argument (and thus expressing a literal newline, as in shells that understand $'\n', to be viable for writing a single pattern that matches exactly one newline byte at the end of a NUL-separated record).

That said, your EASIEST approach is to use iconv to recode your file out of UTF-16 (which is NOT conducive to multi-byte processing), into something friendlier like UTF-8, and then use grep on the converted file.
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
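The iconv route suggested above could look roughly like the following. This is only a sketch, not something taken from the thread: the sample bytes are a made-up UTF-16LE stream (BOM, CR LF, "test", CR LF), the variable name `match_count` is hypothetical, and it assumes an `iconv` that accepts the encoding names `UTF-16` and `UTF-8`.

```shell
# Hypothetical sample: BOM (FF FE), CR LF, "test", CR LF, all in UTF-16LE.
# Octal escapes are used because POSIX printf does not guarantee \xHH.
# After iconv the stream is UTF-8 and contains no NUL bytes, so an ordinary
# line-based grep can count the lines that contain "test".
match_count=$(
  printf '\377\376\r\000\n\000t\000e\000s\000t\000\r\000\n\000' |
    iconv -f UTF-16 -t UTF-8 |
    grep -c 'test'
)
echo "matching lines: $match_count"
```

Once the data is UTF-8, the pattern can be written as plain text instead of as a byte sequence.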
Message #11 received at 32704 <at> debbugs.gnu.org (11 Sep 2018 17:15:02 GMT):
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 11 Sep 2018 10:14:51 -0700
On 9/11/18 10:03 AM, Eric Blake wrote:
> maybe we really do have a bug - when -z is in effect, I'd expect NUL,
> rather than newline, to be the byte that separates separate patterns
> in the pattern argument

You're right, I think it's a bug that grep -zf FILE uses newline separators in FILE. It should use NUL separators.

This cannot be done for NUL bytes in command-line patterns, though, since command-line arguments cannot contain NUL bytes.
Message #14 received at 32704 <at> debbugs.gnu.org (11 Sep 2018 17:41:02 GMT):
From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 11 Sep 2018 12:39:54 -0500
On 9/11/18 12:14 PM, Paul Eggert wrote:
> On 9/11/18 10:03 AM, Eric Blake wrote:
>> maybe we really do have a bug - when -z is in effect, I'd expect NUL,
>> rather than newline, to be the byte that separates separate patterns
>> in the pattern argument
>
> You're right, I think it's a bug that grep -zf FILE uses newline
> separators in FILE. It should use NUL separators.
>
> This cannot be done for NUL bytes in command-line patterns, though,
> since command-line arguments cannot contain NUL bytes.

Indeed. But that merely means that on the command line, when -z is in effect, you can't specify multiple patterns (but instead have to use -f FILE if that's what you really want). Meanwhile, the effect on being able to match a literal newline would be observable from either the command line or -f FILE.

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Message #17 received at 32704 <at> debbugs.gnu.org (15 Sep 2018 17:07:01 GMT):
From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 15 Sep 2018 12:06:47 -0500
On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
> Thank you for your messages.
>
> It is possible I did not understand correctly your messages, because
> grep finds hex sequences with the "-Pa" options at least.

grep -P introduces a completely different regex engine, with its own quirks. As such, it does introduce different rules on backslash sequences accepted.

> Examples: "input.txt" contains, from the file system, for example
> "\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00":
>
> grep -Pa '\x00' input.txt
> → found
> grep -Pza '\x0A' input.txt
> → found
> grep -Pa '\x0A\x00' input.txt

This will never match - when you are not using -z, there are no \x0A in the input stream (they have all been consumed by grep parsing one line at a time, ending at \x0A). Instead, you'll want to search for '^\x00' or '\x00$' for a pattern anchored to a line transition, to find patterns where newline was next to NUL.

> grep -Pza '\x0A\x00' input.txt
> → not found for the both

Similarly, when you are using -z, there are no \x00 in the input stream (they have all been consumed by grep parsing one NUL-terminated record at a time, ending at \x00). Instead, you'll want to search for '^\x0a' or '\x0a$' for a pattern anchored to a record transition, to find patterns where newline was next to NUL.

> But is it at least possible to find "\x0A\x00" with grep?

If you bend the rules by throwing -P into the mix, yes :)

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
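To make the anchoring advice above concrete, here is a small sketch. It assumes a GNU grep built with -P support; the helper name `sample_bytes` and the shortened sample (BOM, CR LF, "test", CR LF in UTF-16LE) are made up for illustration.

```shell
# Hypothetical helper that emits a shortened version of the thread's sample.
# Octal escapes are used because POSIX printf does not guarantee \xHH.
sample_bytes() {
  printf '\377\376\r\000\n\000t\000e\000s\000t\000\r\000\n\000'
}

# Unanchored: LF is the record terminator here, so it is never part of a
# line, and "LF followed by NUL" cannot match.
if sample_bytes | LC_ALL=C grep -Paq '\x0A\x00'; then
  unanchored=match
else
  unanchored=no-match
fi

# Anchored: a NUL at the start of any line other than the first means the
# previous line's LF was directly followed by NUL.
if sample_bytes | LC_ALL=C grep -Paq '^\x00'; then
  anchored=match
else
  anchored=no-match
fi
echo "unanchored: $unanchored, anchored: $anchored"
```

Note that '^\x00' would also match a file whose very first byte is NUL, where no LF precedes it, so the anchored form is an approximation of "LF next to NUL" rather than an exact test.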
Message #20 received at 32704 <at> debbugs.gnu.org (15 Sep 2018 17:44:01 GMT):
From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 15 Sep 2018 18:43:44 +0200
Thank you for your messages.

It is possible I did not understand correctly your messages, because grep finds hex sequences with the "-Pa" options at least.

Examples: "input.txt" contains, from the file system, for example
"\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00":

grep -Pa '\x00' input.txt
→ found
grep -Pza '\x0A' input.txt
→ found
grep -Pa '\x0A\x00' input.txt
grep -Pza '\x0A\x00' input.txt
→ not found for the both

But is it at least possible to find "\x0A\x00" with grep?
Message #23 received at 32704 <at> debbugs.gnu.org (15 Sep 2018 17:58:02 GMT):
From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 15 Sep 2018 19:57:24 +0200
On 15/09/2018 at 19:06, Eric Blake wrote:
> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>> Thank you for your messages.
>>
>> It is possible I did not understand correctly your messages, because
>> grep finds hex sequences with the "-Pa" options at least.
>
> grep -P introduces a completely different regex engine, with its own
> quirks. As such, it does introduce different rules on backslash
> sequences accepted.
>
>> Examples: "input.txt" contains, from the file system, for example
>> "\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00":
>>
>> grep -Pa '\x00' input.txt
>> → found
>> grep -Pza '\x0A' input.txt
>> → found
>> grep -Pa '\x0A\x00' input.txt
>
> This will never match - when you are not using -z, there are no \x0A
> in the input stream (they have all been consumed by grep parsing one
> line at a time, ending at \x0A). Instead, you'll want to search for
> '^\x00' or '\x00$' for a pattern anchored to a line transition, to
> find patterns where newline was next to NUL.
>
>> grep -Pza '\x0A\x00' input.txt
>> → not found for the both
>
> Similarly, when you are using -z, there are no \x00 in the input
> stream (they have all been consumed by grep parsing one
> NUL-terminated record at a time, ending at \x00). Instead, you'll
> want to search for '^\x0a' or '\x0a$' for a pattern anchored to a
> record transition, to find patterns where newline was next to NUL.
>
>> But is it at least possible to find "\x0A\x00" with grep?
>
> If you bend the rules by throwing -P into the mix, yes :)

So it is possible to find "\x0A\x00" alone, but for example "\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00" is impossible to find with the "-P" option?
Message #26 received at 32704 <at> debbugs.gnu.org (15 Sep 2018 20:21:01 GMT):
From: Assaf Gordon <assafgordon <at> gmail.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 15 Sep 2018 14:20:40 -0600
Hello,
On 15/09/18 11:57 AM, 21naown <at> gmail.com wrote:
> On 15/09/2018 at 19:06, Eric Blake wrote:
>> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>>> But is it at least possible to find "\x0A\x00" with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find "\x0A\x00" alone, but for example
> "\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00" is impossible to find with the
> "-P" option?
If I may suggest a different tool, GNU sed can handle such regexes more
easily than grep.
The 'trick' is to accumulate multiple lines into memory, then run the
regex on the entire buffer.
1.
If your input is small enough to fit in memory,
you can load the entire file into memory,
and run the regex on the buffer:
$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
    | LC_ALL=C sed -n 'H;$!d ; x ; /\x0a\x00/q0 ; q1' \
    && echo MATCH || echo NO-MATCH
The "H;$!d" commands accumulate lines into the hold buffer.
The "x" command exchanges the hold buffer and the pattern buffer, so the accumulated lines end up in the pattern buffer.
Then the regex "/\x0a\x00/" searches in the buffer.
If there was a match, sed quits with exit code 0 ("q0").
Otherwise, sed quits with exit code 1 ("q1").
2.
If the file is too big to fit in memory,
you can process it line-by-line like so:
$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
    | LC_ALL=C sed -n 'N;/\x00\x0a/q0;$q1;D;' \
    && echo MATCH || echo NO-MATCH
The N,D commands work in tandem to append the next line into the
buffer, then delete the oldest line from the buffer (think FIFO).
The regex then operates on the buffer which contains the last two lines.
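As a sketch of how the first (whole-file) variant could be wrapped for reuse, the helper below runs the same sed script on standard input and is exercised on one input that contains LF followed by NUL and one that does not. The function name `has_lf_nul` and both test inputs are hypothetical, and GNU sed is assumed (for \xHH in regexes and for exit codes on "q").

```shell
# Hypothetical helper: exit status 0 if stdin contains LF followed by NUL,
# exit status 1 otherwise. Uses the hold-buffer script from variant 1.
has_lf_nul() {
  LC_ALL=C sed -n 'H;$!d ; x ; /\x0a\x00/q0 ; q1'
}

# Input with NUL right after each LF (octal escapes for portability).
if printf 'a\000\n\000b\000\n\000' | has_lf_nul; then
  with_pair=MATCH
else
  with_pair=NO-MATCH
fi

# Plain text without any NUL byte.
if printf 'plain text\n' | has_lf_nul; then
  without_pair=MATCH
else
  without_pair=NO-MATCH
fi
echo "$with_pair $without_pair"
```

One caveat with this sketch: on completely empty input sed never runs the script, so "q1" is not reached and the exit status is 0; that case would need to be checked separately.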
More details are in the manual:
https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques
https://www.gnu.org/software/sed/manual/sed.html#Text-search-across-multiple-lines
regards,
- assaf
Message #29 received at 32704 <at> debbugs.gnu.org (15 Sep 2018 20:28:01 GMT):
From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 15 Sep 2018 15:27:08 -0500
On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:
>>> But is it at least possible to find "\x0A\x00" with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find "\x0A\x00" alone, but for example
> "\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00" is impossible to find with the
> "-P" option?

Correct. It is impossible to find the record terminator in the middle of a pattern, whether that terminator is \n (default) or NUL (-z). It is therefore impossible to find a multi-record match using grep. The string you listed contains both \x00 and \x0a, so regardless of which of those two bytes you pick as the record terminator, it is impossible to use grep to find that substring in your file. You'll have to resort to a tool that supports multiline matching, since grep is not such a tool.

It IS possible, of course, to change your data, for example:

    tr '\0' '\377' < file | grep $modified_pattern | tr '\377' '\0'

assuming that byte \377 (0xff; tr takes octal escapes, not \xff) didn't appear anywhere else in the file; although it may make matching harder if you don't have the right record terminators any longer.

Or, if your input data is encoded in UTF-16, it's easiest to convert it into UTF-8 for the grep:

    iconv -f UTF-16 -t UTF-8 < file | grep $modified_pattern \
      | iconv -f UTF-8 -t UTF-16

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Message #32 received at 32704 <at> debbugs.gnu.org (17 Sep 2018 15:57:02 GMT):
From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>, Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character at the same time?
Date: 17 Sep 2018 17:56:52 +0200
Hello Assaf.

Thank you Assaf and Eric for your suggestions. I will also look at the tool "pcregrep".

--------------------------------------------------------------------------------

Thank you Eric for having answered the question of the subject:

On 15/09/2018 at 22:27, Eric Blake wrote:
> On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:
>
>> So it is possible to find "\x0A\x00" alone, but for example
>> "\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00" is impossible to find with
>> the "-P" option?
>
> Correct. It is impossible to find the record terminator in the middle
> of a pattern, whether that terminator is \n (default) or NUL (-z). It
> is therefore impossible to find a multi-record match using grep. The
> string you listed contains both \x00 and \x0a, so regardless of which
> of those two bytes you pick as the record terminator, it is impossible
> to use grep to find that substring in your file. You'll have to
> resort to a tool that supports multiline matching, since grep is not
> such a tool.
Paul Eggert <eggert <at> cs.ucla.edu>
to control <at> debbugs.gnu.org.
(21 Sep 2020 19:48:01 GMT)