I often use grep -ao ...word file.bin
to look up a piece of text ("word") and the few characters before it; as a reminder:
-a, --text
Process a binary file as if it were text; this is equivalent to the --binary-files=text option.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate
output line.
Right; so just now, I realized that it behaves like this: I've tried first looking up the string war
and one character before it:
$ grep -ao .war myfile.zip
/war
9war
$war
ʆwar
Ok, so I get 4 hits here. Now, if I want to look up the string war
and two characters before it:
$ grep -ao ..war myfile.zip
>$war
So, now, for some reason, I get only one result?!
My guess is, the value of "two characters before" in the three missing cases is 0x00 (end of C string), so grep
does not output that match. Otherwise I would still have expected 4 results (or 3, if the very first match was at the start of the file).
Can I somehow persuade grep
to simply "ignore" null bytes in matches (or replace them with a dot or something) and still print matches that might have them? If not grep
, is there another tool that could do this?
There are at least two (well three) problems with your approach.
Even with the non-standard -o
, grep
is line based, in that it finds all the matches to output on each line, lines being sequences of characters delimited by a newline character (byte with value 10 / 0x0a on ASCII-based systems).
So:
grep -o ..war
Will only return war
instances that follow 2 characters (not bytes, which is one of the 3 issues here) other than newline.
For instance, on an input like <0x0a>Xwar
, the 0x0a byte delimits the previous line, and the next line starts with Xwar
where there's only one character before war
.
In a UTF-8 locale, on an input like <0xff><0xc3><0xa9>war
, the two bytes <0xc3><0xa9>
form the é
character, but the 0xff byte before it is invalid, so can't form a character.
grep
in general is only meant to work on text, so depending on the grep
implementation, input with NUL characters or with overlong lines or not ending in newline characters could put a spanner in the works.
Then, in xxwarwar
, grep -o
will find xxwar
, but resume search for more matches after that, so won't find arwar
.
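The newline and non-overlap points can be checked on tiny inputs. A minimal sketch, assuming a grep that supports -a and -o (GNU grep, for instance); the sample strings are made up for illustration:

```shell
# A newline between "a" and "b" ends the line, so only one character
# precedes "war" on the second line.
one=$(printf 'a\nbwar\n' | grep -ao '.war' || true)
two=$(printf 'a\nbwar\n' | grep -ao '..war' || true)
printf 'one char before:  [%s]\n' "$one"
printf 'two chars before: [%s]\n' "$two"

# Matches do not overlap: after xxwar, only "war" is left to search.
overlap=$(printf 'xxwarwar\n' | grep -ao '..war' || true)
printf 'xxwarwar:         [%s]\n' "$overlap"
```

The ..war search on the two-line input finds nothing, and the xxwarwar input yields only xxwar.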
Those issues could be addressed by using perl
and:
perl -l -0777 -ne 'print "$1$2" while m{(?<=(..))(war)}sg'
Where we find instances of war
that are preceded by any 2 bytes (not characters in the user's locale), using a look-behind operator for those preceding bytes so as not to consume the input. With -0777
, which sets a record separator as something impossible, we work on the whole input as opposed to each line in the input.
+1 on the -0777 part. I was looking for a similar solution, and couldn't think of something as elegant. Had to resort to open and read with 1 byte. Would never have thought of providing a non-existent delimiter. I was wondering if that was a known "hack", or something you came up with. – aviro, Mar 20, 2023 at 5:50
@aviro, it's a well known and documented feature also called slurp mode. See perldoc perlrun. It's the equivalent of $/ = undef. Since perl 5.36, there's -g as an alias to it. – Stéphane Chazelas, Mar 20, 2023 at 6:29
An alternate method is to convert binary to hex and match:
hexdump -v -e '/1 "%02X" " "' file.bin | grep -o ".. .. $(printf "war" | hexdump -v -e '/1 "%02X" " "')"
You need one ".. " (two dots and a space) for each byte (each . of the original pattern) you want to match before the string. The downside is that it's slower than direct matches with grep or perl, and it won't find the subsequent match in consecutive patterns like warwar, as the perl solution does.
The result will be printed as hex values instead. If you want to print the result as a string, convert the bytes back like this:
hexdump -v -e '/1 "%02X" " "' file.bin | \
grep -o ".. .. $(printf "war" | hexdump -v -e '/1 "%02X" " "')" | \
xargs -d '\n' -n 1 bash -c '<<<"$1" xxd -r -p -; echo' bash
However, beware that \n, \r and many other control characters before the string will mess up the output.
You can also make the search faster by not printing a space after each byte, with the caveat that false positives may appear because the hex string can match in the middle of a byte. This way you match ".." instead of ".. " for each . of the original pattern:
hexdump -v -e '/1 "%02X" ""' file.bin | \
grep -o "....$(printf "war" | hexdump -v -e '/1 "%02X" ""')" | \
xargs -d '\n' -n 1 bash -c '<<<"$1" xxd -r -p -; echo' bash
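If hexdump is not installed, the same hex-matching idea can be sketched with od, which POSIX specifies. This is a substitution for the hexdump commands above, not the original method; the sample input is made up, and tr is used to flatten od's output into one line of space-separated bytes:

```shell
# Dump every byte as two hex digits, each preceded by a single space.
hex=$(printf 'a\0bwar' | od -An -v -tx1 | tr -s ' \n' ' ')
# 77 61 72 is "war"; each " .." stands for one arbitrary byte before it.
hit=$(printf '%s\n' "$hex" | grep -o ' .. .. 77 61 72' || true)
printf '%s\n' "$hit"
```

Starting the pattern with a space keeps every match aligned on a byte boundary, so the hex string cannot match in the middle of a byte. Like the hexdump version, this prints hex values and does not report overlapping matches.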
a\nbwar is not matched by ..war, as well.
.war? For instance, in the example @MarcusMüller provided, a\nbwar, would you like to show \nbwar or abwar? What is the purpose of your question?