Grep a number of bytes before match in binary files?

Question 1

I often use grep -ao ...word file.bin to look up a text content ("word") and the few characters before it; as a reminder:

 -a, --text
 Process a binary file as if it were text; this is equivalent to the --binary-files=text option.
 -o, --only-matching
 Print only the matched (non-empty) parts of a matching line, with each such part on a separate
 output line.

Right; so just now, I realized that it behaves like this: I've tried first looking up the string war and one character before it:

$ grep -ao .war myfile.zip
/war
9war
$war
ʆwar

Ok, so I get 4 hits here. Now, if I look up the string war and the two characters before it:

$ grep -ao ..war myfile.zip
>$war

So, now, for some reason, I get only one result?!

My guess is that, in the three missing cases, the character two positions before war is 0x00 (end of a C string), so grep does not output that match. Otherwise I would still have expected 4 results (or 3, if the very first match happened to sit at the start of the file).

Can I somehow persuade grep to simply "ignore" null bytes in matches (or replace them with a dot or something) and still print matches that might have them? If not grep, is there another tool that could do this?

Question 2

grep is also line-based by default, so a\nbwar is not matched by ..war either.

Question 3

What are you trying to achieve? Do you want to only show ASCII characters? Or do you want to check EVERY character before war? For instance, in the example @MarcusMüller provided, a\nbwar, would you like to show \nbwar or abwar? What is the purpose of your question?

Question 4

There are at least two (well three) problems with your approach.

Even with the non-standard -o, grep is line based, in that it finds all the matches to output on each line, lines being sequences of characters delimited by a newline character (byte with value 10 / 0x0a on ASCII-based systems).

So:

grep -o ..war

Will only return war instances that follow 2 characters (not bytes, which is one of the 3 issues here) other than newline.

For instance, on an input like <0x0a>Xwar, the 0x0a byte delimits the previous line, and the next line starts with Xwar where there's only one character before war.
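The line-based behaviour is easy to reproduce on small inputs. A minimal sketch, assuming a grep that supports -o (GNU, BSD and busybox grep do); count_matches is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper: count the '..war' matches grep -o reports on stdin.
count_matches() {
    grep -o '..war' | wc -l | tr -d ' '
}

# Newline right before "Xwar": the second line is just "Xwar",
# so only one character precedes "war" and nothing matches.
printf '\nXwar\n' | count_matches

# Two ordinary characters before "war" on the same line: one match.
printf 'abwar\n' | count_matches
```

The first call reports 0 matches and the second reports 1, even though war is present in both inputs.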

In a UTF-8 locale, on an input like <0xff><0xc3><0xa9>war, the two bytes <0xc3><0xa9> form the é character, but the 0xff byte before it is invalid, so can't form a character. Only one character then precedes war, and ..war does not match.
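One possible workaround for the character-versus-byte issue, offered here as an assumption rather than as part of the approach described in this answer: in the C locale, GNU grep treats every byte as one character, so forcing LC_ALL=C makes . match single bytes (newline still excluded). Other grep implementations may behave differently. The helper name is hypothetical, and od is used only to make the matched bytes visible:

```shell
# Hypothetical helper: run the search byte-wise and print the matches as hex.
# LC_ALL=C makes '.' match one byte rather than one locale character
# (assumption: GNU grep; -a keeps grep from reporting "binary file matches").
match_bytes_hex() {
    LC_ALL=C grep -ao '..war' | od -An -v -tx1 | tr -d ' \n'
    echo
}

# Input bytes: 0xff 0xc3 0xa9 "war". The two bytes before "war" are c3 a9.
printf '\377\303\251war\n' | match_bytes_hex
```

The hex dump shows c3 a9 followed by 77 61 72 (war) and a final 0a, which is the newline grep appends to each match.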

grep in general is only meant to work on text, so depending on the grep implementation, input with NUL characters or with overlong lines or not ending in newline characters could put a spanner in the works.

Then, in xxwarwar, grep -o will find xxwar, but it resumes searching for further matches after that, so it won't find arwar.
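The non-overlapping behaviour can be checked directly. A small sketch, again assuming a grep with -o; list_matches is a hypothetical helper name:

```shell
# Hypothetical helper: list the '..war' matches grep -o reports, one per line.
list_matches() {
    grep -o '..war'
}

# "war" occurs twice, but the two characters before the second
# occurrence ("ar") were already consumed by the first match.
printf 'xxwarwar\n' | list_matches
```

Only xxwar is printed; the second war is never reported.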

Those issues could be addressed by using perl and:

perl -l -0777 -ne 'print "$1$2" while m{(?<=(..))(war)}sg'

Where we find instances of war that are preceded by any 2 bytes (not characters in the user's locale), using a look-behind operator for those preceding bytes so as not to consume the input. With -0777, which sets the record separator to something impossible, we work on the whole input as opposed to each line in the input.

Question 5

+1 on the -0777 part. I was looking for a similar solution and couldn't think of anything as elegant. I had to resort to open and read with 1 byte. I would never have thought of providing a non-existent delimiter. I was wondering whether that is a known "hack" or something you came up with.

Question 6

@aviro, it's a well known and documented feature also called slurp mode. See perldoc perlrun. It's the equivalent of $/ = undef. Since perl 5.36, there's -g as an alias to it.

Question 7

An alternate method is to convert binary to hex and match:

hexdump -v -e '/1 "%02X" " "' file.bin | grep -o ".. .. $(printf "war" | hexdump -v -e '/1 "%02X" " "')"

You need one ".. " (two dots and a space) for each byte you want to match before the string. The downside is that it's slower than matching directly with grep or perl and, unlike the perl solution, it won't find the subsequent match in consecutive patterns like warwar, where the bytes before the second war were already consumed by the first match.

The result will be printed as hex values instead. If you want to print the result as a string, convert the bytes back like this:

hexdump -v -e '/1 "%02X" " "' file.bin | \
 grep -o ".. .. $(printf "war" | hexdump -v -e '/1 "%02X" " "')" | \
 xargs -d '\n' -n 1 bash -c '<<<"$1" xxd -r -p -; echo' bash

However, beware that \n, \r and many other control characters before the string will mess up the output.

You can also make the search faster by not printing a space after each byte, with the caveat that false positives may appear because the hex string can match starting in the middle of a byte. This way you match .. instead of ".. " for each byte:

hexdump -v -e '/1 "%02X" ""' file.bin | \
 grep -o "....$(printf "war" | hexdump -v -e '/1 "%02X" ""')" | \
 xargs -d '\n' -n 1 bash -c '<<<"$1" xxd -r -p -; echo' bash
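The false-positive risk of the space-free variant can be shown with four bytes that do not contain war at all. This sketch uses od rather than hexdump (an assumption made only because od is specified by POSIX; its hex digits are lowercase), and the helper names are hypothetical:

```shell
# Hypothetical helpers: dump stdin as hex, without and with separators.
hex_packed() { od -An -v -tx1 | tr -d ' \n'; echo; }
hex_spaced() { od -An -v -tx1 | tr -s ' \n' '  '; echo; }

# "war" is 77 61 72 in hex.
# The bytes 0x57 0x76 0x17 0x2a contain no "war", yet the packed dump
# 5776172a contains 776172 starting in the middle of the first byte.
printf '\127\166\027\052' | hex_packed | grep -c '776172'

# With a separator after each byte the pattern cannot straddle bytes.
printf '\127\166\027\052' | hex_spaced | grep -c '77 61 72' || true
```

The packed dump yields a count of 1 (a false positive), while the spaced dump yields 0.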

score 4 · Accepted Answer · 2023-03-19 18:04:19Z
