I have a large UTF-8 text file which I frequently search with grep. Recently grep began reporting that it was a binary file. I can continue to search it with grep -a, but I was wondering what change made it decide that the file was now binary.

I have a copy from last month in which the file is not detected as binary, but it's not practical to diff them since they differ on > 20,000 lines.
file identifies my file as:

UTF-8 Unicode English text, with very long lines
How can I find the characters/lines/etc. in my file which are triggering this change?
The similar, non-duplicate question 19907 covers the possibility of NUL, but grep -Pc '[\x00-\x1F]' says that I don't have NUL or any other ASCII control characters.
3 Answers
It appears to be the presence of the null character in the file (usually displayed as ^@). I entered various control characters into a text file (delete, ^?, for example), and only the null character caused grep to consider it a binary file. This was only tested with grep; the less and diff commands, for instance, may use different methods. Control characters generally don't appear outside binaries. The exceptions are the whitespace characters: newline (^J), tab (^I), formfeed (^L), vertical tab (^K), and carriage return (^M).
However, foreign characters, like Arabic or Chinese letters, are not standard ASCII, and perhaps could be confused with control characters. Perhaps that's why it's only the null character.
You can test it out for yourself by inserting control characters into a text file using the text editor vim: go to insert mode, press Ctrl-V, and then type the control character.
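To reproduce this without an editor, here is a minimal sketch (assuming GNU grep; the exact binary-file message varies between versions):

```shell
# An ESC control character (\033) does not trigger binary detection:
# grep prints the matching line normally.
printf 'abc\033def\n' | grep c

# A NUL byte does: instead of the line, grep reports something like
# "Binary file (standard input) matches" (wording and output stream
# vary by grep version).
printf 'abc\000def\n' | grep c
```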
A typical modern grep implementation should only declare a file "binary" if there are nul bytes inside. Anything else should be OK.
I cannot speak for the grep implementation you use...
An encoding error according to mbrlen() also makes GNU grep 2.24 consider the file binary. E.g.:
export LC_CTYPE='en_US.UTF-8'
printf 'a\x80' | grep 'a'
because \x80 cannot be the first byte of a UTF-8 code point: https://en.wikipedia.org/wiki/UTF-8#Description

This is the only other possibility besides NUL.
Interpretation of the GNU grep source code that leads to this conclusion: What makes grep consider a file to be binary?
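Building on this, a sketch of how one might actually locate the offending bytes (my own suggestion, not part of the answer above): iconv can validate the byte stream independently of the locale, and with GNU grep in a UTF-8 locale, inverted whole-line matching lists the lines containing invalid bytes.

```shell
# Create a sample file with one byte (\x80) that is invalid UTF-8.
printf 'good line\nbad \200 line\n' > sample.txt

# iconv stops at the first invalid sequence and reports its byte
# offset; a clean file converts silently with exit status 0.
iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null

# In a UTF-8 locale, '.' only matches a valid character, so
# -x -v (whole-line match, inverted) prints exactly the lines
# that contain invalid bytes; -a keeps grep in text mode, -n
# shows line numbers. Assumes the en_US.UTF-8 locale exists.
LC_ALL=en_US.UTF-8 grep -naxv '.*' sample.txt
```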
That wasn't the case in v2.20: it would print a regardless of LANG and LC_*. Now grep v3.1 prints a when not using a UTF-8 locale, else it detects the input as binary. I also suspect 2.20 checked only one chunk of the file, as I could grep without issue files that had null chunks in them; v3.1 now stops printing matches with the "Binary file ... matches" message. This is annoying, as these chunks are common in files written concurrently by multiple processes (e.g. ~/.bash_history), and lone nulls don't cause any issue on a terminal... :( – Thomas Guyot-Sionnest, Sep 27, 2024 at 4:03
My file had NUL and some Escs in it. I tried grepping for them. I could find the Escs (\x1B), but the NUL never showed up. The test given above showed 1 for the line containing Escs, but nothing for any range that didn't contain \x1B. I wouldn't trust that test.

Try grep -zc . instead (the count should be one more than the number of NULs in your file). (Also, you might be better off using [[:cntrl:]].) You can use sed -z 's/.*\(....\)$/\1/' foo | od -c to see a few characters before the NUL (if there is one), which might lead you to the problem.

My sed doesn't have a -z option: sed: invalid option -- 'z'.
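A quick sketch of the grep -zc . suggestion (assuming GNU grep, whose -z option switches the record separator from newline to NUL, so a file containing N NULs followed by trailing text splits into N+1 records):

```shell
# Two NUL bytes split the stream into three NUL-delimited records;
# with -z each record is one "line", and -c counts the records that
# contain at least one character.
printf 'one\000two\000three' | grep -zc .
```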