Why is an explicit LANG=C required when searching for hex representations of characters in grep?

Question 1

When I want to recursively search TeX files for characters unsupported by my font, I typically start with a search for non-breaking spaces and zero-width spaces. These are difficult to type on the terminal command line, so I use their UTF-8 hexadecimal representations.

env LANG=C grep -obUaP "\xc2\xa0" $(find -name '*.tex')
env LANG=C grep -obUaP "\xe2\x80\x8b" $(find -name '*.tex')

Why do I need to explicitly set the LANG environment variable to C, as shown above with env LANG=C?

Notes

Using -U and -a simultaneously may seem erroneous, but this version of the manual (linked below) states:

When type is ‘binary’, grep may treat non-text bytes as line terminators even without the -z (--null-data) option.

-a forces only line terminators to be line terminators (not so clear).

http://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html

Question 2

Why are you using both -U and -a?

Question 3

@StephenKitt Good question. The -U changes the behavior of grep to allow for a bytewise search. As far as -a goes, the man page does not elaborate on the matter, but it says that -a makes grep process bytes as text, which is correct in this case. I am open to criticism, suggestions, or reasoning to improve my knowledge here.

Question 4

I don’t know the answer in detail, I was just under the impression that -U (--binary) and -a (--text) are contradictory ;-). Or were you thinking of -u (--unix-byte-offsets)?

Question 5

@StephenKitt The reason is that when reading in binary using -U, any non-text bytes, including line terminators, are treated as line terminators. -a forces only line terminators to be treated as line terminators. So using -a is merely a precaution to ensure that the line numbering is correct in the output when you know that the input is supposed to be text.

score 3 · Accepted Answer · 2017-10-05 07:01:27Z

My version of the grep manual does not include this, but the grep 3.0 manual elaborates on this topic.

Warning: The -a (--binary-files=text) option might output binary garbage, which can have nasty side effects if the output is a terminal and if the terminal driver interprets some of it as commands. On the other hand, when reading files whose text encodings are unknown, it can be helpful to use -a or to set ‘LC_ALL='C'’ in the environment, in order to find more matches even if the matches are unsafe for direct display.
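One way to act on that warning is to keep only the file name and byte offset and drop the matched bytes before they reach the terminal. The following is a minimal sketch, assuming GNU grep built with -P (PCRE) support; the temporary file and the variable names tmp and safe are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample file: "x", a zero-width space (bytes E2 80 8B), "y".
tmp=$(mktemp)
printf 'x\342\200\213y\n' > "$tmp"

# -H prints the file name, -b the byte offset, -o only the matched part.
# cut keeps the first two colon-separated fields (file name and offset)
# and discards the raw matched bytes, so nothing unsafe is displayed.
# Assumes the file name itself contains no colon.
safe=$(LC_ALL=C grep -HobaP '\xe2\x80\x8b' "$tmp" | LC_ALL=C cut -d: -f1,2)

echo "$safe"
rm -f "$tmp"
```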

From this answer: https://unix.stackexchange.com/a/87763/33386

In the C locale, characters are single bytes, the charset is ASCII [...]

This is probably why setting the C locale helps when scanning text files of unknown encoding: it forces an ASCII character set in which every character is a single byte.
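To see the single-byte behaviour in practice, here is a small sketch, assuming GNU grep built with -P (PCRE) support. The temporary file and the variable names are hypothetical. LC_ALL is used instead of LANG because LC_ALL takes precedence over LANG and LC_CTYPE if those happen to be set in the environment. As far as I understand, in a UTF-8 locale -P may read \xc2 as the code point U+00C2 rather than as a raw byte, which would explain why the pattern only behaves bytewise in the C locale.

```shell
#!/bin/sh
# Hypothetical sample file: "a", a no-break space (bytes C2 A0), "b".
tmp=$(mktemp)
printf 'a\302\240b\n' > "$tmp"

# In the C locale each byte is one character, so the two escapes
# match the two raw bytes of the UTF-8 encoded no-break space.
count=$(LC_ALL=C grep -c -aP '\xc2\xa0' "$tmp")

# -o prints only the match and -b prefixes its byte offset;
# cut keeps just the offset and drops the matched bytes.
offset=$(LC_ALL=C grep -obaP '\xc2\xa0' "$tmp" | LC_ALL=C cut -d: -f1)

echo "matching lines: $count, byte offset: $offset"
rm -f "$tmp"
```

Running the same two grep commands without LC_ALL=C in your own locale is a quick way to check whether the locale is what changes the result.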
