
When I want to recursively search TeX files for characters that my font does not support, I typically start by searching for non-breaking spaces and zero-width spaces. These are difficult to type on the command line, so I use their UTF-8 hexadecimal representations.

env LANG=C grep -obUaP "\xc2\xa0" $(find . -name '*.tex')
env LANG=C grep -obUaP "\xe2\x80\x8b" $(find . -name '*.tex')

Why do I need to explicitly set the LANG environment variable to C, as shown above with env LANG=C?


Notes

Using -U and -a simultaneously may seem contradictory, but the version of the grep manual linked below states:

When type is ‘binary’, grep may treat non-text bytes as line terminators even without the -z (--null-data) option.

As I understand it (the manual is not very clear on this), -a ensures that only actual line terminators are treated as line terminators.

http://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html
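For anyone who wants to reproduce the search on a known input, here is a minimal sketch. It is not the command from the question: it swaps -P for a fixed-string search (-F) and builds the byte sequences with printf octal escapes, on the assumption that those are more portable than \x escapes. The file name sample.tex and the variable names are made up for illustration. LC_ALL is used instead of LANG because LC_ALL takes precedence over LANG and the other LC_* variables.

```shell
#!/bin/sh
# Sketch: byte-wise search for two UTF-8 sequences in a throwaway file.
# sample.tex and all variable names are hypothetical.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.tex"

# Octal escapes for the UTF-8 encodings:
#   \302\240      = 0xC2 0xA0      = U+00A0 (no-break space)
#   \342\200\213  = 0xE2 0x80 0x8B = U+200B (zero-width space)
printf 'a\302\240b\n' > "$sample"
printf 'c\342\200\213d\n' >> "$sample"

nbsp=$(printf '\302\240')
zwsp=$(printf '\342\200\213')

# -F: treat the pattern as a fixed string; -c: count matching lines.
# LC_ALL=C asks grep to work on single bytes rather than decoded characters.
nbsp_lines=$(LC_ALL=C grep -c -F -- "$nbsp" "$sample")
zwsp_lines=$(LC_ALL=C grep -c -F -- "$zwsp" "$sample")

echo "lines with no-break space: $nbsp_lines"
echo "lines with zero-width space: $zwsp_lines"

rm -r "$tmpdir"
```

Each search should find exactly one line in this sample, since each sequence was written to only one line.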

asked Oct 4, 2017 at 12:20
  • Why are you using both -U and -a? Commented Oct 4, 2017 at 13:07
  • @StephenKitt Good question. The -U option changes the behavior of grep to allow a bytewise search. As for -a, the man page does not elaborate on the matter, but it says that -a makes grep process bytes as text, which is correct in this case. I am open to criticism, suggestions, or reasoning that improves my knowledge here. Commented Oct 4, 2017 at 13:12
  • I don’t know the answer in detail, I was just under the impression that -U (--binary) and -a (--text) are contradictory ;-). Or were you thinking of -u (--unix-byte-offsets)? Commented Oct 4, 2017 at 13:28
  • @StephenKitt The reason is that when reading in binary using -U, any non-text bytes, including line terminators, are treated as line terminators. -a forces only line terminators to be treated as line terminators. So using -a is merely a precaution to ensure that the line numbering is correct in the output when you know that the input is supposed to be text. Commented Oct 4, 2017 at 15:26

1 Answer


My version of the grep manual does not include this, but the grep 3.0 manual elaborates on the topic:

Warning: The -a (--binary-files=text) option might output binary garbage, which can have nasty side effects if the output is a terminal and if the terminal driver interprets some of it as commands. On the other hand, when reading files whose text encodings are unknown, it can be helpful to use -a or to set ‘LC_ALL='C'’ in the environment, in order to find more matches even if the matches are unsafe for direct display.

From this answer: https://unix.stackexchange.com/a/87763/33386

In the C locale, characters are single bytes, the charset is ASCII [...]

This is probably why setting the locale to C helps when scanning text files of unknown encoding: it forces an ASCII character set in which every byte is treated as one character, so grep compares the input byte by byte.
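The "characters are single bytes" point can be checked with a short sketch. It assumes a wc that honours the locale for -m, as the GNU and BSD implementations do; the variable names are illustrative only.

```shell
#!/bin/sh
# The UTF-8 encoding of U+00A0 is the two bytes 0xC2 0xA0 (octal \302 \240).
# wc -c counts bytes; wc -m counts characters according to the locale.
# tr strips the padding that some wc implementations add.
bytes=$(printf '\302\240' | LC_ALL=C wc -c | tr -d ' ')
chars_c=$(printf '\302\240' | LC_ALL=C wc -m | tr -d ' ')

echo "bytes: $bytes, characters in the C locale: $chars_c"
```

In the C locale both counts should be 2, because each byte is its own character. In a UTF-8 locale the same input would normally count as a single character, which is why a byte pattern such as "\xc2\xa0" is more predictable to match under LANG=C or LC_ALL=C.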

answered Oct 5, 2017 at 7:01
