When I want to recursively search TeX files for characters unsupported by my font, I typically start with a search for non-breaking spaces and zero-width spaces. These are difficult to type on the terminal command line, so I use their UTF-8 hexadecimal representations.
env LANG=C grep -obUaP "\xc2\xa0" $(find -name '*.tex')
env LANG=C grep -obUaP "\xe2\x80\x8b" $(find -name '*.tex')
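One caveat with the commands above: the output of $(find ...) is split on whitespace, so a path containing a space reaches grep in pieces. A minimal sketch of an alternative follows, assuming GNU find and a GNU grep built with PCRE support (-P). The temporary directory and file names are made up for the demonstration, -l is used only to keep the output printable, and LC_ALL=C replaces LANG=C because LC_ALL, when set, overrides LANG.

```shell
# Throwaway tree: one clean file and one containing U+00A0 (UTF-8 bytes C2 A0).
# The directory name deliberately contains a space.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub dir"
printf 'plain ascii\n' > "$tmpdir/clean.tex"
printf 'a\302\240b\n' > "$tmpdir/sub dir/nbsp.tex"

# find -exec ... {} + hands each path to grep as a separate argument,
# so whitespace in file names is preserved.
# -l prints only the names of files that contain a match.
matches=$(find "$tmpdir" -name '*.tex' -exec env LC_ALL=C grep -lUaP '\xc2\xa0' {} +)
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

Replacing -l with -ob gives the byte offsets again, as in the original commands.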
Why do I need to explicitly set the LANG environment variable to C, as shown above with env LANG=C?
Notes
Using -U and -a simultaneously may seem erroneous, but this version of the manual states that:
When type is ‘binary’, grep may treat non-text bytes as line terminators even without the -z (--null-data) option.
-a forces only line terminators to be line terminators (not so clear).
http://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html
1 Answer
My version of the grep manual does not include this, but the grep 3.0 manual elaborates on the topic:
Warning: The -a (--binary-files=text) option might output binary garbage, which can have nasty side effects if the output is a terminal and if the terminal driver interprets some of it as commands. On the other hand, when reading files whose text encodings are unknown, it can be helpful to use -a or to set ‘LC_ALL='C'’ in the environment, in order to find more matches even if the matches are unsafe for direct display.
From this answer: https://unix.stackexchange.com/a/87763/33386
In the C locale, characters are single bytes, the charset is ASCII [...]
This is probably why it helps when scanning text files of unknown encoding: the C locale forces a single-byte ASCII character set, so each byte is treated as a character of its own.
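To make the byte-wise behaviour concrete, here is a small sketch, assuming GNU grep with -P support. The sample file is invented for the demonstration, and LC_ALL=C is used rather than LANG=C because a set LC_ALL takes precedence over LANG. In a UTF-8 locale, \xc2 in a -P pattern is, as far as I understand, read as the code point U+00C2 rather than as a raw byte, which would explain why the pattern only finds the byte pair C2 A0 in the C locale.

```shell
# A sample line containing U+00A0 (UTF-8 bytes C2 A0), written with octal escapes.
sample=$(mktemp)
printf 'a\302\240b\n' > "$sample"

# C locale: every byte is one character, so \xc2\xa0 matches the two raw bytes.
# -c counts matching lines; -a makes grep treat the file as text.
c_count=$(LC_ALL=C grep -caP '\xc2\xa0' "$sample")
echo "matching lines in the C locale: $c_count"

# -o -b prints "offset:match"; cut keeps only the offset so that no raw
# non-ASCII bytes are written to the terminal.
c_offset=$(LC_ALL=C grep -obaP '\xc2\xa0' "$sample" | cut -d: -f1)
echo "byte offset of the match: $c_offset"

rm -f "$sample"
```

The offset is counted from zero, so the match directly after the leading "a" is reported at offset 1.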
-U and -a?
-U changes the behavior of grep to allow for a bytewise search. As far as -a goes, the man page does not elaborate on the matter, but it says that -a makes grep process bytes as text, which is correct in this case. I am open to criticism, suggestions, or reasoning to improve my knowledge here.
-U (--binary) and -a (--text) are contradictory ;-). Or were you thinking of -u (--unix-byte-offsets)?
With -U, any non-text bytes including line terminators are treated as line terminators. -a forces only line terminators to be treated as line terminators. So using -a is merely a precaution to ensure that the line numbering is correct in the output when you know that the input is supposed to be text.
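The effect of -a on a file that grep would otherwise classify as binary can be sketched as follows, assuming GNU grep. The file contents are invented for the demonstration. Because the file contains a NUL byte, grep without -a only reports that a binary file matches (the exact wording differs between versions), while with -a it prints the matching line together with its line number.

```shell
# A two-line file whose first line contains a NUL byte.
binfile=$(mktemp)
printf 'needle\000tail\nsecond line\n' > "$binfile"

# Without -a: grep detects binary data and suppresses the matching line.
LC_ALL=C grep -n 'second' "$binfile"

# With -a: the file is processed as text; only the newline ends a line,
# so "second line" is reported as line 2.
with_a=$(LC_ALL=C grep -an 'second' "$binfile")
echo "$with_a"

rm -f "$binfile"
```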