Is there a way to search PDF files using grep, without converting to text first in Ubuntu?
- See also Is there some sort of PDF to text -converter? and Command line tool to search phrases in large number of pdf files. – Gilles 'SO- stop being evil', Jan 31, 2011 at 20:01
- For people coming here via search: if you are willing to convert to text files first, have a look at How to search contents of multiple pdf files? – Martin Thoma, Jan 2, 2016 at 22:09
Install the package `pdfgrep`, then use the command:
find /path -iname '*.pdf' -exec pdfgrep pattern {} +
——————
Simplest way to do that:
pdfgrep 'pattern' *.pdf
pdfgrep 'pattern' file.pdf
- This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks. – mikiemorales, Jan 23, 2014 at 1:28
- Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer, only pagewise rather than, presumably, the entire document. – vowel-house-might, Sep 16, 2014 at 11:11
- `pdfgrep` also has a recursive flag. So this answer could perhaps be reduced to: `pdfgrep -R pattern /path/`. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö. – Rovanion, Jan 14, 2016 at 12:11
- This answer would be easier to use if it explained which bits of the command are meant to be copied literally and which are placeholders. What's `pattern`? What's `{}`? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose. – Mark Amery, Apr 20, 2018 at 14:44
- @MarkAmery This answer is unnecessarily complex because he is using `find`. The usage is simply `pdfgrep 'pattern' file.pdf`. The `{}` is just a way to drop the file name in from `find`. – Jonathan Cross, Aug 28, 2018 at 10:30
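For anyone puzzled by the placeholders in the `find` command, here is a minimal sketch of how the pieces fit together. The function name `search_pdfs` and its optional third argument (a stand-in command for dry runs) are made up for illustration; only the `find ... -exec ... {} +` form and the `pdfgrep PATTERN FILE...` usage come from this thread.

```shell
#!/bin/sh
# search_pdfs PATTERN [DIR] [TOOL]
#   PATTERN  the text or regular expression to look for
#   DIR      directory to search recursively (defaults to the current one)
#   TOOL     command to run on the files; defaults to pdfgrep
#            (hypothetical parameter, handy for a dry run with `echo`)
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    tool=${3:-pdfgrep}
    # find replaces {} with the files it found. The trailing + batches many
    # files into one invocation: TOOL PATTERN a.pdf b.pdf ...
    find "$dir" -type f -iname '*.pdf' -exec "$tool" "$pattern" {} +
}

# Example (assumes pdfgrep is installed):
#   search_pdfs 'invoice' ~/Documents
# Dry run that only shows which command line would be built:
#   search_pdfs 'invoice' ~/Documents echo
```

Passing `echo` as the third argument prints the pattern followed by every file that would be searched, which is a cheap way to check the selection before running the real search.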
If you have `poppler-utils` installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to `grep`:
pdftotext my.pdf - | grep 'pattern'
This won't create a .txt file.
- @akira The OP probably meant "without opening the PDF in a viewer and exporting to text" – Michael Mrozek, Jan 31, 2011 at 17:36
- @akira Where do you see "grep only"? – Michael Mrozek, Jan 31, 2011 at 18:55
- @akira Well, I already said what I think he probably meant; he doesn't want to export to text before processing it. I very much doubt he has a problem with any command that converts to text in any way; there's no reason not to – Michael Mrozek, Feb 1, 2011 at 5:52
- Nice solution. What is the purpose of the `-` character preceding the pipe? I observed that without it, or with any other character, a file is created with the same name and the grep is not executed as expected. Is this how all linux piping is done when the intermediate file is unnecessary? – sherrellbc, Dec 4, 2015 at 18:42
- @sherrellbc The second argument of `pdftotext` is the filename it should write to. However, by convention, tools typically allow you to write to `stdout` instead of to a file by specifying a `-` instead. Similarly, some tools would write to `stdout` by default if you omit such an argument entirely (but this is not always possible without creating ambiguity). – Joost, Sep 23, 2016 at 14:06
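The `-` convention extends naturally to several files. The following is only a sketch: `grep_pdfs` is a made-up helper name, and the `PDFTOTEXT` variable is a hypothetical override for the converter command, added so the function can be tried with a stand-in.

```shell
#!/bin/sh
# grep_pdfs PATTERN FILE...
# Converts each PDF to text on stdout (the "-" argument) and greps that
# stream, prefixing every hit with the file name. No .txt file is written.
# PDFTOTEXT is a hypothetical override; by default pdftotext is used.
grep_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        "${PDFTOTEXT:-pdftotext}" "$f" - |
            grep -- "$pattern" |
            while IFS= read -r line; do
                printf '%s:%s\n' "$f" "$line"
            done
    done
}

# Example (assumes poppler-utils is installed):
#   grep_pdfs 'pattern' my.pdf other.pdf
```

The file name is added with `printf` instead of a grep option so that the sketch does not depend on a particular grep implementation.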
pdfgrep was written for exactly this purpose and is available in Ubuntu.
It tries to be mostly compatible with `grep` and thus provides "the power of grep", only specialized for PDFs. That includes common grep options, such as `--recursive`, `--ignore-case` or `--color`.
In contrast to `pdftotext | grep`, pdfgrep can output the page number of a match in a performant way and is generally faster when it doesn't have to search the whole document (e.g. `--max-count` or `--quiet`).
The basic usage is:
pdfgrep PATTERN FILE..
where `PATTERN` is your search string and `FILE` a list of filenames (or wildcards in a shell).
See the manpage for more information.
- As of release 2.0, `pdfgrep` has a `--cache` option to drastically speed up multiple searches on the same files. – Stefan Schmidt, Oct 24, 2022 at 0:48
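A small sketch of how the options mentioned in this answer might be wrapped for everyday use. The function names and the `PDFGREP` override variable are hypothetical; `--page-number` is pdfgrep's long form of `-n`, and the exit status is assumed to follow grep's convention (0 when something matched).

```shell
#!/bin/sh
# PDFGREP is a hypothetical override so the wrappers can be tried with a
# stand-in command; by default the real pdfgrep is called.

# pdf_pages PATTERN FILE...
# Case-insensitive search that prefixes each match with its page number.
pdf_pages() {
    pattern=$1
    shift
    "${PDFGREP:-pdfgrep}" --page-number --ignore-case "$pattern" "$@"
}

# pdf_mentions PATTERN FILE
# Prints nothing; only the exit status says whether FILE contains PATTERN.
# With --quiet the search can stop at the first hit.
pdf_mentions() {
    "${PDFGREP:-pdfgrep}" --quiet "$1" "$2"
}

# Example (assumes pdfgrep is installed):
#   pdf_pages 'deadline' thesis.pdf
#   pdf_mentions 'deadline' thesis.pdf && echo 'found it'
```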
No.
A pdf consists of chunks of data, some of them text, some of them pictures and some of them really magical fancy XYZ (e.g. .u3d files). Those chunks are most of the time compressed (e.g. with Flate, check http://www.verypdf.com/pdfinfoeditor/compression.htm). In order to 'grep' a .pdf you have to reverse the compression, aka extract the text.
You can do that either per file with tools such as `pdftotext` and grep the result, or you run an 'indexer' (look at xapian.org or lucene) which builds a searchable index out of your .pdf files and then you can use the search engine tools of that indexer to get the content of the pdf.
But no, you cannot `grep` pdf files and hope for reliable answers without extracting the text first.
- Considering `pdfgrep` exists (see above), a flat "no" is incorrect. – Jonathan Cross, Aug 28, 2018 at 10:18
- @JonathanCross, considering the question says "using the power of grep, without converting to text first", a flat "no" is correct. – Jivan Pal, Nov 24, 2020 at 8:58
- A good way to ask for an alternative without explicitly requesting an alternative. Anyway, lucene as indexer works generally well. – ChoCho, Apr 4, 2023 at 15:02
- @JonathanCross The verb convert is ambiguous. To insist on one interpretation instead of a possibly different one in the mind of the OP is uncharitable pedantry. – Pound Hash, Sep 10, 2023 at 21:08
- @JonathanCross :) "use grep without convert to text" is the limiting part of the question. A question like "can I use $something on CLI to search for text in PDFs" would have yielded a different answer from me ... or none at all because those `pdftotext` answers exist already. – akira, Sep 21, 2023 at 6:04
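The effect of compression on a plain grep can be shown without any PDF at all. The sketch below uses gzip as a stand-in, on the assumption that it is a fair analogy: it relies on the same DEFLATE algorithm as PDF's Flate filter, although a real PDF wraps its streams differently. The `sample` function and its text are invented for the demonstration.

```shell
#!/bin/sh
# sample prints 20 identical lines, each containing the word "needle".
sample() {
    i=0
    while [ "$i" -lt 20 ]; do
        echo 'find the needle in the haystack'
        i=$((i + 1))
    done
}

# Searching the compressed bytes directly: -a treats them as text,
# -c counts matching lines.
raw_hits=$(sample | gzip -c | grep -a -c 'needle')

# Reversing the compression first makes the text visible again.
extracted_hits=$(sample | gzip -c | gunzip -c | grep -c 'needle')

echo "compressed: $raw_hits, decompressed: $extracted_hits"
```

On a typical system this should report no hits in the compressed stream and 20 hits after decompression, which is the same reason a raw grep misses text inside compressed PDF streams.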
Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.
There is a duplicate question on StackOverflow. The people there suggest a variation of harish.venkat's answer:
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The advantage over the similar answer here is the `--with-filename` flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.
https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
- I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to. – Bernhard, May 9, 2014 at 12:07
Take a look at the common resource grep tool crgrep which supports searching within PDF files.
It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.
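Here is a quick script to search PDFs in the current directory:
#!/bin/bash
# Usage: script VALUE (searches every PDF below the current directory)
if [ $# -ne 1 ]; then
    echo "usage $0 VALUE" 1>&2
    exit 1
fi
echo 'SEARCH IS CASE SENSITIVE' 1>&2
# Inside bash -c, $0 is the search value and $1 is the file found by find.
find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color -- "$0"' "$1" {} \;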
- I cannot edit this because the change is too small: the `$1` in the find invocation should be quoted, otherwise this won't work with search terms containing spaces. – ankon, Aug 25, 2020 at 11:44
Try this:
find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i"; \
pdftotext "$i" - | grep pattern; done
This prints the lines in which the pattern occurs inside each PDF.
You could pipe it through `strings` first:
cat file.pdf | strings | grep <...etc...>
- Just use `strings file.pdf | grep <...>`, you don't need `cat` – phunehehe, Jan 31, 2011 at 14:31
- Yeah - my mind seems to work better with streams... :-) – Andy Smith, Jan 31, 2011 at 14:57
- Won't work if the text is compressed, which it is most of the time. – akira, Jan 31, 2011 at 15:18
- Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for `strings` or `grep`. – Jander, Jan 31, 2011 at 16:08
- Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others. – hourback, Nov 24, 2015 at 19:58
If you just want to search for pdf names/properties... or simple strings that are not compressed or encoded, then instead of `strings` you can use the below:
grep -a STRING file.pdf
cat -v file.pdf | grep STRING
From `grep --help`:
--binary-files=TYPE assume that binary files are TYPE;
TYPE is 'binary', 'text', or 'without-match'
-a, --text equivalent to --binary-files=text
and `cat --help`:
-v, --show-nonprinting use ^ and M- notation, except for LFD and TAB
ripgrep-all (or rga) enables ripgrep functionality on multiple file types, including PDFs.
cd to your folder containing your pdf-file and then..
pdfgrep 'pattern' your.pdf
or if you want to search in more than just one pdf-file (e.g. in all pdf-files in your folder)
pdfgrep 'pattern' `ls *.pdf`
or
pdfgrep 'pattern' $(ls *.pdf)
- Why on earth do you use ls to put filenames in parameters? It's not only slower but also a bad idea to use `ls` output as the input to other commands. Just `pdfgrep 'pattern' *.pdf` is enough – phuclv, Jan 31, 2019 at 5:07
- @phuclv You are wrong. `pdfgrep 'pattern' *.pdf` will not work. – f0nzie, Feb 25, 2020 at 19:55
- @f0nzie you're wrong. `$(ls *.pdf)` will be almost exactly the same as `*.pdf`, only worse because special files are not protected in quotes – phuclv, Feb 26, 2020 at 1:45
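The difference between the two forms is easy to check with a file name that contains a space. This is a sketch only: `count_args` is a made-up stand-in for `pdfgrep` that reports how many arguments it received, and the file names are invented.

```shell
#!/bin/sh
# count_args prints how many arguments it received (stand-in for pdfgrep).
count_args() { echo "$#"; }

start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch 'annual report.pdf' 'notes.pdf'

# The shell expands the glob into one argument per file, spaces included.
glob_count=$(count_args 'pattern' *.pdf)

# The output of ls is split on whitespace, so 'annual report.pdf'
# arrives as two separate words.
ls_count=$(count_args 'pattern' $(ls *.pdf))

echo "glob: $glob_count arguments, ls: $ls_count arguments"

cd "$start_dir" || exit 1
rm -rf "$demo_dir"
```

With the glob the command sees the pattern plus two file names; with `ls` it sees the pattern plus three words, two of which are not existing files.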
pdfgrep -r --include "*.pdf" -i 'pattern'
- Welcome to the site, and thank you for your contribution. Could you add some explanation on what these options mean? This could also help explain how your approach differs from other answers to this question that also recommend `pdfgrep`. – AdminBee, Aug 17, 2020 at 9:53
I assume you mean to not convert it on the disk; you can convert them to `stdout` with `pdftotext` and then grep the output. Grepping the pdf without any sort of conversion is not a practical approach since `PDF` is mostly a binary format.
In the directory:
ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {} - | grep "keyword"
or in the directory and its subdirectories:
tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} pdftotext {} - | grep "keyword"
Also, because some `pdf` files are scans, they need to be OCRed first. I wrote a pretty simple way to find all pdfs that cannot be `grep`ed and OCR them.
I noticed that if a `pdf` file doesn't have any fonts it is usually not searchable. Knowing this, we can use `pdffonts`.
The first 2 lines of the `pdffonts` output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:
gedit check_pdf_searchable.sh
then paste this
#!/bin/bash
#set -vx
# pdffonts prints a two-line header, so fewer than 3 lines means no fonts.
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    echo "$1"
    pypdfocr "$1"
fi
then make it executable
chmod +x check_pdf_searchable.sh
then list all non-searchable pdfs in the directory:
ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}
or in the directory and its subdirectories:
tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} ./check_pdf_searchable.sh {}
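The same font check can also be written as shell functions that only list the files and leave the OCR step to you. This is a sketch under assumptions: the names `needs_ocr` and `list_unsearchable` are made up, `PDFFONTS` is a hypothetical override for the `pdffonts` command, and a file that `pdffonts` cannot read at all will also be reported.

```shell
#!/bin/sh
# needs_ocr FILE
# Succeeds when pdffonts lists no fonts for FILE. The first two lines of
# its output are the table header, so fewer than three lines means no
# fonts, which usually indicates a scanned document.
needs_ocr() {
    lines=$(( $("${PDFFONTS:-pdffonts}" "$1" 2>/dev/null | wc -l) ))
    [ "$lines" -lt 3 ]
}

# list_unsearchable DIR
# Prints every PDF below DIR that probably needs OCR.
# File names containing newlines are not handled by this loop.
list_unsearchable() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        if needs_ocr "$pdf"; then
            printf '%s\n' "$pdf"
        fi
    done
}

# Example (assumes poppler-utils is installed):
#   list_unsearchable .
```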
gpdf might be what you need if you're using Gnome! Check this in case you're not using Gnome. It's got a list of CLI pdf viewers. Then you can use `grep` to find some pattern.
- There is no more gpdf or Evince in Ubuntu 25... Welcome to "GNOME Papers"! – Андрей Тернити, Jul 28, 2025 at 7:34
Put this in your bashrc:
LESSOPEN="|/usr/bin/lesspipe %s"; export LESSOPEN
Then you can use less:
less mypdf.pdf | grep "Hello, World"
See https://www.zeuthen.desy.de/~friebel/unix/lesspipe.html for more about this.
- lesspipe uses pdftotext under the hood, so it's still better to use pdftotext directly – phuclv, Sep 11, 2025 at 1:52
If you want to use a GUI you can try `pdfgrepgui`.
Quickest way is
grep -rinw "pattern" --include \*.pdf *
- Welcome to the site. Would you mind adding more explanation to your proposed solution to make it more accessible to the non-expert? For example, your `grep` command-line searches recursively in sub-directories which someone not familiar with `grep` might be unaware of. Also, you included the `-i` flag although ignoring the case may not always be what the user wants. In addition, please explain in what way your approach differs from the answer of e.g. @phuclv and others. – AdminBee, Jan 21, 2020 at 8:12
- As AdminBee says, the question doesn’t ask for a case-insensitive search or a recursive directory search. The `-n` and `-w` options aren’t justified by the question, either. But, more importantly, this answer tells how to search through text files whose names end with `.pdf`; you’ve missed the point of the question. – G-Man Says 'Reinstate Monica', Jan 21, 2020 at 8:22
- pdf is not a text file. This won't work most of the time, just like the above answers that also use grep – phuclv, Sep 11, 2025 at 1:48