Full Support of MuPDF OCR Interface #1341
Since PyMuPDF v1.19.0, MuPDF's OCR functionality is fully supported.
In a nutshell, the following features are available:
- Extract all text from any page of any document. The page may be just a scanned image, or a mixture of normal text and text contained in displayed images.
- OCR any image and output it as a 1-page PDF. This allows appending a list of images as OCRed pages to a new or existing PDF.
- Selectively OCR text containing illegible characters - e.g. for an unsupported font.
For all of these use cases you will find example scripts and Jupyter notebooks in the PyMuPDF Utilities repository.
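As a rough illustration of the first use case, here is a minimal sketch. It assumes PyMuPDF v1.19.0 or later is installed and Tesseract's language data can be found; the helper name `ocr_page_text` and its defaults are made up for this example and do not come from the utilities repository.

```python
try:
    import fitz  # PyMuPDF, v1.19.0 or later
except ImportError:  # allows defining the helper without PyMuPDF installed
    fitz = None


def ocr_page_text(path, page_number=0, language="eng", dpi=300, full=False):
    """Return the text of one page, using OCR where needed (hypothetical helper)."""
    if fitz is None:
        raise RuntimeError("PyMuPDF is not installed")
    doc = fitz.open(path)
    try:
        page = doc[page_number]
        # full=False: keep normal text and OCR only the images shown on the page.
        # full=True: render the whole page as an image and OCR all of it,
        # which suits pages that are just a scan.
        textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=full)
        return page.get_text("text", textpage=textpage)
    finally:
        doc.close()
```

A call such as `ocr_page_text("scan.pdf", full=True)` (the file name is a placeholder) would then return the recognized text of the first page.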
MuPDF's OCR capabilities are based on Tesseract-OCR, which must be installed separately - it is not a Python package. The core parts of that software are however built into MuPDF (and are therefore also part of PyMuPDF's binary). During runtime, the only resources required are contained in Tesseract's language support folder, `tessdata`. MuPDF needs to access data in this folder when OCR functions are being executed. To enable this, the environment variable `TESSDATA_PREFIX` must be defined and contain the name of this folder. Typically, it looks like `C:\Program Files\Tesseract-OCR\tessdata` on Windows or `/usr/share/tesseract-ocr/4.00/tessdata` on Unix-based systems.
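Before calling any OCR function, it can help to check that the variable is really set and points to a usable folder. The following is only a sketch using the standard library; `find_tessdata` is a hypothetical helper, and the check for `*.traineddata` files relies on Tesseract's usual naming of language files (for example `eng.traineddata`).

```python
import os
from pathlib import Path


def find_tessdata(environ=None):
    """Return the folder named by TESSDATA_PREFIX, or raise RuntimeError."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set; OCR functions need it.")
    folder = Path(prefix)
    if not folder.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX points to a missing folder: {folder}")
    # Tesseract language files are named like 'eng.traineddata'.
    if not any(folder.glob("*.traineddata")):
        raise RuntimeError(f"No *.traineddata files found in {folder}")
    return folder
```

Calling `find_tessdata()` at the start of a script gives a clear error message early, instead of a failure deep inside an OCR call.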
It is not possible to set this variable by manipulating `os.environ`. You could, however, start a system command from your script before using any OCR.
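One possible reading of this suggestion is to run the OCR work in a child process whose environment already contains the variable when it starts. This is an assumption about what is meant, not a confirmed recipe. The sketch below uses only the standard library, and `run_with_tessdata` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def run_with_tessdata(command, tessdata_dir):
    """Run `command` in a child process that sees TESSDATA_PREFIX from its start."""
    # Copy the parent's environment so the child keeps PATH and similar settings,
    # then add the variable for the child only.
    child_env = dict(os.environ, TESSDATA_PREFIX=str(tessdata_dir))
    return subprocess.run(
        command,
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for a real OCR script: the child only reports what it sees.
    probe = [sys.executable, "-c", "import os; print(os.environ['TESSDATA_PREFIX'])"]
    result = run_with_tessdata(probe, "/usr/share/tesseract-ocr/4.00/tessdata")
    print(result.stdout.strip())
```

In practice the probe command would be replaced by the script that imports PyMuPDF and performs the OCR, so that the variable is defined before the library is loaded.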