Full Support of MuPDF OCR Interface · pymupdf/PyMuPDF · Discussion #1341

JorjMcKie
Oct 26, 2021
Maintainer

Since PyMuPDF v1.19.0, MuPDF's OCR functionality is fully supported.

In a nutshell, the following features are available:

Extract all text from any page of any document. The page may be just a scanned image, or a mixture of normal text and text contained in displayed images.
OCR any image and output it as a 1-page PDF. This allows appending a list of images as OCRed pages to a new or existing PDF.
Selectively OCR text containing illegible characters, e.g. text rendered in an unsupported font.
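
As a rough illustration of the first two use cases, here is a minimal sketch. It assumes PyMuPDF v1.19.0 or later, a defined TESSDATA_PREFIX, and installed English language data. The function names, file paths and the dpi value are made up for this example; the scripts in PyMuPDF Utilities remain the reference.

```python
def ocr_page_text(pdf_path, page_number=0, language="eng", dpi=300):
    """Return the text of one page, including text that only exists in images."""
    import fitz  # PyMuPDF >= 1.19.0; imported here so the module loads without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # full=True OCRs the whole page; full=False only OCRs displayed images
        textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
        return page.get_text("text", textpage=textpage)
    finally:
        doc.close()


def images_to_ocr_pdf(image_paths, out_path, language="eng"):
    """Append each image as an OCRed 1-page PDF to a new document."""
    import fitz  # PyMuPDF >= 1.19.0

    out = fitz.open()  # new, empty PDF
    for path in image_paths:
        pix = fitz.Pixmap(path)
        # pdfocr_tobytes returns a 1-page PDF (image plus OCR text layer) as bytes
        page_pdf = fitz.open("pdf", pix.pdfocr_tobytes(language=language))
        out.insert_pdf(page_pdf)
        page_pdf.close()
    out.save(out_path)
    out.close()
```

A call such as `ocr_page_text("scan.pdf")` would then return the recognized text of the first page, where "scan.pdf" is a placeholder file name.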

For all of these use cases you will find example scripts and Jupyter notebooks in the PyMuPDF Utilities repository.

MuPDF's OCR capabilities are based on Tesseract-OCR, which must be installed separately - it is not a Python package. The core parts of that software are however built into MuPDF (and are therefore also part of PyMuPDF's binary). At runtime, the only resources required are those in Tesseract's language support folder, "tessdata". MuPDF needs to read data in this folder whenever an OCR function is executed. To enable this, the environment variable TESSDATA_PREFIX must be defined and contain the path of this folder. Typical values are C:\Program Files\Tesseract-OCR\tessdata on Windows or /usr/share/tesseract-ocr/4.00/tessdata on Unix-based systems.
Setting this variable from inside your script by manipulating os.environ does not work. You could, however, start a system command from your script before using any OCR.
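
Before calling any OCR function, it can help to verify that the variable is defined and points to a real folder. The following is only a sketch using the standard library; the helper names are invented for this example, and the language listing relies on Tesseract's usual `<language>.traineddata` file naming.

```python
import os
from pathlib import Path


def find_tessdata(env=None):
    """Return the folder named by TESSDATA_PREFIX, or raise RuntimeError."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not defined")
    folder = Path(prefix)
    if not folder.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX is not a folder: {folder}")
    return folder


def available_languages(folder):
    """List language codes for which a .traineddata file exists in folder."""
    return sorted(p.stem for p in Path(folder).glob("*.traineddata"))
```

For example, `available_languages(find_tessdata())` would show which values are usable for the language argument of the OCR functions on the current machine.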
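
One possible reading of "start a system command" is to run the OCR work in a child process whose environment already contains the variable when it starts. This is an interpretation rather than a documented recipe. The helper name is hypothetical, and "ocr_job.py" in the usage note stands for whatever script does the actual OCR.

```python
import os
import subprocess
import sys


def run_with_tessdata(script_args, tessdata_dir):
    """Run a Python child process that sees TESSDATA_PREFIX from its start."""
    env = dict(os.environ)  # copy, so the parent's environment stays untouched
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
```

A call could look like `run_with_tessdata(["ocr_job.py", "scan.pdf"], "/usr/share/tesseract-ocr/4.00/tessdata")`; the child's output is then available in the `stdout` attribute of the returned object.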
