Full Support of MuPDF OCR Interface #1341

JorjMcKie started this conversation in Show and tell

Since PyMuPDF v1.19.0, MuPDF's OCR functionality is fully supported.

In a nutshell, the following features are available:

  • Extract all text from any page of any document. The page may be just a scanned image, or a mixture of normal text and text contained in displayed images.
  • OCR any image and output it as a 1-page PDF. This allows appending a list of images as OCRed pages to a new or existing PDF.
  • Selectively OCR text containing illegible characters - e.g. for an unsupported font.
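
The three use cases above could be sketched roughly as follows. This is a minimal illustration, not one of the published example scripts. It assumes PyMuPDF is importable as fitz, that a tessdata folder is reachable, and that English language data is installed. The function names, the dpi and zoom values, and all file paths are placeholders chosen here. The full argument of get_textpage_ocr may need a release slightly newer than v1.19.0.

```python
def ocr_page_text(pdf_path, page_number=0, language="eng", dpi=300):
    """Return the text of one page, OCRing the fully rendered page."""
    import fitz  # PyMuPDF; imported lazily so the sketch loads without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # full=True OCRs the whole rendered page; full=False would OCR only
        # the images shown on the page and keep normal text as it is.
        textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
        return page.get_text("text", textpage=textpage)
    finally:
        doc.close()


def images_to_ocr_pdf(image_paths, out_path, language="eng"):
    """Append each image as an OCRed page to a new PDF and save it."""
    import fitz

    doc = fitz.open()  # new, empty PDF
    for path in image_paths:
        pix = fitz.Pixmap(path)
        # pdfocr_tobytes() returns a 1-page PDF that carries an OCR text layer
        page_pdf = fitz.open("pdf", pix.pdfocr_tobytes(language=language))
        doc.insert_pdf(page_pdf)
        page_pdf.close()
    doc.save(out_path)
    doc.close()


def ocr_clip(page, bbox, language="eng", zoom=4):
    """OCR only the rectangle bbox of a page, e.g. a span with illegible text."""
    import fitz

    # Render just the clip area, enlarged to give the OCR engine more pixels.
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=bbox)
    ocr_pdf = fitz.open("pdf", pix.pdfocr_tobytes(language=language))
    try:
        return ocr_pdf[0].get_text().strip()
    finally:
        ocr_pdf.close()
```

A caller might use ocr_page_text("scan.pdf") for a scanned page, or pass ocr_clip the bounding box of a text span whose extracted characters are unreadable.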

For all of these use cases you will find example scripts and Jupyter notebooks in the PyMuPDF Utilities repository.

MuPDF's OCR capabilities are based on Tesseract-OCR, which must be installed separately; it is not a Python package. The core parts of that software are, however, built into MuPDF (and are therefore also part of PyMuPDF's binary). At runtime, the only additional resources required are the files in Tesseract's language support folder, "tessdata". MuPDF needs to access data in this folder whenever an OCR function is executed. To enable this, the environment variable TESSDATA_PREFIX must be defined and contain the path of this folder. Typical values are C:\Program Files\Tesseract-OCR\tessdata on Windows or /usr/share/tesseract-ocr/4.00/tessdata on Unix-like systems.
It is not possible to set this variable from within your script by manipulating os.environ. You could, however, start a system command from your script before using any OCR.
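
One possible reading of "start a system command" is to run the OCR work in a child process whose environment already contains TESSDATA_PREFIX. The sketch below takes that approach using only the standard library. It is an assumption about how one might do it, not a documented PyMuPDF recipe, and the helper names find_language_files and run_with_tessdata are invented for illustration. It also checks that the folder holds Tesseract language files, which are named <lang>.traineddata.

```python
import os
import subprocess
import sys
from pathlib import Path


def find_language_files(tessdata_dir):
    """Return the sorted language codes found as <lang>.traineddata files."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def run_with_tessdata(script_args, tessdata_dir):
    """Run a Python child process with TESSDATA_PREFIX set in its environment."""
    if not find_language_files(tessdata_dir):
        raise FileNotFoundError(f"no *.traineddata files in {tessdata_dir!r}")
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, TESSDATA_PREFIX=str(tessdata_dir))
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, run_with_tessdata(["ocr_job.py", "scan.pdf"], "/usr/share/tesseract-ocr/4.00/tessdata") would start a hypothetical ocr_job.py that imports PyMuPDF and does the OCR, with the variable already defined when that process starts.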
