
How to extract text in natural reading order (top to bottom, left to right)


Aaron Taylor edited this page Jun 11, 2023 · 2 revisions

Easiest way

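First of all, use SortedCollection. It is a sorted-container recipe class, not part of the Python standard library, so it must be available on your import path.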

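from operator import itemgetter
from itertools import groupby
import fitz  # PyMuPDF
# SortedCollection is not in the standard library. The module name below is
# an assumption; adjust it to wherever the class lives in your project.
from sortedcollection import SortedCollection

doc = fitz.open( 'mydocument.pdf' )
for page in doc:
    # Each word is a tuple: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
    text_words = page.get_text_words()
    # The words should be ordered by y1 and x0
    sorted_words = SortedCollection( key = itemgetter( 3, 0 ) )
    for word in text_words:
        sorted_words.insert( word )
    # At this point you already have an ordered list. If you need to
    # group the content by lines, use groupby with y1 as a key
    lines = groupby( sorted_words, key = itemgetter( 3 ) )
    # Enjoy!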
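
Grouping on the exact y1 value can split one visual line into several groups when words on that line have slightly different bottom coordinates. The following is a minimal sketch of one way around that, using only the standard library: it sorts by (y1, x0) as above, then starts a new line only when y1 moves by more than a tolerance. The function name `group_words_into_lines`, the `tolerance` parameter and its default, and the sample word tuples are all hypothetical and only illustrate the idea. They are not part of PyMuPDF.

```python
from operator import itemgetter


def group_words_into_lines(words, tolerance=3.0):
    """Return text lines built from PyMuPDF-style word tuples.

    Each word is expected as (x0, y0, x1, y1, "word", block_no, line_no, word_no).
    `tolerance` (hypothetical parameter) is the largest y1 difference, in points,
    still treated as belonging to the same line.
    """
    lines = []          # each entry is the list of word tuples on one line
    line_bottom = None  # y1 of the first word placed on the current line
    # Same ordering as the recipe above: by y1, then by x0.
    for word in sorted(words, key=itemgetter(3, 0)):
        if line_bottom is None or abs(word[3] - line_bottom) > tolerance:
            lines.append([word])
            line_bottom = word[3]
        else:
            lines[-1].append(word)
    # Within a line, small y1 differences may have disturbed the x order,
    # so sort each line by x0 before joining the word strings.
    return [
        " ".join(w[4] for w in sorted(line, key=itemgetter(0)))
        for line in lines
    ]


# Made-up sample data standing in for page.get_text_words() output.
sample_words = [
    (120.0, 100.0, 160.0, 112.0, "world", 0, 0, 1),
    (72.0, 100.2, 110.0, 112.3, "Hello", 0, 0, 0),
    (72.0, 120.0, 100.0, 132.0, "Next", 0, 1, 0),
    (105.0, 119.8, 140.0, 131.9, "line", 0, 1, 1),
]

for text_line in group_words_into_lines(sample_words):
    print(text_line)
```

A suitable tolerance depends on the font sizes in the document. A value that is too large merges adjacent lines, and one that is too small splits them again.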
