pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.




Programming language: Python

License: Apache License 2.0

Tags: Text Processing Specific Formats Processing PDF OCR Scientific Engineering Information Analysis Utilities Data Mining Scanned Documents

Latest version: v0.3.0





README

pdftabextract - A set of tools for data mining (OCR-processed) PDFs

July 2016 / Feb. 2017, Markus Konrad markus.konrad@wzb.eu / Berlin Social Science Center

IMPORTANT INITIAL NOTES

From time to time I receive emails from people trying to extract tabular data from PDFs. I'm fine with that and I'm glad to help. However, some people think that pdftabextract is some kind of magic wand that automatically extracts the data they want by simply running one of the provided examples on their documents. In the vast majority of cases, this won't work. I want to clear up a few things that you should consider before using this software and before writing an email to me:

pdftabextract is not OCR (optical character recognition) software. It requires scanned pages with OCR information, i.e. a "sandwich PDF" that contains both the scanned images and the recognized text. You need software like tesseract or ABBYY Finereader for OCR. To check whether you have a "sandwich PDF", open your PDF and press "select all". This usually reveals the OCR-processed text information.
pdftabextract is a kind of last resort for when all other approaches to extracting tabular data from PDFs fail. Before trying it out, you should ask yourself the following questions:
- Is there really no other way / no other format for which the data is available?
- Can special OCR software like ABBYY Finereader detect and extract the tables? (You need to try this with a large sample of pages -- I found the table recognition in Finereader often unreliable.)
- Is it possible to extract the recognized text as-is from the PDFs and parse it? Try using the pdftotext tool from poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts: pdftotext -layout yourdocument.pdf. This will create a file yourdocument.txt containing the recognized text (from the OCR) with a layout that hopefully resembles your tables. Often, this can be parsed directly (e.g. with a Python script using regular expressions). If it can't be parsed (e.g. if the columns are not well separated in the text, the tables on each page differ too much from each other to find a common structure for parsing, or the pages are too skewed or rotated), then pdftabextract is the right software for you.
pdftabextract is a set of tools. As such, it contains functions that are suitable for certain documents but not for others, and many functions require you to set parameters that depend on the layout, scan quality, etc. of your documents. You can't just use the example scripts blindly with your data. You will need to adjust the parameters so that they work well with your documents. Below are some hints and explanations regarding those tools and their parameters.
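
The plain-text route mentioned above (pdftotext -layout followed by a regular-expression parser) might look roughly like the following sketch. It is only an illustration: the sample text, the helper names and the rule that two or more spaces separate columns are assumptions, not part of pdftabextract.

```python
import re

# Hypothetical excerpt of what `pdftotext -layout` might write for a small table.
SAMPLE = """\
Name            City        Pupils
Goethe School   Berlin         312
Humboldt Gym.   Potsdam         87
"""

# Assumption: columns are separated by runs of at least two spaces,
# while single spaces belong to the cell content.
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one line of layout-preserved text into cell strings."""
    return COLUMN_GAP.split(line.strip())


def parse_layout_text(text):
    """Return a list of rows (lists of cells), skipping empty lines."""
    return [split_layout_line(ln) for ln in text.splitlines() if ln.strip()]


rows = parse_layout_text(SAMPLE)
header, records = rows[0], rows[1:]
```

If a rule this simple already yields clean rows for a large sample of your pages, the more involved tools in this package are probably not needed.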

Introduction

This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. This is very simple -- see the section below for instructions.

Module overview

After converting your PDFs to pdf2xml format, you can view the extracted text boxes with the pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Lines or text box positions can be clustered in order to detect table columns and rows using the clustering module. Once columns and rows have been detected, they can be converted to a page grid with the extract module, and their contents can be extracted using fit_texts_into_grid in the same module. extract also allows you to export the data as a pandas DataFrame.
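
To make the clustering step more concrete, here is a minimal, library-independent sketch of how one-dimensional positions can be grouped into column candidates. It is not the implementation of the clustering module: the function names, the max_gap parameter and the use of the mean as cluster center are assumptions chosen for illustration.

```python
def cluster_positions(values, max_gap):
    """Group 1D positions; a new cluster starts whenever the gap to the
    previous (sorted) value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def cluster_centers(clusters):
    """Use the mean of each cluster as its representative position."""
    return [sum(c) / len(c) for c in clusters]


# Hypothetical left x-coordinates (in pixels) of text boxes on one page.
lefts = [52, 50, 51, 210, 208, 212, 405, 400]
columns = cluster_centers(cluster_positions(lefts, max_gap=20))
```

The value of max_gap has to be tuned per document: too small and a slightly ragged column splits in two, too large and neighbouring columns merge.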

If your scanned pages are double pages, you will need to pre-process them with splitpages.

Examples and tutorials

An extensive tutorial has been published; it is derived from the Jupyter Notebook contained in the examples. There are more use cases and demonstrations in the examples directory.

Features

- load and parse files in pdf2xml format (common module)
- split scanned double pages (splitpages module)
- detect lines in scanned pages via image processing (imgproc module)
- detect page rotation or skew and fix it (imgproc and textboxes modules)
- detect clusters in detected lines or text box positions in order to find column and row positions (clustering module)
- extract tabular data and convert it to a pandas DataFrame (which allows export to CSV, Excel, etc.) (extract module)
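
The idea behind the last feature, placing text boxes into a grid of detected column and row positions, can be sketched without the library as follows. This is a simplified stand-in and not the extract module's fit_texts_into_grid: the dictionary keys, the edge lists and the rule of assigning each box by its center point are assumptions.

```python
from bisect import bisect_right


def fit_boxes_into_grid(boxes, col_edges, row_edges):
    """Assign each text box to the grid cell that contains its center.

    col_edges / row_edges are sorted boundary positions, so n edges
    describe n - 1 columns or rows. Boxes outside the grid are ignored.
    """
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    # Process boxes in reading order so texts sharing a cell join sensibly.
    for box in sorted(boxes, key=lambda b: (b["top"], b["left"])):
        cx = box["left"] + box["width"] / 2
        cy = box["top"] + box["height"] / 2
        col = bisect_right(col_edges, cx) - 1
        row = bisect_right(row_edges, cy) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            table[row][col] = (table[row][col] + " " + box["text"]).strip()
    return table


# Hypothetical text boxes and grid boundaries (pixel coordinates).
boxes = [
    {"left": 10, "top": 10, "width": 40, "height": 20, "text": "Name"},
    {"left": 110, "top": 12, "width": 30, "height": 20, "text": "Age"},
    {"left": 10, "top": 60, "width": 40, "height": 20, "text": "Ada"},
    {"left": 115, "top": 60, "width": 20, "height": 20, "text": "36"},
    {"left": 250, "top": 60, "width": 20, "height": 20, "text": "stray"},
]
table = fit_boxes_into_grid(boxes, col_edges=[0, 100, 200], row_edges=[0, 50, 100])
```

A nested list like this can be handed to pandas.DataFrame for export to CSV or Excel.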

Installation

This package is available on PyPI and can be installed via pip: pip install pdftabextract

Requirements

The requirements are listed in requirements.txt and are installed automatically if you use pip.

Only Python 3 -- No Python 2 support.

Converting PDF files to XML files with pdf2xml format

You need to convert your PDFs using poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the pdftohtml command, which creates an XML file in pdf2xml format when run in the terminal as follows:

pdftohtml -c -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important that you specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can furthermore add the parameters -f n and -l n (first and last page) to convert only a range of pages.
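
When many files need converting, the command can be assembled from Python, and the resulting XML can be inspected with the standard library alone. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the sample XML only mimics the typical shape of pdftohtml output (page and text elements with top, left, width and height attributes). For real work, the loading functions in the common submodule are the intended route.

```python
import xml.etree.ElementTree as ET


def build_pdftohtml_cmd(pdf_path, xml_path, first_page=None, last_page=None):
    """Build the pdftohtml argument list; -f / -l are added only if given."""
    cmd = ["pdftohtml", "-c", "-hidden", "-xml"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [pdf_path, xml_path]


# To actually run the conversion (requires poppler-utils to be installed):
#   import subprocess
#   subprocess.run(build_pdftohtml_cmd("input.pdf", "output.xml"), check=True)

# Hypothetical, heavily shortened example of a pdf2xml document.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="80" height="12" font="0">Name</text>
<text top="100" left="210" width="40" height="12" font="0"><b>Age</b></text>
</page>
</pdf2xml>"""


def read_text_boxes(xml_string):
    """Return one dict per text element with its page, position and text."""
    boxes = []
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        for t in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                # itertext() also collects text nested in <b> or <i> tags.
                "text": "".join(t.itertext()),
            })
    return boxes


text_boxes = read_text_boxes(SAMPLE_XML)
```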

Usage and examples

For usage and background information, please read my series of blog posts about data mining PDFs.

The example images show, in order:

- Original page
- Generated (and skewed) pdf2xml file viewed with pdf2xml-viewer
- Detected lines
- Detected clusters of vertical lines (columns)
- Generated page grid viewed in pdf2xml-viewer
- Excerpt of the extracted data

License

Apache License 2.0. See LICENSE file.
