Introduction
This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. This is simple; see the conversion instructions below.
Module overview
After the conversion you can view the extracted text boxes with the pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Lines or text box positions can be clustered using the clustering module in order to detect table columns and rows. Once columns and rows have been detected, they can be converted to a page grid with the extract module, and their contents can be extracted using fit_texts_into_grid in the same module. extract also allows you to export the data as a pandas DataFrame.
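To make the flow above more concrete, here is a minimal sketch that does not use pdftabextract itself. It parses a small hand-written pdf2xml snippet with the standard library and groups the left edges of the text boxes into columns by looking for gaps. The sample XML, the function names parse_text_boxes and cluster_positions, and the 20-pixel gap threshold are assumptions made for illustration; they are not the package's API, and the clustering module provides its own functions for this step.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of pdftohtml's XML output (an assumption
# for illustration; real files contain more attributes and elements).
SAMPLE_XML = """
<pdf2xml>
  <page number="1" top="0" left="0" height="1000" width="800">
    <text top="100" left="50" width="60" height="12" font="0">Name</text>
    <text top="100" left="300" width="40" height="12" font="0">Year</text>
    <text top="130" left="52" width="70" height="12" font="0">Alice</text>
    <text top="130" left="303" width="40" height="12" font="0">2016</text>
    <text top="160" left="49" width="50" height="12" font="0">Bob</text>
    <text top="160" left="298" width="40" height="12" font="0">2017</text>
  </page>
</pdf2xml>
"""


def parse_text_boxes(xml_string):
    """Return one dict per <text> element with its page, position and text."""
    root = ET.fromstring(xml_string.strip())
    boxes = []
    for page in root.iter("page"):
        for text_el in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(text_el.get("left")),
                "top": int(text_el.get("top")),
                "text": "".join(text_el.itertext()).strip(),
            })
    return boxes


def cluster_positions(values, max_gap):
    """Group sorted 1D positions; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


boxes = parse_text_boxes(SAMPLE_XML)
column_clusters = cluster_positions([b["left"] for b in boxes], max_gap=20)
# Use the leftmost position of each cluster as the column start.
column_starts = [min(cluster) for cluster in column_clusters]
print(column_starts)
```

The same idea applies to the top positions for detecting rows. On real scans the gap threshold has to be tuned per document, which is one reason the examples rarely work unchanged on other PDFs.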
July 2016 / Feb. 2017, Markus Konrad markus.konrad@wzb.eu / Berlin Social Science Center
IMPORTANT INITIAL NOTES
From time to time I receive emails from people trying to extract tabular data from PDFs. I'm fine with that and I'm glad to help. However, some people think that pdftabextract is some kind of magic wand that automatically extracts the data they want by simply running one of the provided examples on their documents. In the vast majority of cases, this won't work. I want to clear up a few things that you should consider before using this software and before writing an email to me:
First try the pdftotext tool from poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts: pdftotext -layout yourdocument.pdf. This will create a file yourdocument.txt containing the recognized text (from the OCR) with a layout that hopefully resembles your tables. Often, this can be parsed directly (e.g. with a Python script using regular expressions). If it can't be parsed (e.g. because the columns are not well separated in the text, the tables on each page differ too much from each other to come up with a common structure for parsing, or the pages are too skewed or rotated), then pdftabextract is the right software for you.
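As a rough illustration of parsing pdftotext -layout output with regular expressions, the following sketch splits each line into columns wherever two or more spaces occur. The sample text, the column names and the helper parse_layout_table are made up for this example; real output is usually messier and may need a different separator rule.

```python
import re

# Made-up sample resembling the output of "pdftotext -layout" for a simple table.
SAMPLE_LAYOUT_TEXT = """\
Name            Year    Amount
Alice           2016     12.50
Bob Smith       2017      7.25
"""

# Assumption: columns are separated by runs of at least two spaces, while
# single spaces belong to the cell content (e.g. "Bob Smith").
COLUMN_SEPARATOR = re.compile(r" {2,}")


def parse_layout_table(text):
    """Split every non-empty line into a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(COLUMN_SEPARATOR.split(line))
    return rows


rows = parse_layout_table(SAMPLE_LAYOUT_TEXT)
header, data = rows[0], rows[1:]
# Turn each data row into a dict keyed by the header cells.
records = [dict(zip(header, row)) for row in data]
print(records)
```

If a split like this yields a different number of cells from line to line, that is a good sign the columns are not well separated and the simple approach will not be enough.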
If your scanned pages are double pages, you will need to pre-process them with splitpages.
An extensive tutorial was posted here and is derived from the Jupyter Notebook contained in the examples. There are more use-cases and demonstrations in the examples directory.
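Load and parse files in pdf2xml format (common module)
Pre-process scanned double pages (splitpages module)
Detect lines in the scanned images (imgproc module)
Detect and fix skewed or rotated pages (imgproc and textboxes module)
Cluster lines or text box positions to detect table columns and rows (clustering module)
Build a page grid, fit the text contents into it, and export the data (extract module)
This package is available on PyPI and can be installed via pip: pip install pdftabextract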
The requirements are listed in requirements.txt and are installed automatically if you use pip.
Only Python 3 -- No Python 2 support.
You need to convert your PDFs using poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the pdftohtml command, which creates an XML file in pdf2xml format when run in the terminal as follows:
pdftohtml -c -hidden -xml input.pdf output.xml
The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important that you specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can also add the parameters -f n and -l n to convert only a range of pages.
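If you convert many files, it can help to assemble this command in a script. The sketch below only builds the argument list from the flags described above; the helper name build_pdftohtml_command and its parameters are hypothetical, and actually running the command requires poppler-utils to be installed.

```python
import shlex


def build_pdftohtml_command(input_pdf, output_xml, first_page=None, last_page=None):
    """Build the pdftohtml argument list for creating a pdf2xml file."""
    # -hidden is needed for OCR-processed ("sandwich") PDFs.
    cmd = ["pdftohtml", "-c", "-hidden", "-xml"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [input_pdf, output_xml]
    return cmd


cmd = build_pdftohtml_command("input.pdf", "output.xml", first_page=3, last_page=5)
print(shlex.join(cmd))
# prints: pdftohtml -c -hidden -xml -f 3 -l 5 input.pdf output.xml

# To run the conversion (requires poppler-utils on your PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.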
For usage and background information, please read my series of blog posts about data mining PDFs.
See the following images of the example input/output:
Original page
OCR PDF example in the viewer
Detected lines
Detected clusters of vertical lines (columns)
Generated page grid viewed in pdf2xml-viewer
Excerpt of the extracted data
Apache License 2.0. See LICENSE file.