Extracting Tables from Documents (GUI)

Iydon Liang edited this page Jul 16, 2021 · 2 revisions

wxTableExtract.py

This script, based on wxPython and PyMuPDF, lets you browse a document and extract tables from it. It uses ParseTab, which is contained in the same (examples) directory.

How it works

The script first presents a file selection dialog for picking a document.

If the document is encrypted, the script asks for a decryption password. The document's first page is then displayed in another dialog. Controls along the top and left sides of this dialog provide the functions listed below. Which controls are available depends on the situation: for example, you can only add a column with New Col after a rectangle has been painted.
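The password step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `unlock`, `ask_password` and `max_tries` are hypothetical names, and `doc` is only assumed to behave like a PyMuPDF Document, i.e. to offer `needs_pass` and an `authenticate(password)` method that returns 0 for a wrong password.

```python
def unlock(doc, ask_password, max_tries=3):
    """Return True once `doc` is readable, False if all attempts failed.

    `doc` is assumed to behave like a PyMuPDF Document (`needs_pass`,
    `authenticate`). `ask_password` is a hypothetical callback that stands
    in for the GUI password dialog and returns None when the user cancels.
    """
    if not doc.needs_pass:
        return True               # not encrypted, nothing to do
    for _ in range(max_tries):
        password = ask_password()
        if password is None:      # user cancelled the dialog
            return False
        if doc.authenticate(password):   # 0 means "wrong password"
            return True
    return False
```

With PyMuPDF installed, this could be used as `doc = fitz.open(filename)` followed by `unlock(doc, lambda: input("Password: "))` before loading the first page.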

  • You can browse forward and backward in the document using the buttons or the mouse wheel.
  • You can jump to a specific page.
  • You can paint a rectangle on a displayed page using the New Rect button. You can fine-tune it with the spin controls. Pressing the New Rect button again destroys any existing rectangle and columns. The same happens when you leave the page.
  • After a rectangle has been painted, you can paint one or more columns into it via the New Col button. Columns are shown as vertical lines. You can modify a column by selecting it in the choice box and using the spin control. The column being changed this way switches its colour from red to blue. A column can be deleted by entering "0" or a value outside the rectangle's borders in the spin control.
  • You can still change a rectangle via the spin controls after columns have been painted. This does not affect the columns, except that a column whose coordinate falls outside the rectangle area is deleted.
  • You can also move a rectangle with the mouse (left button held down). In this case, all columns move with it and none are deleted.
  • At any time after a rectangle has been painted, you can parse the text it surrounds by pressing the Get Table button. The current script just prints the table to STDOUT when you do this (see the following example screens). You can press this button repeatedly, e.g. to check the effect of new or deleted columns.
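The rectangle and column rules from the list above can be modelled in a few lines. This is only a sketch of the described behaviour, not code taken from wxTableExtract.py: the class `Selection` and its method names are hypothetical, and treating a column that sits exactly on a border as "outside" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Selection:
    """A painted rectangle plus the x-coordinates of its column lines."""
    x0: float
    y0: float
    x1: float
    y1: float
    columns: list = field(default_factory=list)

    def _prune(self):
        # Keep only columns strictly inside the rectangle (assumption:
        # a column exactly on a border counts as outside).
        self.columns = sorted(c for c in self.columns if self.x0 < c < self.x1)

    def add_column(self, x):
        self.columns.append(x)
        self._prune()

    def set_column(self, index, x):
        # Entering 0 or a value outside the rectangle deletes the column.
        self.columns[index] = x
        self._prune()

    def resize(self, x0, y0, x1, y1):
        # Spin-control change: columns stay put, those now outside are dropped.
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self._prune()

    def move(self, dx, dy):
        # Mouse drag: columns travel with the rectangle, none are deleted.
        self.x0 += dx
        self.x1 += dx
        self.y0 += dy
        self.y1 += dy
        self.columns = [c + dx for c in self.columns]
```

Keeping the pruning in one helper means that every operation that can push a column out of the rectangle applies the same deletion rule.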

Displaying page 253 of Adobe's PDF manual: [image: page 253]

After painting a rectangle around TABLE 4.16 and pressing Get Table, the table's content is displayed using automatic column detection: [image: TABLE 4.16]

After painting additional columns into the rectangle and pressing Get Table again, a slightly different analysis of the table is displayed, based on the column information supplied: [image: TABLE 4.16]
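To illustrate how supplied column positions can shape the result, here is a simplified sketch. It is not the ParseTab implementation and it does not cover automatic column detection. It assumes word tuples of the form `(x0, y0, x1, y1, text, ...)`, as returned by PyMuPDF's "words" text extraction; the function name `words_to_rows` and the parameter `y_tolerance` are hypothetical.

```python
from bisect import bisect_right


def words_to_rows(words, rect, columns, y_tolerance=3.0):
    """Group words inside `rect` into table rows and cells.

    words:   iterable of (x0, y0, x1, y1, text, ...) tuples
    rect:    (x0, y0, x1, y1) of the painted rectangle
    columns: x-coordinates of the column lines inside the rectangle
    """
    rx0, ry0, rx1, ry1 = rect
    borders = sorted(columns)
    # Keep only words that lie completely inside the rectangle.
    inside = [w for w in words
              if rx0 <= w[0] and w[2] <= rx1 and ry0 <= w[1] and w[3] <= ry1]
    inside.sort(key=lambda w: (w[3], w[0]))   # top to bottom, then left to right

    rows, row_y = [], None
    for x0, y0, x1, y1, text, *_ in inside:
        # Start a new row when the bottom edge moves down noticeably.
        if row_y is None or y1 - row_y > y_tolerance:
            rows.append([[] for _ in range(len(borders) + 1)])
            row_y = y1
        # The cell index is the number of column lines left of the word.
        rows[-1][bisect_right(borders, x0)].append((x0, text))

    return [[" ".join(t for _, t in sorted(cell)) for cell in row]
            for row in rows]
```

Adding or removing an entry in `columns` changes the number of cells per row, which mirrors the effect of pressing Get Table again after editing columns.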

Notes

  • ParseTab, and therefore also wxTableExtract, are not OCR programs. They are text extraction programs, and any images will be ignored.

  • If a logical table is physically spread across more than one page of the document, it is up to you to join the parts, using whatever logic you invoke from Get Table.
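One possible way to join the parts of a table that spans several pages is shown below. This is a sketch under assumptions, not part of the script: it supposes that each page's table has been collected as a list of rows (each row a list of cell strings), and the name `join_page_tables` and the `repeated_header` flag are hypothetical.

```python
def join_page_tables(page_tables, repeated_header=True):
    """Concatenate per-page row lists into one logical table.

    page_tables:     list of tables, one per page, each a list of rows
    repeated_header: if True, drop a first row on later pages that is
                     identical to the first row of the joined table
    """
    joined = []
    for rows in page_tables:
        if not rows:
            continue              # nothing was extracted from this page
        if joined and repeated_header and rows[0] == joined[0]:
            rows = rows[1:]       # skip the header repeated on this page
        joined.extend(rows)
    return joined
```

Instead of printing each page's table to STDOUT, the rows could be appended to such a list and joined once the last page has been processed.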
