Extracting Tables from Documents (GUI)

Iydon Liang edited this page Jul 16, 2021 · 2 revisions

wxTableExtract.py

This script, based on wxPython and PyMuPDF, lets you browse a document and extract tables from it. It uses the function ParseTab, contained in the same (examples) directory.

How it works

The script first presents a file selection dialog for picking a document.

If the document is encrypted, you are asked for a decryption password. The document's first page is then displayed in another dialog. Controls along the top and left sides of this dialog let you do the things listed below. Which controls are available depends on the situation: for example, you can only add a column with New Col after a rectangle has been painted.

You can browse forward and backward in the document using the buttons or the mouse wheel.
You can jump to a specific page.
You can paint a rectangle on a displayed page using the New Rect button and fine-tune it with the spin controls. Pressing the New Rect button again destroys any existing rectangle and its columns. The same happens if you leave the page.
After a rectangle has been painted, you can paint one or more columns into it via the New Col button. Columns are shown as vertical lines. You can modify a column by selecting it in the choice box and using the spin control. While a column is being changed in this way, its colour changes from red to blue. A column can be deleted by entering "0" or a value outside the rectangle's borders in the spin control.
You can still change a rectangle via the spin controls after columns have been painted. This does not affect the columns, except that a column whose coordinate ends up outside the rectangle area is deleted.
You can also move a rectangle around with the mouse (left button held down). In this case, any columns move with it and are not deleted.
Any time after a rectangle has been painted, you can parse the text that it surrounds by pressing the Get Table button. The current script just prints the table to STDOUT when you do this (see the example screens that follow). You can press this button repeatedly, e.g. to check the effect of new or deleted columns.
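
The rectangle and column rules described above can be summarised in a few lines of plain Python. This is only an illustrative sketch, not code taken from wxTableExtract.py: the class name TableArea and its methods are hypothetical, and it assumes that a column lying exactly on the rectangle's border counts as outside.

```python
from dataclasses import dataclass, field


@dataclass
class TableArea:
    """Hypothetical model: a rectangle plus the x-coordinates of its columns."""
    x0: float
    y0: float
    x1: float
    y1: float
    cols: list = field(default_factory=list)

    def _inside(self, x):
        # Assumption: a column on the border itself is treated as outside.
        return self.x0 < x < self.x1

    def set_column(self, index, x):
        """Move one column; entering 0 or a value outside the rectangle deletes it."""
        if x == 0 or not self._inside(x):
            del self.cols[index]
        else:
            self.cols[index] = x

    def resize(self, x0, y0, x1, y1):
        """Change the rectangle; columns keep their place, strays are deleted."""
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.cols = [c for c in self.cols if self._inside(c)]

    def move(self, dx, dy):
        """Drag the rectangle; all columns travel with it, none are deleted."""
        self.x0 += dx
        self.x1 += dx
        self.y0 += dy
        self.y1 += dy
        self.cols = [c + dx for c in self.cols]


area = TableArea(100, 200, 400, 300, cols=[150, 250, 350])
area.resize(200, 200, 400, 300)   # the column at x=150 is now outside
area.move(50, 10)                 # the remaining columns shift by 50
```

Keeping the column list on the rectangle object makes the two behaviours easy to tell apart: resizing filters the list, while moving only shifts it.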

Displaying page 253 of Adobe's PDF manual (screenshot: page 253).

After painting a rectangle around TABLE 4.16 and pressing Get Table, the table's content is displayed using automatic column detection (screenshot: TABLE 4.16).

After painting additional columns into the rectangle and again pressing Get Table, a slightly different analysis of the table is displayed, based on the column information supplied (screenshot: TABLE 4.16).
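
To see why supplied columns change the result, here is a minimal sketch of how words inside a rectangle could be sorted into cells by column coordinates. It is not ParseTab's actual implementation, and it makes no attempt at automatic column detection. The function name words_to_table is hypothetical. The word tuples are assumed to follow the layout PyMuPDF returns from page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no), and words are assumed to belong to the same row when their rounded bottom coordinates match.

```python
from bisect import bisect_right
from collections import defaultdict


def words_to_table(words, rect, cols):
    """Group words lying inside rect into rows and cells.

    words: tuples starting with (x0, y0, x1, y1, text)
    rect:  (x0, y0, x1, y1) of the painted rectangle
    cols:  x-coordinates of the column lines inside rect (may be empty)
    """
    rx0, ry0, rx1, ry1 = rect
    bounds = sorted(cols)
    # n column lines split the rectangle into n + 1 cells per row
    rows = defaultdict(lambda: [[] for _ in range(len(bounds) + 1)])
    for x0, y0, x1, y1, text, *_ in words:
        # skip words not fully contained in the rectangle
        if x0 < rx0 or x1 > rx1 or y0 < ry0 or y1 > ry1:
            continue
        row_key = round(y1)                  # assumption: same bottom = same row
        cell = bisect_right(bounds, x0)      # cell chosen by the word's left edge
        rows[row_key][cell].append((x0, text))
    # rows top to bottom, words within a cell left to right
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in rows[key]]
        for key in sorted(rows)
    ]


sample_words = [
    (100, 10, 130, 20, "Key", 0, 0, 0),
    (200, 10, 240, 20, "Type", 0, 0, 1),
    (100, 30, 120, 40, "S", 0, 1, 0),
    (200, 30, 240, 40, "name", 0, 1, 1),
]
for row in words_to_table(sample_words, (90, 0, 300, 50), [190]):
    print(row)
```

With an empty column list, every row collapses into a single cell, which shows how much the result depends on the column information.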

Notes

ParseTab, and therefore also wxTableExtract, is not an OCR program. Both are text extraction programs, so any images are ignored.
If a logical table is physically spread across more than one page of the document, it is up to you to join the parts, using whatever logic you have Get Table invoke.
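
One possible way to join the parts of a table that spans several pages is sketched below. It is only an illustration of the idea, not part of the script. The function name join_page_tables is hypothetical, and it assumes each page's table is already available as a list of rows (lists of strings), with the header possibly repeated at the top of later pages.

```python
def join_page_tables(parts, header_rows=1):
    """Concatenate per-page tables, keeping the header only once.

    parts:       list of tables, one per page, each a list of rows
    header_rows: number of leading rows that form the header (assumption)
    """
    joined = []
    header = None
    for rows in parts:
        if not rows:
            continue                      # nothing was extracted on this page
        if header is None:
            header = rows[:header_rows]   # the first non-empty part defines the header
            joined.extend(rows)
        elif rows[:header_rows] == header:
            joined.extend(rows[header_rows:])   # drop the repeated header
        else:
            joined.extend(rows)           # continuation page without a header
    return joined


page_1 = [["Key", "Type"], ["S", "name"]]
page_2 = [["Key", "Type"], ["D", "array"]]
full_table = join_page_tables([page_1, page_2])
```

The header comparison is exact, so a header that is extracted slightly differently on a later page would be kept as a data row and need separate handling.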