tabula-java

tabula-java is a library for extracting tables from PDF files — it is the table extraction engine that powers Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.

Download

Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux from our releases page.

Commandline Usage Examples

tabula-java provides a command line application:

$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
       <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs
 -a,--area <AREA>          -a/--area = Portion of the page to analyze.
                           Example: --area 269.875,12.75,790.5,561.
                           Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                           where all values are in points relative to the
                           top left corner. If all values are between
                           0-100 (inclusive) and preceded by '%', input
                           will be taken as % of actual height or width
                           of the page. Example: --area %0,0,100,50. To
                           specify multiple areas, -a option should be
                           repeated. Default is entire page
 -b,--batch <DIRECTORY>    Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>    X coordinates of column boundaries. Example
                           --columns 10.1,20.2,30.3. If all values are
                           between 0-100 (inclusive) and preceded by '%',
                           input will be taken as % of actual width of
                           the page. Example: --columns %25,50,80.6
 -f,--format <FORMAT>      Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                Guess the portion of the page to analyze per
                           page.
 -h,--help                 Print this help text.
 -i,--silent               Suppress all stderr output.
 -l,--lattice              Force PDF to be extracted using lattice-mode
                           extraction (if there are ruling lines
                           separating each cell, as in a PDF of an Excel
                           spreadsheet)
 -n,--no-spreadsheet       [Deprecated in favor of -t/--stream] Force PDF
                           not to be extracted using spreadsheet-style
                           extraction (if there are no ruling lines
                           separating each cell)
 -o,--outfile <OUTFILE>    Write output to <file> instead of STDOUT.
                           Default: -
 -p,--pages <PAGES>        Comma separated list of ranges, or all.
                           Examples: --pages 1-3,5-7, --pages 3 or
                           --pages all. Default is --pages 1
 -r,--spreadsheet          [Deprecated in favor of -l/--lattice] Force
                           PDF to be extracted using spreadsheet-style
                           extraction (if there are ruling lines
                           separating each cell, as in a PDF of an Excel
                           spreadsheet)
 -s,--password <PASSWORD>  Password to decrypt document. Default is empty
 -t,--stream               Force PDF to be extracted using stream-mode
                           extraction (if there are no ruling lines
                           separating each cell)
 -u,--use-line-returns     Use embedded line returns in cells. (Only in
                           spreadsheet mode.)
 -v,--version              Print version and exit.

It also includes a debugging tool. Run java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h to list the available options.
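
The --area syntax described in the help text above (top,left,bottom,right in points, or percentages of the page when prefixed with '%') can be illustrated with a small parser. This is only a sketch of the rule as the help text states it, not tabula-java's own argument handling: the AreaSpec class and its methods are hypothetical names, and rejecting out-of-range percentages is an assumption, since the help text does not say what the tool does in that case.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of tabula-java): parses an --area style value. */
final class AreaSpec {
    final boolean relative;                 // true when the value started with '%'
    final double top, left, bottom, right;  // points, or percentages when relative

    private AreaSpec(boolean relative, double[] v) {
        this.relative = relative;
        this.top = v[0];
        this.left = v[1];
        this.bottom = v[2];
        this.right = v[3];
    }

    /** Parses "top,left,bottom,right", optionally prefixed with '%'. */
    static AreaSpec parse(String spec) {
        boolean relative = spec.startsWith("%");
        String body = relative ? spec.substring(1) : spec;
        String[] parts = body.split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 comma-separated values: " + spec);
        }
        double[] v = Arrays.stream(parts)
                .mapToDouble(p -> Double.parseDouble(p.trim()))
                .toArray();
        if (relative) {
            // Assumption: percentages outside 0-100 are treated as an error here.
            for (double d : v) {
                if (d < 0 || d > 100) {
                    throw new IllegalArgumentException("percentage out of range: " + d);
                }
            }
        }
        return new AreaSpec(relative, v);
    }

    /** Resolves to absolute points {top, left, bottom, right} for a page of the given size. */
    double[] toPoints(double pageWidth, double pageHeight) {
        if (!relative) {
            return new double[] {top, left, bottom, right};
        }
        return new double[] {
            top / 100.0 * pageHeight,
            left / 100.0 * pageWidth,
            bottom / 100.0 * pageHeight,
            right / 100.0 * pageWidth
        };
    }

    public static void main(String[] args) {
        // Left half of a US Letter page (612 x 792 points).
        AreaSpec leftHalf = AreaSpec.parse("%0,0,100,50");
        System.out.println(Arrays.toString(leftHalf.toPoints(612, 792)));
        // prints [0.0, 0.0, 792.0, 306.0]
    }
}
```

Vertical values (top, bottom) scale with the page height and horizontal values (left, right) with the page width, which matches the "% of actual height or width" wording in the help text.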

You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.

JVM start-up time accounts for much of the cost of running the tabula command, so if you need to extract tables from many PDFs, you have a few options for speeding things up:

the -b option, which converts all PDFs in a given directory
the drip utility
the Ruby, Python, R, and Node.js bindings
writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java
waiting for us to implement an API/server-style system (it's on the roadmap)
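
As an illustration of the first option, the sketch below assembles a single batch invocation so that one JVM start covers a whole directory. It uses only the -b and -f flags documented in the help text; the TabulaBatch class, the "pdfs" directory, and the jar path are placeholders chosen for the example, and the call that actually launches the process is left commented out.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper (not part of tabula-java) around the documented -b option. */
final class TabulaBatch {

    /** Builds: java -jar <jarPath> -b <directory> -f <format> */
    static List<String> buildCommand(String jarPath, String directory, String format) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-b");        // convert all .pdfs in the provided directory
        cmd.add(directory);
        cmd.add("-f");        // output format: CSV, TSV or JSON
        cmd.add(format);
        return cmd;
    }

    /** Runs the command, forwarding its stdout/stderr, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder jar path and directory; adjust to your setup.
        List<String> cmd = buildCommand(
                "target/tabula-1.0.5-jar-with-dependencies.jar", "pdfs", "CSV");
        System.out.println(String.join(" ", cmd));
        // int exitCode = run(cmd);  // uncomment once the jar and directory exist
    }
}
```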

API Usage Examples

A simple Java code example which extracts all rows and cells from all tables of all pages of a PDF document:

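// Imports needed by this snippet
import java.io.InputStream;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// Place the code below in an instance method that declares "throws IOException".
InputStream in = this.getClass().getResourceAsStream("my.pdf");
try (PDDocument document = PDDocument.load(in)) {
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
    PageIterator pi = new ObjectExtractor(document).extract();
    while (pi.hasNext()) {
        // iterate over the pages of the document
        Page page = pi.next();
        List<Table> tables = sea.extract(page);
        // iterate over the tables of the page
        for (Table table : tables) {
            List<List<RectangularTextContainer>> rows = table.getRows();
            // iterate over the rows of the table
            for (List<RectangularTextContainer> cells : rows) {
                // print all column-cells of the row plus linefeed
                for (RectangularTextContainer content : cells) {
                    // Note: Cell.getText() uses \r to concat text chunks
                    String text = content.getText().replace("\r", " ");
                    System.out.print(text + "|");
                }
                System.out.println();
            }
        }
    }
}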
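
The loop above prints each cell followed by '|'. If comma-separated output is wanted from the API route instead, the cell texts need quoting as well as the \r replacement. The following is a minimal sketch under that assumption, using only the standard library; CellFormat and its methods are hypothetical helpers, not part of tabula-java, and they take plain strings such as the values returned by getText().

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of tabula-java): formats cell texts as one CSV line. */
final class CellFormat {

    /** Replaces the \r separators between text chunks with spaces and trims the result. */
    static String clean(String cellText) {
        return cellText.replace("\r", " ").trim();
    }

    /** Quotes a field that contains a comma, a double quote, or a newline. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Cleans and escapes every cell text, then joins them with commas. */
    static String toCsvLine(List<String> cellTexts) {
        return cellTexts.stream()
                .map(CellFormat::clean)
                .map(CellFormat::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(Arrays.asList("Total\rsales", "1,200", "plain")));
        // prints Total sales,"1,200",plain
    }
}
```

In the extraction loop, the texts of one row could be collected into a List<String> and passed to toCsvLine in place of the System.out.print calls.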

For more detailed information, check the Javadoc. The Javadoc API documentation can be generated (see also the 'Building from Source' section) via

mvn javadoc:javadoc

which writes the HTML files to the target/site/apidocs/ directory.

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request.
Spreading the word about tabula-java to people who might be able to benefit from using it.

Backers

You can also support our continued work on tabula-java with a one-time or monthly donation on OpenCollective. Organizations that use tabula-java can also sponsor the project for acknowledgement on our official site and in this README.

Special thanks to the following users and organizations for generously supporting Tabula with donations and grants:

The John S. and James L. Knight Foundation
The Shuttleworth Foundation
