PDFdeconstruct
PDFdeconstruct™ decomposes PDF files into XML files. The XML
output includes:
- text – Unicode text with font, color, and position
data for each word (or each character)
- images – in PNG, TIFF, or JPEG format
- vector graphics – complete path information for fills
and strokes
- form fields – with field names and values
PDFdeconstruct can be used for:
- document format conversion: convert PDF to other formats
- document analysis: examine the content on a PDF page
- complex content extraction: e.g., use text with position
information as input to further processing
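As an illustration of processing based on text with position information, the following sketch groups words into lines of text by their vertical position. It does not read the PDFdeconstruct XML format (see the manual for that); the Word record, its field names, the tolerance value, and the assumption that y increases down the page are all hypothetical and only stand in for whatever a real parser would produce.

```python
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical record: the real element and attribute names
    # are defined by the XML output format, not by this sketch.
    text: str
    x: float  # left edge of the word
    y: float  # baseline of the word (assumed to grow down the page)


def group_into_lines(words, y_tolerance=2.0):
    """Cluster words with nearly equal baselines, then read each cluster left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # Start a new line when the baseline jumps by more than the tolerance.
        if lines and abs(lines[-1][-1].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


if __name__ == "__main__":
    sample = [
        Word("world", 50.0, 100.5),
        Word("Hello", 10.0, 100.0),
        Word("Next", 10.0, 120.0),
    ]
    for line in group_into_lines(sample):
        print(line)
```

The tolerance absorbs small baseline differences between words on the same line; a suitable value depends on the font sizes in the document.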
The PDFdeconstruct output format is described in the
manual.
PDFdeconstruct is a cross-platform command-line tool, suitable for use
on servers or for batch-mode processing.
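For batch-mode use, a small driver script can run the tool once per PDF file in a directory. The sketch below is one possible approach, not a documented interface: the executable name, any options, and the argument order (input file followed by output location) are assumptions passed in by the caller, so check the manual for the actual command-line syntax.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_dir, out_dir, command_prefix):
    """Return one argument list per PDF file found in pdf_dir.

    command_prefix is the executable plus any options, supplied by the
    caller. The "input file, then output location" order is an assumption.
    """
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = Path(out_dir) / pdf.stem
        commands.append([*command_prefix, str(pdf), str(target)])
    return commands


def run_batch(pdf_dir, out_dir, command_prefix):
    """Run every command and collect the ones that exit with an error."""
    failures = []
    for cmd in build_commands(pdf_dir, out_dir, command_prefix):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.stderr.strip()))
    return failures
```

Collecting failures instead of stopping at the first one lets a long server-side run finish and report all problem files together.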
Supported platforms:
- Windows
- Mac OS X
- Linux
- 32-bit and 64-bit versions available for all platforms
- other platforms: portable C++ source
code for the library is available
See also: For conversion to plain text (instead of XML), try
our XpdfText library.
Contact Glyph & Cog for more
information, including evaluation copies.