Document AI Toolbox client libraries
This page shows how to get started with the Cloud Client Libraries for the Document AI Toolbox API. Client libraries make it easier to access Google Cloud APIs from a supported language. Although you can use Google Cloud APIs directly by making raw requests to the server, client libraries provide simplifications that significantly reduce the amount of code you need to write.
Read more about the Cloud Client Libraries and the older Google API Client Libraries in Client libraries explained.
Install the client library
Python
pip install --upgrade google-cloud-documentai-toolbox
For more information, see Setting Up a Python Development Environment.
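To confirm the installation succeeded, you can import the package from the command line; this is a quick local check, not part of the official setup steps:

python -c "import google.cloud.documentai_toolbox; print('Document AI Toolbox installed')"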
Set up authentication
To authenticate calls to Google Cloud APIs, client libraries support Application Default Credentials (ADC); the libraries look for credentials in a set of defined locations and use those credentials to authenticate requests to the API. With ADC, you can make credentials available to your application in a variety of environments, such as local development or production, without needing to modify your application code. For production environments, the way you set up ADC depends on the service and context. For more information, see Set up Application Default Credentials.
For a local development environment, you can set up ADC with the credentials that are associated with your Google Account:
- Install the Google Cloud CLI. After installation, initialize the Google Cloud CLI by running the following command:
gcloud init
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- If you're using a local shell, then create local authentication credentials for your user account:
gcloud auth application-default login
You don't need to do this if you're using Cloud Shell.
If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
A sign-in screen appears. After you sign in, your credentials are stored in the local credential file used by ADC.
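Once ADC is configured, the client libraries pick up credentials automatically; no credential-handling code is needed. As a sanity check, this minimal sketch (assuming the google-auth package, which the client library installs as a dependency) resolves the active credentials and project:

import google.auth

# Resolves credentials using the ADC search order
# (environment variable, gcloud credentials, metadata server).
credentials, project_id = google.auth.default()
print(f"ADC resolved for project: {project_id}")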
Use the client library
Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from a processed document response, whether that response comes from JSON files in Cloud Storage, local JSON files, or directly from the process_document() method.
It can perform the following actions:
- Combine fragmented Document JSON files from Batch Processing into a single "wrapped" document.
- Export shards as a unified Document.
- Get Document output from:
  - JSON files in Cloud Storage
  - Local JSON files
  - Output directly from the process_document() method
- Access text from Pages, Lines, Paragraphs, FormFields, and Tables without handling Layout information.
- Search for Pages containing a target string or matching a regular expression (see the search sketch after this list).
- Search for FormFields by name.
- Search for Entities by type.
- Convert Tables to a Pandas DataFrame or CSV.
- Insert Entities and FormFields into a BigQuery table.
- Split a PDF file based on output from a Splitter/Classifier processor.
- Extract image Entities from Document bounding boxes.
- Convert Documents to and from commonly used formats:
  - Cloud Vision API AnnotateFileResponse
  - hOCR
  - Third-party document processing formats
- Create batches of documents for processing from a Cloud Storage folder.
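The search helpers in the list above don't appear in the samples below, so here is a minimal sketch of how they might be called, assuming a wrapped_document loaded as in the quickstart and the search_pages, get_form_field_by_name, and get_entity_by_type methods of the Toolbox Document wrapper (check the API reference for the exact signatures):

# Assumes `wrapped_document` is a documentai_toolbox document.Document,
# loaded as shown in the quickstart sample below.

# Pages containing a literal target string
matching_pages = wrapped_document.search_pages(target_string="Invoice Total")

# Pages matching a regular expression (here, a US-style date)
date_pages = wrapped_document.search_pages(pattern=r"\d{2}/\d{2}/\d{4}")

# Form fields by name and entities by type
name_fields = wrapped_document.get_form_field_by_name(target_field="Name")
address_entities = wrapped_document.get_entity_by_type(target_type="address")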
Code samples
The following code samples demonstrate how to use Document AI Toolbox.
Quickstart
from typing import Optional

from google.cloud import documentai
from google.cloud.documentai_toolbox import document, gcs_utilities

# TODO(developer): Uncomment these variables before running the sample.
# Given a Document JSON or sharded Document JSON in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"

# Or, given a Document JSON in path gs://bucket/path/to/folder/document.json
# gcs_uri = "gs://bucket/path/to/folder/document.json"

# Or, given a Document JSON in path local/path/to/folder/document.json
# document_path = "local/path/to/folder/document.json"

# Or, given a Document object from Document AI
# documentai_document = documentai.Document()

# Or, given a BatchProcessMetadata object from Document AI
# operation = client.batch_process_documents(request)
# operation.result(timeout=timeout)
# batch_process_metadata = documentai.BatchProcessMetadata(operation.metadata)

# Or, given a BatchProcessOperation name from Document AI
# batch_process_operation = "projects/project_id/locations/location/operations/operation_id"


def quickstart_sample(
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix: Optional[str] = None,
    gcs_uri: Optional[str] = None,
    document_path: Optional[str] = None,
    documentai_document: Optional[documentai.Document] = None,
    batch_process_metadata: Optional[documentai.BatchProcessMetadata] = None,
    batch_process_operation: Optional[str] = None,
) -> document.Document:
    if gcs_bucket_name and gcs_prefix:
        # Load from Google Cloud Storage Directory
        print("Document structure in Cloud Storage")
        gcs_utilities.print_gcs_document_tree(
            gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
        )

        wrapped_document = document.Document.from_gcs(
            gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
        )
    elif gcs_uri:
        # Load a single Document from a Google Cloud Storage URI
        wrapped_document = document.Document.from_gcs_uri(gcs_uri=gcs_uri)
    elif document_path:
        # Load from local `Document` JSON file
        wrapped_document = document.Document.from_document_path(document_path)
    elif documentai_document:
        # Load from `documentai.Document` object
        wrapped_document = document.Document.from_documentai_document(
            documentai_document
        )
    elif batch_process_metadata:
        # Load Documents from `BatchProcessMetadata` object
        wrapped_documents = document.Document.from_batch_process_metadata(
            metadata=batch_process_metadata
        )
        wrapped_document = wrapped_documents[0]
    elif batch_process_operation:
        wrapped_documents = document.Document.from_batch_process_operation(
            location="us", operation_name=batch_process_operation
        )
        wrapped_document = wrapped_documents[0]
    else:
        raise ValueError("No document source provided.")

    # For all properties and methods, refer to:
    # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document

    print("Document Successfully Loaded!")
    print(f"\t Number of Pages: {len(wrapped_document.pages)}")
    print(f"\t Number of Entities: {len(wrapped_document.entities)}")

    for page in wrapped_document.pages:
        print(f"Page {page.page_number}")

        for block in page.blocks:
            print(block.text)

        for paragraph in page.paragraphs:
            print(paragraph.text)

        for line in page.lines:
            print(line.text)

        for token in page.tokens:
            print(token.text)

        # Only supported with Form Parser processor
        # https://cloud.google.com/document-ai/docs/form-parser
        for form_field in page.form_fields:
            print(f"{form_field.field_name} : {form_field.field_value}")

        # Only supported with Enterprise Document OCR version `pretrained-ocr-v2.0-2023-06-02`
        # https://cloud.google.com/document-ai/docs/process-documents-ocr#enable_symbols
        for symbol in page.symbols:
            print(symbol.text)

        # Only supported with Enterprise Document OCR version `pretrained-ocr-v2.0-2023-06-02`
        # https://cloud.google.com/document-ai/docs/process-documents-ocr#math_ocr
        for math_formula in page.math_formulas:
            print(math_formula.text)

    # Only supported with Entity Extraction processors
    # https://cloud.google.com/document-ai/docs/processors-list
    for entity in wrapped_document.entities:
        print(f"{entity.type_} : {entity.mention_text}")
        if entity.normalized_text:
            print(f"\tNormalized Text: {entity.normalized_text}")

    # Only supported with Layout Parser
    for chunk in wrapped_document.chunks:
        print(f"Chunk {chunk.chunk_id}: {chunk.content}")

    for block in wrapped_document.document_layout_blocks:
        print(f"Document Layout Block {block.block_id}")

        if block.text_block:
            print(f"{block.text_block.type_}: {block.text_block.text}")
        if block.list_block:
            print(f"{block.list_block.type_}: {block.list_block.list_entries}")
        if block.table_block:
            print(block.table_block.header_rows, block.table_block.body_rows)

    return wrapped_document
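For example, to load a Document JSON stored on disk, you could call the sample with a local path (the path here comes from the commented variables above and is illustrative):

# Hypothetical invocation with a local Document JSON file
wrapped_document = quickstart_sample(document_path="local/path/to/folder/document.json")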
Tables
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto in path
# document_path = "path/to/local/document.json"
# output_file_prefix = "output/table"


def table_sample(document_path: str, output_file_prefix: str) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    print("Tables in Document")
    for page in wrapped_document.pages:
        for table_index, table in enumerate(page.tables):
            # Convert table to Pandas Dataframe
            # Refer to https://pandas.pydata.org/docs/reference/frame.html for all supported methods
            df = table.to_dataframe()
            print(df)

            output_filename = f"{output_file_prefix}-{page.page_number}-{table_index}"

            # Write Dataframe to CSV file
            df.to_csv(f"{output_filename}.csv", index=False)

            # Write Dataframe to HTML file
            df.to_html(f"{output_filename}.html", index=False)

            # Write Dataframe to Markdown file
            df.to_markdown(f"{output_filename}.md", index=False)
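Note that pandas implements to_markdown() through the optional tabulate package, so you may need to pip install tabulate for the Markdown export to work. A hypothetical invocation, reusing the sample paths from the comments above:

# Hypothetical invocation; paths are illustrative
table_sample(document_path="path/to/local/document.json", output_file_prefix="output/table")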
BigQuery export
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# dataset_name = "test_dataset"
# table_name = "test_table"
# project_id = "YOUR_PROJECT_ID"


def entities_to_bigquery_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
    dataset_name: str,
    table_name: str,
    project_id: str,
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    job = wrapped_document.entities_to_bigquery(
        dataset_name=dataset_name, table_name=table_name, project_id=project_id
    )

    # Also supported:
    # job = wrapped_document.form_fields_to_bigquery(
    #     dataset_name=dataset_name, table_name=table_name, project_id=project_id
    # )

    print("Document entities loaded into BigQuery")
    print(f"Job ID: {job.job_id}")
    print(f"Table: {job.destination.path}")
PDF split
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto from a splitter/classifier in path
# document_path = "path/to/local/document.json"
# pdf_path = "path/to/local/document.pdf"
# output_path = "resources/output/"


def split_pdf_sample(document_path: str, pdf_path: str, output_path: str) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    output_files = wrapped_document.split_pdf(
        pdf_path=pdf_path, output_path=output_path
    )

    print("Document Successfully Split")
    for output_file in output_files:
        print(output_file)
Image extraction
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto from an identity processor in path
# document_path = "path/to/local/document.json"
# output_path = "resources/output/"
# output_file_prefix = "exported_photo"
# output_file_extension = "png"


def export_images_sample(
    document_path: str,
    output_path: str,
    output_file_prefix: str,
    output_file_extension: str,
) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    output_files = wrapped_document.export_images(
        output_path=output_path,
        output_file_prefix=output_file_prefix,
        output_file_extension=output_file_extension,
    )

    print("Images Successfully Exported")
    for output_file in output_files:
        print(output_file)
Vision conversion
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"


def convert_document_to_vision_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    # Converting wrapped_document to vision AnnotateFileResponse
    annotate_file_response = (
        wrapped_document.convert_document_to_annotate_file_response()
    )

    print("Document converted to AnnotateFileResponse!")
    print(
        f"Number of Pages : {len(annotate_file_response.responses[0].full_text_annotation.pages)}"
    )
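Beyond the page count, the converted response exposes the standard Cloud Vision API structure; for example, the full OCR text lives on the full_text_annotation field:

# Print the full OCR text from the converted Vision response
print(annotate_file_response.responses[0].full_text_annotation.text)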
hOCR conversion
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto in path
# document_path = "path/to/local/document.json"
# document_title = "your-document-title"


def convert_document_to_hocr_sample(document_path: str, document_title: str) -> str:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    # Converting wrapped_document to hOCR format
    hocr_string = wrapped_document.export_hocr_str(title=document_title)

    print("Document converted to hOCR!")
    return hocr_string
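Since the sample returns the hOCR document as a string, persisting it is straightforward; the output filename below is illustrative:

# Hypothetical usage: write the hOCR output to disk
hocr_string = convert_document_to_hocr_sample(
    "path/to/local/document.json", "your-document-title"
)
with open("document.hocr", "w", encoding="utf-8") as f:
    f.write(hocr_string)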
Third-party conversion
from google.cloud.documentai_toolbox import converter

# TODO(developer): Uncomment these variables before running the sample.
# This sample will convert external annotations to the Document.json format used by Document AI Workbench for training.
# To process this, the external annotation must have these types of objects:
# 1) Type
# 2) Text
# 3) Bounding Box (bounding boxes must be 1 of the 3 optional types)
#
# This is the bare minimum requirement to convert the annotations, but for better accuracy you will also need:
# 1) Document width & height
#
# Bounding Box Types:
# Type 1:
#     bounding_box: [{"x": 1, "y": 2}, {"x": 2, "y": 2}, {"x": 2, "y": 3}, {"x": 1, "y": 3}]
# Type 2:
#     bounding_box: {"Width": 1, "Height": 1, "Left": 1, "Top": 1}
# Type 3:
#     bounding_box: [1, 2, 2, 2, 2, 3, 1, 3]
#
# Note: If these types are not sufficient, you can propose a feature request or contribute the new type and conversion functionality.
#
# Given a folder in gcs_input_path with the following structure:
#
# gs://path/to/input/folder
# ├── test_annotations.json
# ├── test_config.json
# └── test.pdf
#
# An example of the config is in sample-converter-configs/Azure/form-config.json
#
# project_id = "YOUR_PROJECT_ID"
# location = "us"
# processor_id = "my_processor_id"
# gcs_input_path = "gs://path/to/input/folder"
# gcs_output_path = "gs://path/to/output/folder"


def convert_external_annotations_sample(
    location: str,
    processor_id: str,
    project_id: str,
    gcs_input_path: str,
    gcs_output_path: str,
) -> None:
    converter.convert_from_config(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_path=gcs_input_path,
        gcs_output_path=gcs_output_path,
    )
Document batches
from google.cloud import documentai
from google.cloud.documentai_toolbox import gcs_utilities

# TODO(developer): Uncomment these variables before running the sample.
# Given unprocessed documents in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# batch_size = 50


def create_batches_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
    batch_size: int = 50,
) -> None:
    # Creating batches of documents for processing
    batches = gcs_utilities.create_batches(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix, batch_size=batch_size
    )

    print(f"{len(batches)} batch(es) created.")
    for batch in batches:
        print(f"{len(batch.gcs_documents.documents)} files in batch.")
        print(batch.gcs_documents.documents)

        # Use as input for batch_process_documents()
        # Refer to https://cloud.google.com/document-ai/docs/send-request
        # for how to send a batch processing request
        request = documentai.BatchProcessRequest(
            name="processor_name", input_documents=batch
        )
        print(request)
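The sample only prints the request. To actually submit one, you would pass it to a processor client, with a real processor resource name in place of "processor_name" and a document_output_config pointing at a Cloud Storage output location; this sketch shows the shape of that call under those assumptions:

# Sketch: submitting a batch request. Assumes `request` has a real processor
# resource name and a document_output_config for the results location.
client = documentai.DocumentProcessorServiceClient()
operation = client.batch_process_documents(request=request)
operation.result(timeout=300)  # wait for the long-running operation to finish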
Merge Document shards
from google.cloud import documentai
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# output_file_name = "path/to/folder/file.json"


def merge_document_shards_sample(
    gcs_bucket_name: str, gcs_prefix: str, output_file_name: str
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    merged_document = wrapped_document.to_merged_documentai_document()

    with open(output_file_name, "w") as f:
        f.write(documentai.Document.to_json(merged_document))

    print(f"Document with {len(wrapped_document.shards)} shards successfully merged.")
Additional resources
Python
The following list contains links to more resources related to the client library for Python: