Automating extract–transform–load (ETL) pipelines for scanned business documents typically demands costly, fine-tuned, layout-aware models. We present a cloud-native architecture that transforms heterogeneous documents into a unified, structured JSON schema without any model fine-tuning. Our pipeline combines off-the-shelf OCR (Azure Document Intelligence) with a schema-constrained large language model (LLM), guided by type-checked Pydantic outputs and a one-pass swap heuristic for efficient few-shot prompting. Evaluated on the FUNSD (form) and CORD (receipt) corpora, the system achieves fuzzy KV F1 scores of 0.60 and 0.83, respectively, while processing each page in under eight seconds at under $0.004 on standard cloud quota. Scaling to a larger LLM boosts CORD accuracy to 0.89 F1 at under $0.02 per page. The entire pipeline (code, prompts, and metric scripts) is open-sourced, enabling lightweight, fully deployable semantic ETL for small- to medium-scale workloads.
Structura/
├── docs/
│   ├── Few-Shot Optimization Pipeline.pdf
│   └── System Architecture Diagram.pdf
├── paper/
├── src/
│   ├── benchmark.py
│   ├── clients.py
│   ├── inference.py
│   ├── main.py
│   ├── metrics.py
│   ├── optimizer.py
│   ├── schemas.py
│   ├── system_prompt.py
│   ├── benchmarks/
│   ├── datasets/
│   │   ├── cord/
│   │   └── funsd/
│   └── prompts/
│       ├── cord/
│       └── funsd/
├── LICENSE
├── README.MD
└── requirements.txt
- Python 3.10+ is recommended.
- Create a virtual environment and install dependencies:
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
- Configure environment variables. Copy the provided template and fill in values from your Azure resources:
    cp .env.template .env
Environment keys consumed by the code:
# Document Intelligence Endpoint
AZUREDOCINTEL_BASE_URI: Azure Document Intelligence endpoint
AZUREDOCINTEL_TOKEN: Azure Document Intelligence API key
# Azure OpenAI Endpoint
AZUREOPENAI_BASE_URI: Azure OpenAI endpoint
AZUREOPENAI_API_TOKEN: Azure OpenAI API key
AZUREOPENAI_API_VERSION: Azure OpenAI API version
AZUREOPENAI_MODEL_NAME: Deployed Azure OpenAI model name
The library loads variables via dotenv at import time (see src/clients.py).
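Because the variables are read at import time, a missing key may only surface as an error deep inside a client call. The following is a minimal sketch of a pre-flight check, not code from the repository: the helper name `missing_env_keys` is hypothetical, and only the key names are taken from the list above.

```python
import os

# Key names as listed above; the helper itself is a hypothetical illustration.
REQUIRED_KEYS = (
    "AZUREDOCINTEL_BASE_URI",
    "AZUREDOCINTEL_TOKEN",
    "AZUREOPENAI_BASE_URI",
    "AZUREOPENAI_API_TOKEN",
    "AZUREOPENAI_API_VERSION",
    "AZUREOPENAI_MODEL_NAME",
)


def missing_env_keys(env=None):
    """Return the required keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_env_keys()` after the `.env` file has been loaded (for example with `load_dotenv()` from python-dotenv) returns an empty list when every setting is present.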
Sample datasets are included under src/datasets/:
- cord/: CORD receipt corpus (images and JSON annotations)
- funsd/: FUNSD form corpus (images and JSON annotations)
Each dataset has images/ and annotations/ directories. Filenames (without extension) align between image and JSON.
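The filename alignment makes it straightforward to pair each image with its annotation. The sketch below is one way to do that with `pathlib`; the function name `pair_samples` is hypothetical, and the sketch assumes annotations carry a `.json` extension while images may use any extension.

```python
from pathlib import Path


def pair_samples(dataset_dir):
    """Map each shared file stem to its (image_path, annotation_path)."""
    root = Path(dataset_dir)
    images = {p.stem: p for p in (root / "images").iterdir() if p.is_file()}
    annotations = {p.stem: p for p in (root / "annotations").glob("*.json")}
    # Keep only samples that have both an image and an annotation.
    shared = sorted(images.keys() & annotations.keys())
    return {stem: (images[stem], annotations[stem]) for stem in shared}
```

For instance, `pair_samples("src/datasets/cord")` would yield one entry per sample ID, skipping any image that lacks a matching annotation file.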
The default entrypoint runs CORD with gpt-4o-mini, generates few-shot exemplars, evaluates, and iteratively improves the exemplar set with a one-pass swap heuristic.
python src/main.py
Artifacts are written to src/benchmarks/ as JSON plus failure reports for timeouts/errors.
Adjust high-level settings in src/main.py:
- dataset_name: one of cord or funsd
- schema: a Pydantic schema from src/schemas.py (e.g., CORDSchema)
- model: your Azure OpenAI deployment name (e.g., gpt-4o-mini)
- fewshot_count, fewshot_z_swap, max_test_size: exemplar and evaluation sizes
from src.inference import get_response
from src.schemas import CORDSchema
from src.system_prompt import get_system_prompt

# Build the system prompt with few-shot exemplars from the given training IDs.
system_prompt = get_system_prompt(
    train_set=["075", "153"],
    dataset="cord",
    overwrite=False,
    use_fewshot=True,
)

# Run OCR and the schema-constrained LLM call on a single image.
ocr_text, ocr_ms, llm_json, llm_ms = get_response(
    system_prompt=system_prompt,
    pydantic_schema=CORDSchema,
    model_name="gpt-4o-mini",
    file_path="src/datasets/cord/images/000.png",
    temperature=0.3,
)
print(llm_json)
src/benchmark.py computes metrics and aggregates results. Metrics include fuzzy and exact KV F1, canonical F1 (Hungarian alignment), value quality, and confusion statistics (see src/metrics.py).
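To make the fuzzy KV F1 idea concrete, here is a minimal sketch that should not be read as the logic in src/metrics.py. It assumes that a predicted (key, value) pair counts as correct when both strings are similar enough to an unmatched gold pair, uses `difflib.SequenceMatcher` as the similarity measure, and picks an arbitrary threshold of 0.8. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True when two strings are close enough under difflib's ratio."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def fuzzy_kv_f1(predicted, gold, threshold=0.8):
    """Greedy one-to-one fuzzy matching of (key, value) pairs, then F1."""
    unmatched = list(gold)
    true_positives = 0
    for p_key, p_val in predicted:
        for i, (g_key, g_val) in enumerate(unmatched):
            if similar(p_key, g_key, threshold) and similar(p_val, g_val, threshold):
                true_positives += 1
                del unmatched[i]  # each gold pair can be matched only once
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Exact KV F1 corresponds to the same computation with string equality in place of `similar`. Greedy matching is the simplest choice here; an optimal assignment such as the Hungarian alignment mentioned above can give a different score when several pairs are near-duplicates.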
Core execution path:
- OCR via Azure Document Intelligence prebuilt-layout with KEY_VALUE_PAIRS (src/inference.get_docintel_result).
- Prompt construction from dataset templates plus generated exemplars (src/system_prompt.py).
- LLM call through Azure OpenAI with instructor for schema-constrained Pydantic outputs (src/inference.get_instructor_response).
- Structured JSON validation by Pydantic models in src/schemas.py.
- Parallelization with thread pools for OCR and LLM, bounded connection pools, and lightweight rate limiting for OCR posts (src/inference.py, src/benchmark.py).
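The parallelization step can be illustrated with the standard library alone. This is a sketch under stated assumptions rather than the repository's code: it uses a single thread pool instead of separate OCR and LLM pools, the `ocr` and `llm` arguments are stand-in callables, and `MinIntervalLimiter` and `run_pages` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Space out calls so that at most one starts per `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        time.sleep(max(0.0, slot - now))


def run_pages(pages, ocr, llm, max_workers=4, ocr_interval=0.1):
    """Run OCR, then the LLM step, for each page on a bounded thread pool."""
    limiter = MinIntervalLimiter(ocr_interval)

    def process(page):
        limiter.wait()  # throttle only the OCR submission
        return llm(ocr(page))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, pages))  # results keep input order
```

Threads suit this workload because both stages spend most of their time waiting on network responses, so `max_workers` mainly bounds the number of in-flight requests against the cloud quota.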
Few-Shot Optimization Pipeline
This repository implements a one-pass swap heuristic (see src/optimizer.py):
- Select an initial exemplar set and a disjoint test set (src/system_prompt.get_random_train_set).
- Evaluate training exemplars without few-shot examples to estimate individual utility.
- Generate a test-time system prompt with few-shot examples and evaluate on the test set.
- Swap out the best-performing training exemplars for the worst-performing test samples (z-swap).
- Iterate until the train/test sets stabilize or the iteration budget is reached.
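The steps above can be sketched as follows. This is an illustration, not the contents of src/optimizer.py: the function names and the two scoring callables are hypothetical, and the rule that a pair is swapped only when the test sample scores strictly lower than the exemplar is an assumption added here so that the loop has a natural stopping point.

```python
def z_swap(train_scores, test_scores, z):
    """One swap pass over two {sample_id: score} mappings.

    train_scores: each exemplar's score without few-shot examples.
    test_scores: each test sample's score under the current few-shot prompt.
    Returns the new (train_ids, test_ids).
    """
    # Exemplars the model already handles well carry the least information.
    leaving = sorted(train_scores, key=train_scores.get, reverse=True)[:z]
    # Test samples the model handles worst are the most informative candidates.
    entering = sorted(test_scores, key=test_scores.get)[:z]
    # Assumption: swap a pair only if the test sample scores strictly lower.
    pairs = [(out, new) for out, new in zip(leaving, entering)
             if test_scores[new] < train_scores[out]]
    out_ids = {out for out, _ in pairs}
    new_ids = {new for _, new in pairs}
    train_ids = [s for s in train_scores if s not in out_ids] + [n for _, n in pairs]
    test_ids = [s for s in test_scores if s not in new_ids] + [o for o, _ in pairs]
    return train_ids, test_ids


def optimize(train_ids, test_ids, score_zero_shot, score_few_shot, z=1, max_iters=3):
    """Repeat the swap until the sets stop changing or the budget runs out."""
    for _ in range(max_iters):
        train_scores = {s: score_zero_shot(s) for s in train_ids}
        test_scores = {s: score_few_shot(train_ids, s) for s in test_ids}
        new_train, new_test = z_swap(train_scores, test_scores, z)
        if set(new_train) == set(train_ids):
            break  # no swap happened, so the sets have stabilized
        train_ids, test_ids = new_train, new_test
    return train_ids, test_ids
```

In practice each scoring callable would wrap a full OCR and LLM evaluation, so the iteration budget directly bounds the cost of the search.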
The few-shot examples are materialized into src/prompts/<dataset>/fewshot_examples.txt and combined with src/prompts/<dataset>/prompt.txt.
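As a rough illustration of how the two files might be combined, the sketch below simply appends the exemplars to the base prompt. The function name `build_system_prompt`, the blank-line separator, and the ordering are all assumptions; the actual assembly lives in src/system_prompt.py and may differ.

```python
from pathlib import Path


def build_system_prompt(prompts_dir, dataset, use_fewshot=True):
    """Concatenate prompt.txt with fewshot_examples.txt for one dataset."""
    base = Path(prompts_dir) / dataset
    prompt = (base / "prompt.txt").read_text(encoding="utf-8").strip()
    examples_path = base / "fewshot_examples.txt"
    if use_fewshot and examples_path.exists():
        examples = examples_path.read_text(encoding="utf-8").strip()
        # Assumption: exemplars are appended after the base instructions.
        prompt = f"{prompt}\n\n{examples}"
    return prompt
```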
- FUNSD (forms), fuzzy KV F1: 0.60
- CORD (receipts), fuzzy KV F1: 0.83
- Larger LLM on CORD: 0.89
- Throughput: <8 seconds per page on standard cloud quota
- Cost: <$0.004 per page with the small model; <$0.02 with a larger model
Empirical outputs for this repository appear under src/benchmarks/.
Key tunables (edit in code):
- src/main.py: dataset/model selection, exemplar counts, temperature
- src/benchmark.py: parallelism, timeouts/retries, output file naming
- src/metrics.py: thresholds for fuzzy matching and canonical alignment
@article{gupta2025llm,
author = {Gupta, Shreyan},
title = {An LLM-Based ETL Architecture for Semantic
Normalization of Unstructured Data},
url = {https://doi.org/10.5281/zenodo.16786494},
year = 2025,
doi = {10.5281/zenodo.16786494},
version = {v1},
journal = {Preprint submitted to IEEE MIT Undergraduate
Research Technology Conference (URTC) 2025},
}
MIT License. See LICENSE.