Document Text Extractor | مستخرج نصوص المستندات | دستاویز متن ایکسٹریکٹر

A powerful tool that extracts text from images and PDFs using Amazon Bedrock's Claude AI models, with specialized formatting for financial documents.

Features

Text Extraction: Extract text from images (JPG, JPEG, PNG) and PDF documents
Multi-format Output: Save extracted text in multiple formats:
- Plain text (TXT)
- Word document (DOCX)
- Excel spreadsheet (XLSX)
Financial Document Intelligence: Automatically detects and formats financial documents like invoices, receipts, and bills
Arabic/Urdu Support: Includes automatic installation of Arabic and Urdu fonts for proper text rendering
Customizable: Configure AWS region, profile, model, and output formats

Prerequisites

AWS Account with access to Amazon Bedrock
AWS CLI configured with appropriate credentials
Python 3.6+

Required Python packages:

pip install boto3 openpyxl fpdf python-docx requests

Installation

Clone this repository or download the script

Install the required dependencies:

pip install boto3 openpyxl fpdf python-docx requests

Ensure your AWS CLI is configured with appropriate credentials and permissions for Amazon Bedrock
(Optional) Run the font installer script to pre-install required fonts:
```
python install-fonts.py
```

Usage

Basic usage:

python image-extractor-fixed.py /path/to/your/document.jpg

Advanced options:

python image-extractor-fixed.py /path/to/your/document.pdf --region us-east-1 --profile my-aws-profile --model anthropic.claude-3-5-sonnet-20240620-v1:0 --formats txt,docx,xlsx

Command Line Arguments

file_path: Path to the image or PDF file (required)
--region: AWS region for Bedrock (default: us-west-2)
--profile: AWS profile name (default: default)
--model: Bedrock model ID (default: anthropic.claude-3-5-sonnet-20240620-v1:0)
--formats: Output formats as comma-separated list (default: txt,docx,xlsx)
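
The arguments above map naturally onto `argparse`. The following is a minimal sketch of such a parser, using the defaults listed in this section; the helper names `build_parser` and `parse_formats` are illustrative and may not match the script's own code.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Extract text from an image or PDF using Amazon Bedrock."
    )
    parser.add_argument("file_path", help="Path to the image or PDF file")
    parser.add_argument("--region", default="us-west-2",
                        help="AWS region for Bedrock")
    parser.add_argument("--profile", default="default",
                        help="AWS profile name")
    parser.add_argument("--model",
                        default="anthropic.claude-3-5-sonnet-20240620-v1:0",
                        help="Bedrock model ID")
    parser.add_argument("--formats", default="txt,docx,xlsx",
                        help="Output formats as a comma-separated list")
    return parser


def parse_formats(value):
    """Split 'txt, DOCX' into ['txt', 'docx'], dropping empty entries."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]
```

Calling `build_parser().parse_args()` in a script's entry point would read the values from the command line; `parse_formats(args.formats)` then yields the list of formats to write.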

Font Installation

The tool includes automatic font installation for Arabic and Urdu text support. You can:

Let the tool install fonts automatically when needed (happens during first run)
Pre-install fonts using the included script:
```
python install-fonts.py
```
Install fonts manually if automatic installation fails:
- Amiri Regular: https://github.com/alif-type/amiri/raw/master/amiri-regular.ttf
- Noto Sans Arabic: https://github.com/googlefonts/noto-fonts/raw/main/hinted/ttf/NotoSansArabic/NotoSansArabic-Regular.ttf
- Noto Nastaliq Urdu: https://github.com/googlefonts/noto-fonts/raw/main/hinted/ttf/NotoNastaliqUrdu/NotoNastaliqUrdu-Regular.ttf

The fonts are installed to your user fonts directory (~/Library/Fonts on macOS).
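
To check by hand whether the fonts are already in place, a small helper along these lines can be used. It is only a sketch: the macOS path comes from this README, while the Linux and Windows paths and the file names (taken from the download URLs above) are assumptions, and `user_font_dir` and `missing_fonts` are hypothetical names.

```python
import platform
from pathlib import Path

# File names as they appear at the end of the download URLs listed above.
FONT_FILES = [
    "amiri-regular.ttf",
    "NotoSansArabic-Regular.ttf",
    "NotoNastaliqUrdu-Regular.ttf",
]


def user_font_dir(system=None, home=None):
    """Return the per-user font directory for the given platform."""
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    if system == "Darwin":
        return home / "Library" / "Fonts"  # documented in this README
    if system == "Windows":
        # Assumption: per-user font location on recent Windows versions.
        return home / "AppData" / "Local" / "Microsoft" / "Windows" / "Fonts"
    # Assumption: common per-user location on Linux desktops.
    return home / ".local" / "share" / "fonts"


def missing_fonts(font_dir, names=FONT_FILES):
    """List the expected font files that are not present in font_dir."""
    font_dir = Path(font_dir)
    return [name for name in names if not (font_dir / name).is_file()]
```

For example, `missing_fonts(user_font_dir())` returns the files that still need to be downloaded manually.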

How It Works

The Document Text Extractor operates through the following process:

Image/PDF Processing: The tool reads the input file and determines its media type based on the file extension.
Font Installation Check: The tool checks for required Arabic/Urdu fonts and installs them if needed.
AWS Bedrock Integration: The file is encoded in base64 and sent to Amazon Bedrock's Claude model via the invoke_model API.
Text Extraction: Claude analyzes the visual content and extracts all visible text from the document.
Document Type Analysis: The extracted text is analyzed to determine if it's a financial document by searching for keywords like "invoice," "receipt," etc. in multiple languages.
Structured Data Processing: For financial documents, the text is processed into structured sections (header information, item details, totals).
Format-Specific Output Generation: The extracted text is formatted and saved in the requested output formats with appropriate styling.

The tool leverages Claude's advanced vision capabilities to accurately extract text from various document types, including those with complex layouts or multilingual content.
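
The first steps of this pipeline (media type detection, base64 encoding, and the `invoke_model` call) can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the extension table, the function names, and the handling of the response are all hypothetical, and the PDF entry in particular is a placeholder, since which media types a model accepts in an image block depends on the model.

```python
import base64
import json
from pathlib import Path

# Assumed extension-to-media-type table; the script's own mapping may differ.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".pdf": "application/pdf",
}

DEFAULT_PROMPT = "Please extract all the text from this document."


def detect_media_type(file_path):
    """Pick a media type from the file extension (case-insensitive)."""
    ext = Path(file_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return MEDIA_TYPES[ext]


def build_request_body(file_content, media_type, prompt=DEFAULT_PROMPT):
    """Build the Claude messages payload with the file as base64 data."""
    encoded = base64.b64encode(file_content).decode("utf-8")
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4000,
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": encoded}},
            ],
        }],
    }


def extract_text(file_path, region="us-west-2", profile="default",
                 model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Send the file to Bedrock and return the text blocks of the reply."""
    import boto3  # imported here so the helpers above work without boto3

    body = build_request_body(Path(file_path).read_bytes(),
                              detect_media_type(file_path))
    session = boto3.Session(profile_name=profile, region_name=region)
    client = session.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return "".join(block["text"] for block in payload["content"]
                   if block.get("type") == "text")
```

Keeping the payload construction separate from the network call makes it possible to inspect or test the request without AWS credentials.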

Financial Document Detection and Processing

The financial document detection system works through these steps:

Keyword Detection: The tool scans the extracted text for financial keywords in multiple languages, including:
- English: invoice, bill, receipt, statement, purchase order, etc.
- Arabic: فاتورة, حساب, إيصال, أمر شراء, etc.
Section Identification: Once identified as a financial document, the text is processed using a state machine approach to identify three key sections:
- Header Section: Contains document metadata like invoice number, date, company information
- Item Section: Contains line items, quantities, prices, and descriptions
- Total Section: Contains subtotals, taxes, and final amounts
Pattern Recognition: The tool uses regular expressions to identify:
- Column headers (description, quantity, price, amount)
- Currency amounts and numerical data
- Key-value pairs (e.g., "Invoice Number: 12345")
Intelligent Formatting: Based on the identified structure, the tool creates:
- Properly aligned tables for item details
- Right-aligned numerical values
- Bold formatting for important information like totals
- Appropriate column widths based on content

This intelligent processing ensures that financial documents maintain their tabular structure and data relationships in all output formats.
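
A simplified sketch of the detection and sectioning steps is shown below. The keyword list is taken from this section; the regular expressions, the three-state machine, and all function names are assumptions made for illustration, and the script's actual rules are likely more detailed. Note that plain substring matching is a simplification ("bill" also matches "billion").

```python
import re

# Keywords listed in this README (English and Arabic).
FINANCIAL_KEYWORDS = [
    "invoice", "bill", "receipt", "statement", "purchase order",
    "فاتورة", "حساب", "إيصال", "أمر شراء",
]

# Hypothetical cues for section changes and key-value lines.
ITEM_HEADER = re.compile(r"\b(description|quantity|qty|price|amount)\b", re.I)
TOTAL_LINE = re.compile(r"\b(subtotal|total|vat|tax)\b", re.I)
KEY_VALUE = re.compile(r"^\s*([^:]{1,40}):\s*(.+)$")


def is_financial_document(text):
    """Return True if any financial keyword occurs in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in FINANCIAL_KEYWORDS)


def split_sections(text):
    """Walk the lines once, moving header -> items -> totals."""
    sections = {"header": [], "items": [], "totals": []}
    state = "header"
    for line in text.splitlines():
        if not line.strip():
            continue
        if state == "header" and ITEM_HEADER.search(line):
            state = "items"    # a column-header row starts the item table
        elif state == "items" and TOTAL_LINE.search(line):
            state = "totals"   # the first total-like line ends the items
        sections[state].append(line.strip())
    return sections


def parse_key_values(lines):
    """Collect 'Key: value' pairs such as 'Invoice Number: 12345'."""
    pairs = {}
    for line in lines:
        match = KEY_VALUE.match(line)
        if match:
            pairs[match.group(1).strip()] = match.group(2).strip()
    return pairs
```

With this sketch, `parse_key_values(split_sections(text)["header"])` would turn header lines such as "Invoice Number: 12345" into a dictionary, while the item and total lines stay available for table formatting.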

Performance Considerations

When using the Document Text Extractor, keep these performance factors in mind:

File Size and Complexity:
- Larger files (>10MB) may take longer to process
- Complex layouts with multiple columns, tables, and mixed text/graphics require more processing time
- Consider compressing large images before processing
AWS Bedrock Quotas and Limits:
- Be aware of your AWS Bedrock service quotas for API calls
- Claude models have token limits (both input and output)
- Very large documents may need to be split and processed page by page
Network Considerations:
- Ensure stable internet connection for API calls
- Processing time includes network latency for sending files to AWS
Memory Usage:
- Processing large PDF files may require significant memory
- For multi-page documents, consider processing one page at a time
Cost Optimization:
- Claude API calls are billed based on input and output tokens
- Consider using smaller image resolutions when possible
- Batch processing multiple documents can be more efficient than individual calls
Font Installation:
- First-time font installation may add processing time
- Subsequent runs will be faster as fonts are already installed
- The tool now installs fonts in a background thread to avoid blocking the main process

For optimal performance, we recommend processing files under 5MB and ensuring your AWS account has appropriate rate limits for your expected usage volume.
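
A quick pre-flight check against these size guidelines could look like the sketch below. The 5MB and 10MB thresholds come from this section; treating 1MB as 1024 * 1024 bytes and the names `size_advice` and `check_file` are assumptions.

```python
import os

RECOMMENDED_MAX = 5 * 1024 * 1024   # "under 5MB" recommendation
SLOW_THRESHOLD = 10 * 1024 * 1024   # files above this may be slow


def size_advice(size_bytes):
    """Classify a file size as 'ok', 'large', or 'slow'."""
    if size_bytes > SLOW_THRESHOLD:
        return "slow"
    if size_bytes > RECOMMENDED_MAX:
        return "large"
    return "ok"


def check_file(path):
    """Classify the file at path by its size on disk."""
    return size_advice(os.path.getsize(path))
```

A result of "large" or "slow" is a hint to compress the image or split the document before sending it to Bedrock.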

Customizing the Claude Prompt

The tool uses a default prompt to instruct Claude on how to extract text from documents. You can customize this prompt by modifying the script:

Default Prompt: The current default prompt is simple and direct:
```
"Please extract all the text from this document."
```
Customization Options: You can modify the prompt in the main() function to provide more specific instructions, such as:
- Focusing on specific sections of the document
- Requesting particular formatting or organization of the extracted text
- Asking for additional analysis of the document content
- Specifying how to handle tables, charts, or other non-text elements

Example Custom Prompts:

# For financial documents
"Please extract all text from this invoice. Organize it into sections for header information, line items, and totals."
# For multilingual documents
"Extract all text from this document, preserving both English and Arabic content. Indicate which sections are in which language."
# For forms
"Extract all form fields and their values from this document. Format as field:value pairs."

Implementation: To customize the prompt, locate this section in the code:

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4000,
    "temperature": 0,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Please extract all the text from this document."  # Modify this line
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(file_content).decode('utf-8')
                    }
                }
            ]
        }
    ]
}

Best Practices:
- Keep prompts clear and specific
- Test different prompts to find what works best for your document types
- Consider adding command-line options to select different pre-defined prompts

Customizing the prompt can significantly improve extraction quality for specific document types or use cases.
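
One way to act on the suggestion of selectable pre-defined prompts is sketched below. The `--prompt-type` option does not exist in the script; it and the names `PROMPTS`, `add_prompt_option`, and `resolve_prompt` are hypothetical. The prompt texts are the examples from this section.

```python
import argparse

# Prompt texts copied from the examples in this README.
PROMPTS = {
    "default": "Please extract all the text from this document.",
    "financial": ("Please extract all text from this invoice. Organize it "
                  "into sections for header information, line items, and "
                  "totals."),
    "multilingual": ("Extract all text from this document, preserving both "
                     "English and Arabic content. Indicate which sections "
                     "are in which language."),
    "form": ("Extract all form fields and their values from this document. "
             "Format as field:value pairs."),
}


def add_prompt_option(parser):
    """Add a hypothetical --prompt-type option to an existing parser."""
    parser.add_argument("--prompt-type", choices=sorted(PROMPTS),
                        default="default",
                        help="Which pre-defined prompt to send to Claude")
    return parser


def resolve_prompt(prompt_type):
    """Look up the prompt text for the selected type."""
    return PROMPTS[prompt_type]
```

The resolved text would then replace the hard-coded string in the `"text"` field of the request body.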

Output Files

The tool generates output files in the same directory as the input file, with the following naming pattern:

[original_filename]_extracted.txt
[original_filename]_extracted.docx
[original_filename]_extracted.xlsx
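
The naming pattern can be reproduced with `pathlib`, as in this small sketch (the function name `output_paths` is illustrative, not taken from the script):

```python
from pathlib import Path


def output_paths(input_path, formats=("txt", "docx", "xlsx")):
    """Map each format to '<stem>_extracted.<format>' next to the input."""
    source = Path(input_path)
    return {fmt: source.with_name(f"{source.stem}_extracted.{fmt}")
            for fmt in formats}
```

For `UAE-Invoice-Template-1.jpg` this yields `UAE-Invoice-Template-1_extracted.txt`, `.docx`, and `.xlsx` in the same directory, matching the sample files in the repository.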

AWS Permissions Required

bedrock:InvokeModel permission for the Claude model being used
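
As an illustration, a least-privilege IAM policy granting this permission for a single model can be generated as follows. The region and model ID used in the usage note are the tool's documented defaults; the helper names are hypothetical, and your account may need additional statements (for example, for model access in other regions).

```python
import json


def invoke_model_policy(region, model_id):
    """Build an IAM policy allowing InvokeModel on one foundation model."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            # Foundation-model ARNs have an empty account field.
            "Resource": (f"arn:aws:bedrock:{region}::"
                         f"foundation-model/{model_id}"),
        }],
    }


def policy_json(region, model_id):
    """Render the policy as indented JSON, ready to paste into IAM."""
    return json.dumps(invoke_model_policy(region, model_id), indent=2)
```

Calling `policy_json("us-west-2", "anthropic.claude-3-5-sonnet-20240620-v1:0")` produces a document that can be attached to the IAM user or role behind your AWS profile.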

Limitations

Currently only supports Claude models from Amazon Bedrock
PDF extraction quality depends on the PDF's content (scanned vs. digital)
Some complex document layouts may not be perfectly preserved
PDF output generation is currently disabled due to compatibility issues

Troubleshooting

Font Issues:
- If Arabic/Urdu text doesn't display correctly, run python install-fonts.py to install fonts
- The tool will attempt to install fonts automatically, but may require manual installation in some cases
- Check the console output for font installation status and errors
AWS Errors: Ensure your AWS credentials have access to Amazon Bedrock and the specified model
Missing Dependencies: Install any missing Python packages as prompted
Download Issues: The tool can download fonts with either the requests library or curl, so if one method fails the other serves as a fallback
PDF Generation: PDF output is currently disabled due to compatibility issues. Use the other output formats (TXT, DOCX, XLSX) instead.
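
To find missing dependencies before running the tool, a check like the following sketch can help. The package list matches the `pip install` command in this README; the import names (for example `docx` for python-docx) and the function name `missing_packages` are stated here as assumptions about how the script imports them.

```python
import importlib.util

# pip package name -> module name used in import statements
REQUIRED = {
    "boto3": "boto3",
    "openpyxl": "openpyxl",
    "fpdf": "fpdf",
    "python-docx": "docx",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [package for package, module in required.items()
            if importlib.util.find_spec(module) is None]
```

If `missing_packages()` returns a non-empty list, the listed names can be passed directly to `pip install`.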

License

This project is licensed under the MIT License - see the LICENSE file for details.

smustafa75/doc-reader
