A TypeScript utility that converts PDF documents into structured JSON data while preserving text content, formatting, and hyperlinks. Perfect for resume parsing, document analysis, and content extraction workflows.
- Text Extraction: Extract text content with precise positioning and styling
- Hyperlink Detection: Capture clickable links with their coordinates and target URLs
- Font Preservation: Maintains font information for each text element
- Multi-page Support: Processes documents of any length
- Type Safety: Built with TypeScript for better development experience
- Lightweight: Minimal dependencies
Make sure you have the following installed on your system:
- Node.js (v16 or higher)
- npm (v7 or higher) or yarn
Using npm:
npm install @shilendra-dev/pdf-to-json
Or using yarn:
yarn add @shilendra-dev/pdf-to-json
This package requires the following peer dependencies which will be installed automatically:
- pdfjs-dist: ^3.4.120 (PDF.js library for PDF parsing)
- @types/node: ^18.0.0 (TypeScript types for Node.js)
    import { pdfToJson } from '@shilendra-dev/pdf-to-json';
    import fs from 'fs/promises';

    async function convertPdfToJson() {
      try {
        // Read PDF file
        const pdfBuffer = await fs.readFile('path/to/your/document.pdf');

        // Convert to JSON
        const result = await pdfToJson(pdfBuffer, {
          outputPath: 'output.json' // Optional: Path to save the JSON output
        });

        console.log('Conversion complete!');
        console.log(`Processed ${result.numPages} pages`);
      } catch (error) {
        console.error('Error converting PDF:', error);
      }
    }

    convertPdfToJson();
Converts a PDF document to JSON.
Parameters:
- pdfSource: PDF file as a Buffer or a file path
- options: (Optional) Configuration options
  - outputPath: (string) Path to save the JSON output file
  - includeTextContent: (boolean) Whether to include raw text content (default: true)
  - includeStyles: (boolean) Whether to include font and style information (default: true)
  - includeLinks: (boolean) Whether to include hyperlinks (default: true)
Returns: Promise that resolves to the parsed PDF data
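The documented defaults can be read as a simple merge of caller options over fixed fallbacks. The sketch below only illustrates that behavior. `PdfToJsonOptions`, `ResolvedOptions`, and `resolveOptions` are hypothetical names chosen for this example and are not exported by the package, whose internal handling may differ.

```typescript
// Hypothetical option shape mirroring the parameters documented above.
interface PdfToJsonOptions {
  outputPath?: string;
  includeTextContent?: boolean;
  includeStyles?: boolean;
  includeLinks?: boolean;
}

// After defaults are applied, every include* flag has a concrete value.
interface ResolvedOptions {
  outputPath: string | undefined;
  includeTextContent: boolean;
  includeStyles: boolean;
  includeLinks: boolean;
}

// Assumed behavior: each include* flag falls back to true when omitted,
// matching the "(default: true)" notes in the parameter list.
function resolveOptions(options: PdfToJsonOptions = {}): ResolvedOptions {
  return {
    outputPath: options.outputPath,
    includeTextContent: options.includeTextContent ?? true,
    includeStyles: options.includeStyles ?? true,
    includeLinks: options.includeLinks ?? true,
  };
}

// Example: skip hyperlinks but keep text and style information.
const textOnly = resolveOptions({ includeLinks: false });
console.log(textOnly);
```

Under this reading, passing only `{ includeLinks: false }` would still produce text content and style information, since the other two flags keep their defaults.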
The converter generates a JSON object with the following structure:
    {
      numPages: number;
      pages: Array<{
        pageNumber: number;
        width: number;
        height: number;
        items: Array<{
          type: 'text' | 'link';
          content: string;
          x: number;
          y: number;
          width: number;
          height: number;
          fontFamily?: string;
          fontSize?: number;
          color?: string;
          url?: string; // For links
        }>;
      }>;
    }
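As one way to work with this structure, the sketch below declares local types that mirror it and collects every hyperlink across all pages. The names `PdfItem`, `PdfPage`, `PdfJson`, `LinkRef`, and `collectLinks` are hypothetical and defined only for this example. Whether the package exports equivalent types is not stated here, so the sketch does not import any.

```typescript
// Local types mirroring the output structure documented above.
interface PdfItem {
  type: 'text' | 'link';
  content: string;
  x: number;
  y: number;
  width: number;
  height: number;
  fontFamily?: string;
  fontSize?: number;
  color?: string;
  url?: string; // Present on link items
}

interface PdfPage {
  pageNumber: number;
  width: number;
  height: number;
  items: PdfItem[];
}

interface PdfJson {
  numPages: number;
  pages: PdfPage[];
}

// Hypothetical summary record for one hyperlink.
interface LinkRef {
  pageNumber: number;
  text: string;
  url: string;
}

// Walk every page and keep only link items that carry a URL.
function collectLinks(doc: PdfJson): LinkRef[] {
  const links: LinkRef[] = [];
  for (const page of doc.pages) {
    for (const item of page.items) {
      if (item.type === 'link' && item.url !== undefined) {
        links.push({ pageNumber: page.pageNumber, text: item.content, url: item.url });
      }
    }
  }
  return links;
}

// Hand-written sample data for illustration, not real converter output.
const sample: PdfJson = {
  numPages: 1,
  pages: [
    {
      pageNumber: 1,
      width: 612,
      height: 792,
      items: [
        { type: 'text', content: 'Jane Doe', x: 72, y: 720, width: 80, height: 14 },
        { type: 'link', content: 'Portfolio', x: 72, y: 700, width: 60, height: 12, url: 'https://example.com' },
      ],
    },
  ],
};

console.log(collectLinks(sample));
```

For resume parsing, a list like this is a convenient starting point for pulling out profile or portfolio URLs without walking the page structure each time.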
    import { pdfToJson } from '@shilendra-dev/pdf-to-json';

    // Convert PDF from URL
    const response = await fetch('https://example.com/document.pdf');
    const pdfBuffer = await response.arrayBuffer();
    const result = await pdfToJson(Buffer.from(pdfBuffer));

    // Process the extracted data
    result.pages.forEach(page => {
      console.log(`Page ${page.pageNumber} (${page.width}x${page.height}):`);
      console.log(`- Contains ${page.items.length} text items`);
    });
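Because each item carries `x` and `y` coordinates, a common follow-up step is to rebuild lines of text from a page's items. The sketch below is one possible approach, under two assumptions that are not confirmed by this document: items on the same visual line share roughly the same `y` value, and the units make a tolerance of about 2 reasonable. `PositionedItem` and `groupIntoLines` are hypothetical names for this example.

```typescript
// Minimal subset of an item's fields needed for line grouping.
interface PositionedItem {
  content: string;
  x: number;
  y: number;
}

// Group items whose y values are within `tolerance` of the first item on
// the line, then order each line left to right by x.
function groupIntoLines(items: PositionedItem[], tolerance = 2): string[] {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);

  const lines: { y: number; items: PositionedItem[] }[] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last !== undefined && Math.abs(item.y - last.y) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }

  return lines.map(line =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map(item => item.content)
      .join(' ')
  );
}

// Hand-written sample data for illustration.
const lines = groupIntoLines([
  { content: 'World', x: 50, y: 100 },
  { content: 'Hello', x: 10, y: 101 },
  { content: 'Next', x: 10, y: 120 },
]);
console.log(lines);
```

Lines come back in ascending `y` order. If the coordinates follow the PDF convention of an origin at the bottom-left corner, reverse the returned array to read the page top to bottom.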
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Shilendra Singh