Add PDF Image Extractor script with README documentation #500

Open

gracetyy wants to merge 1 commit into wasmerio:main

from gracetyy:pdf_image_extractor

Conversation

gracetyy commented on Oct 20, 2025

This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.

All extracted images are saved in a subfolder named PDF within the input root directory by default (customizable via --out).
Each PDF file is organized into its own folder, containing all images extracted from that document.
The script supports an optional --dedup flag to enable per-PDF deduplication of images.
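
As a rough illustration of the behaviour described above, the CLI and the recursive scan could look something like the sketch below. This is not the code from the PR: the function names (`build_parser`, `resolve_output_root`, `find_pdfs`), the positional argument name `root`, and the case-insensitive `.pdf` match are all assumptions made for the example. Only the `--out` and `--dedup` options and the default `PDF` subfolder come from the description.

```python
import argparse
import os
from typing import Iterator, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI: a root directory plus the --out and --dedup options."""
    parser = argparse.ArgumentParser(
        description="Extract embedded images from every PDF under a directory tree."
    )
    parser.add_argument("root", help="directory to scan recursively for PDF files")
    parser.add_argument("--out", default=None,
                        help="output root (default: a 'PDF' folder inside root)")
    parser.add_argument("--dedup", action="store_true",
                        help="skip duplicate images within each PDF")
    return parser


def resolve_output_root(root: str, out: Optional[str]) -> str:
    """Fall back to <root>/PDF when --out is not given."""
    return out if out else os.path.join(root, "PDF")


def find_pdfs(root: str) -> Iterator[str]:
    """Yield the path of every PDF file under root, walking subdirectories."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Assumption: match the extension case-insensitively.
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

With this sketch, `build_parser().parse_args(["docs", "--dedup"])` yields `dedup=True` and an output root of `docs/PDF` once passed through `resolve_output_root`.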

Additional notes:

Please let me know if you’d like any changes to the folder naming or CLI options.
Happy to update documentation or add more examples if needed.

Commit d68586e: Add PDF Image Extractor script with README documentation

@DhanushNehru DhanushNehru requested a review from Copilot

November 25, 2025 08:59

Copilot started reviewing on behalf of DhanushNehru

November 25, 2025 09:00

Copilot finished reviewing on behalf of DhanushNehru

November 25, 2025 09:01

Copilot AI reviewed

Nov 25, 2025

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new utility script that recursively extracts all embedded images from PDF files in a directory tree. The script uses PyMuPDF (fitz) to process PDFs and supports optional deduplication of images per document.

Key changes:

Adds pdf_image_extractor.py with command-line interface for PDF image extraction
Includes comprehensive README with usage examples and documentation
Supports customizable output directory and per-PDF deduplication options

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
PDF Image Extractor/pdf_image_extractor.py	Main script implementing recursive PDF scanning and image extraction logic with CLI argument parsing
PDF Image Extractor/README.md	Documentation covering requirements, usage, CLI options, and output structure

Code Quality Observations:

The implementation is generally well-structured with clear separation of concerns. However, there are a few technical issues to address:

Potential crash with os.path.commonpath() (lines 14-16): The code calls os.path.commonpath([pdf_path, output_root]), which raises a ValueError on Windows when the paths are on different drives, and on any platform when the list mixes absolute and relative paths. This would crash the script whenever a user specifies an output directory on a different drive. The logic appears intended to mirror the input directory structure, but the common path is the wrong base for that. A simpler approach would be to compute each PDF's path relative to the input directory (pdf_dir).
Inefficient directory creation logic (lines 35-36): The condition if img_count == 0 and not os.path.exists(output_folder) creates the directory just before the first image is written. Because os.makedirs() is already called with exist_ok=True, the os.path.exists() check is redundant. It would be clearer to create the directory once before the loop if there are images to extract.
Redundant deduplication checks (lines 27-30): The code tests if dedup twice: once to skip duplicates and again to add the image to the set. This could be simplified to a single conditional block.
Missing requirements.txt: Several other projects in this repository include a requirements.txt file (e.g., PDF Merger, Image Watermarker, Image to ASCII). Adding one for this project would improve consistency and make dependency installation clearer for users.
Missing error handling for image extraction: If doc.extract_image(xref) fails (line 32), the script will crash. While PyMuPDF is generally robust, adding a try-except block would make the script more resilient.
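
The observations above could be addressed along the lines of the following sketch. It is an illustration under stated assumptions, not the script from the PR: `output_folder_for` and `save_images` are hypothetical names, the output file naming is invented, and deduplication by SHA-256 of the image bytes is assumed because the review does not say how the script detects duplicates. The `extract` callable is expected to behave like PyMuPDF's `Document.extract_image(xref)`, which returns a dict containing the `"image"` bytes and an `"ext"` string. Passing it in as a parameter keeps the sketch runnable without PyMuPDF installed.

```python
import hashlib
import os
from typing import Callable, Iterable


def output_folder_for(pdf_path: str, pdf_dir: str, output_root: str) -> str:
    """Mirror the PDF's location under pdf_dir inside output_root.

    Uses the path relative to the input directory instead of
    os.path.commonpath(), so the output root may live on another drive.
    """
    rel = os.path.relpath(pdf_path, start=pdf_dir)
    rel_without_ext, _ext = os.path.splitext(rel)
    return os.path.join(output_root, rel_without_ext)


def save_images(xrefs: Iterable[int],
                extract: Callable[[int], dict],
                output_folder: str,
                dedup: bool = False) -> int:
    """Write each extracted image to output_folder and return the count saved."""
    seen = set()
    saved = 0
    for xref in xrefs:
        try:
            info = extract(xref)
        except Exception as exc:  # keep going if a single image fails
            print(f"Skipping xref {xref}: {exc}")
            continue
        if not info:
            continue
        data = info["image"]
        if dedup:  # one conditional block: check and record together
            digest = hashlib.sha256(data).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
        if saved == 0:
            # exist_ok=True makes a separate os.path.exists() check unnecessary.
            os.makedirs(output_folder, exist_ok=True)
        saved += 1
        target = os.path.join(output_folder, f"image_{saved}.{info['ext']}")
        with open(target, "wb") as fh:
            fh.write(data)
    return saved
```

In a real run, `extract` would be `doc.extract_image` for an open PyMuPDF document. The directory is created lazily on the first successful write rather than before the loop, so a PDF whose images all fail or are skipped does not leave an empty folder behind. That is a design choice of this sketch, not something the review prescribes.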

Documentation:
The README is well-written with clear examples and appropriate detail. The structure follows good practices with separate sections for requirements, usage, examples, and output structure.
