Add PDF Image Extractor script with README documentation #500

Open
gracetyy wants to merge 1 commit into wasmerio:main
from gracetyy:pdf_image_extractor

Conversation

@gracetyy gracetyy commented Oct 20, 2025

This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.

  • All extracted images are saved in a subfolder named PDF within the input root directory by default (customizable via --out).
  • Each PDF file is organized into its own folder, containing all images extracted from that document.
  • The script supports an optional --dedup flag to enable per-PDF deduplication of images.
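For reference, the options described above could be parsed roughly as follows. This is a minimal sketch rather than the script's actual code: the positional argument name `root` and the helper name `parse_args` are hypothetical, while `--out`, `--dedup`, and the default `PDF` subfolder come from the description above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the CLI options described in this PR (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Recursively extract embedded images from PDF files."
    )
    # Hypothetical positional name; the real script may call it something else.
    parser.add_argument("root", help="directory to scan recursively for PDFs")
    parser.add_argument(
        "--out",
        default=None,
        help="output directory (default: a 'PDF' subfolder inside root)",
    )
    parser.add_argument(
        "--dedup",
        action="store_true",
        help="skip duplicate images within each PDF",
    )
    args = parser.parse_args(argv)
    if args.out is None:
        # Default described in the PR: <root>/PDF
        args.out = os.path.join(args.root, "PDF")
    return args
```

Calling `parse_args(["docs", "--dedup"])` would, under these assumptions, resolve the output directory to `docs/PDF` with deduplication enabled.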

Additional notes:

  • Please let me know if you’d like any changes to the folder naming or CLI options.
  • Happy to update documentation or add more examples if needed.

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new utility script that recursively extracts all embedded images from PDF files in a directory tree. The script uses PyMuPDF (fitz) to process PDFs and supports optional deduplication of images per document.

Key changes:

  • Adds pdf_image_extractor.py with command-line interface for PDF image extraction
  • Includes comprehensive README with usage examples and documentation
  • Supports customizable output directory and per-PDF deduplication options

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
PDF Image Extractor/pdf_image_extractor.py | Main script implementing recursive PDF scanning and image extraction logic with CLI argument parsing
PDF Image Extractor/README.md | Documentation covering requirements, usage, CLI options, and output structure

Code Quality Observations:

The implementation is generally well-structured with clear separation of concerns. However, there are a few technical issues to address:

  1. Potential crash with os.path.commonpath() (lines 14-16): The code calls os.path.commonpath([pdf_path, output_root]), which raises a ValueError when the paths are on different drives (Windows) or when one path is absolute and the other relative. The script would therefore crash whenever a user specifies an output directory on a different drive. The logic appears intended to mirror the input directory structure, but using the common path as the base is fragile. A simpler approach is to compute each PDF's path relative to the input pdf_dir directory.

  2. Inefficient directory creation logic (lines 35-36): The condition if img_count == 0 and not os.path.exists(output_folder) creates the directory just before the first image is written. Because os.makedirs() is already called with exist_ok=True, the os.path.exists() check is redundant. It would be clearer to create the directory once before the loop when there are images to extract.

  3. Redundant deduplication checks (lines 27-30): The code checks if dedup twice - once to skip duplicates and again to add to the set. This could be simplified to a single conditional block.

  4. Missing requirements.txt: Several other projects in this repository include a requirements.txt file (e.g., PDF Merger, Image Watermarker, Image to ASCII). Adding one for this project would improve consistency and make dependency installation clearer for users.

  5. Missing error handling for image extraction: If doc.extract_image(xref) fails (line 32), the script will crash. While PyMuPDF is generally robust, adding a try-except block would make the script more resilient.
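
Items 1, 2, 3, and 5 could be addressed along the following lines. This is a minimal sketch, not the PR's code: the function names `output_folder_for` and `extract_images`, the `image_N` file naming, and the use of the xref as the deduplication key are all assumptions made for illustration. Only `doc.extract_image(xref)` and the `"ext"` and `"image"` keys of its result come from PyMuPDF.

```python
import os


def output_folder_for(pdf_path, pdf_dir, output_root):
    """Mirror the PDF's location under pdf_dir inside output_root."""
    # The relative path is computed against the input directory only, so
    # output_root never enters the calculation and may sit on another drive.
    # pdf_path is assumed to have been found by scanning pdf_dir.
    rel = os.path.relpath(pdf_path, start=pdf_dir)
    stem, _ext = os.path.splitext(rel)
    return os.path.join(output_root, stem)


def extract_images(doc, xrefs, output_folder, dedup=False):
    """Write each image to output_folder and return the written paths."""
    written = []
    seen = set()
    if xrefs:
        # Create the folder once, before the loop (item 2).
        os.makedirs(output_folder, exist_ok=True)
    for xref in xrefs:
        if dedup:
            # Single conditional block for deduplication (item 3).
            # Assumption: duplicates are identified by xref.
            if xref in seen:
                continue
            seen.add(xref)
        try:
            info = doc.extract_image(xref)
        except Exception as exc:
            # Skip the failing image instead of crashing (item 5).
            print(f"Skipping image xref {xref}: {exc}")
            continue
        if not info:
            # Guard against an empty result for a non-image xref.
            continue
        name = f"image_{len(written) + 1}.{info['ext']}"
        path = os.path.join(output_folder, name)
        with open(path, "wb") as fh:
            fh.write(info["image"])
        written.append(path)
    return written
```

With PyMuPDF, the xrefs for a page could be collected from `page.get_images(full=True)`, where the first element of each returned tuple is the xref. How the PR's script gathers them is not shown here, so that part is left to the caller.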

Documentation:
The README is well-written with clear examples and appropriate detail. The structure follows good practices with separate sections for requirements, usage, examples, and output structure.


