s3 assets streaming and conversion? #367

kimyu-ng started this conversation in General

Would it be possible to tweak the following example to stream a PDF page by page, without loading the whole PDF into memory?

# Existing approach:
# download from cloud storage, load the whole PDF into memory,
# and then perform the conversion
file_name = "a.pdf"
pdf = Vips::Image.pdfload(file_name, access: :sequential)
n_pages = pdf.get('n-pages')
(0...n_pages).each do |page_index|
  # reload the file to select one page at a time
  page = Vips::Image.pdfload(file_name, access: :sequential, page: page_index)
  page.write_to_file("page_#{page_index}.png", Q: 100)
end

Also, it seems weird that we have to initialize the object again to get to a particular page. Shouldn't there be an API like pdf.get_page(index) which is equivalent to Vips::Image.pdfload(file_name, access: :sequential, page: page_index)?


Hi @kimyu92,

Unfortunately PDFs put a lot of document information at the end of the file, so you usually need to scan the whole thing before starting. As soon as you call poppler_document_new_from_stream(), the first thing it does is read the whole file.

Perhaps pdfium is less greedy? I've not tested it for this.

libvips will render a page at a time, so the actual rendering process shouldn't need that much memory.
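Since the loader needs the complete file anyway, one option is to fetch the object once into a String and render pages from that buffer one at a time. The following is only a sketch: `each_pdf_page` and its `loader:` parameter are hypothetical names, the download step is left to whatever storage client you use, and it assumes ruby-vips's `Vips::Image.new_from_buffer` plus the `n-pages` field used in the question.

```ruby
# Hypothetical helper (not part of ruby-vips): yields [page_index, image]
# for each page of a PDF held in memory as a binary String, for example
# the body of an object fetched from cloud storage.
#
# `loader` is injectable so the control flow can be exercised without
# libvips; by default it calls Vips::Image.new_from_buffer.
def each_pdf_page(data, loader: nil)
  loader ||= lambda do |bytes, **opts|
    require "vips"
    Vips::Image.new_from_buffer(bytes, "", **opts)
  end

  # First load reads the header only to find out how many pages there are.
  n_pages = loader.call(data).get("n-pages")

  # Reload per page so that pages of different sizes are handled too.
  n_pages.times do |page_index|
    yield page_index, loader.call(data, page: page_index)
  end
end

# Usage, assuming ruby-vips is installed and `data` holds the PDF bytes:
#   each_pdf_page(data) { |i, page| page.write_to_file("page_#{i}.png") }
```

The compressed PDF bytes stay in memory for the whole loop, but only one rendered page needs to exist at a time.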

> Also, it seems weird that we have to initialize the object again to get to a particular page.

This is a consequence of the way that libvips handles multipage images -- it represents them as a single very tall, thin image, with the pages joined together vertically (a "toilet-roll" image, sorry). If your PDF has pages that are all the same size (for example, it has no pages in landscape), then you can load the whole PDF in one go and loop over pages without reinitialisation.

Sadly many PDFs are not like this, so to work for all PDFs, where each page can be a different size, you need to reinitialise.

With a PDF where all pages are the same size you can do:

$ irb 
irb(main):001:0> require 'vips'
=> true
irb(main):002:0> x = Vips::Image.new_from_file "nipguide.pdf", n: -1
=> #<Image 595x48836 uchar, 4 bands, srgb>
irb(main):003:0> x.get "page-height"
=> 842
irb(main):004:0> 

Then you can use crop to pull out pages and libvips will render them to bitmaps on demand.
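As a rough sketch of that crop loop: `split_pages` is a hypothetical helper name, and it assumes the image was loaded with `n: -1` and carries a `page-height` field, as in the irb session above.

```ruby
# Hypothetical helper (not part of ruby-vips): split a tall, thin
# multipage image into one image per page.
#
# `image` is expected to respond to width, height, get and crop, as a
# Vips::Image loaded with n: -1 does.
def split_pages(image)
  page_height = image.get("page-height")
  n_pages = image.height / page_height

  (0...n_pages).map do |page_index|
    # crop(left, top, width, height) selects one page-sized strip
    image.crop(0, page_index * page_height, image.width, page_height)
  end
end

# Usage, assuming ruby-vips is installed:
#   require "vips"
#   roll = Vips::Image.new_from_file("nipguide.pdf", n: -1)
#   split_pages(roll).each_with_index do |page, i|
#     page.write_to_file("page_#{i}.png")
#   end
```

With the 595x48836 image above and a page height of 842, this gives 58 crops.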

pyvips has pagesplit() and pagejoin() convenience methods to turn these tall, thin images into arrays of page images. We should probably add them to ruby-vips as well.


pagesplit() and pagejoin() would definitely be a great addition for PDFs whose pages are all the same size. However, I think even adding Vips::Image#get_page to abstract the reinitialization would be a great addition.

def get_page(page_index)
  # file_name is assumed to be the path the image was originally loaded from
  Vips::Image.pdfload(file_name, access: :sequential, page: page_index)
end

Also, is there a way to get the file size from a PDF? I couldn't find one 🤦‍♂️ At least it's not listed in get_fields.


> is there a way to get file_size from pdf

You can use any file size API, I don't think ruby-vips needs to duplicate this, does it?
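For example, plain Ruby already covers both the on-disk and the in-memory case. `byte_sizes` below is a hypothetical helper name used only for illustration:

```ruby
# Hypothetical helper: report the size of a file in bytes, both as stored
# on disk and as a binary String read into memory.
def byte_sizes(path)
  {
    on_disk: File.size(path),                 # size reported by the filesystem
    in_memory: File.binread(path).bytesize    # length of the bytes once loaded
  }
end

# Usage:
#   byte_sizes("a.pdf")
```

Neither call involves libvips, so the same approach works for any image or document format.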


Maybe. 😅 A #get_size method would probably be convenient for both images and PDFs.

I do think we may also want to consider moving PDF instantiation into its own class. Vips::Pdf.load seems more fluent, and instance methods like #pages or #meta should make it more intuitive to use from an OOP standpoint.


Ah maybe. Though this page stuff works for any multipage format, so it would need to include GIF, WEBP, TIFF, HEIC, AVIF, etc. etc. And reinit is only necessary if the page size changes, so having two APIs is useful.
