Would it be possible to tweak the following example to stream pdf page by page without loading the whole pdf to memory?
# Existing approach:
# download from cloud storage, load the whole pdf to memory,
# and then perform the conversion
file_name = "a.pdf"
pdf = Vips::Image.pdfload(file_name, access: :sequential)
n_pages = pdf.get('n-pages')
(0...n_pages).each do |page_index|
  pdf = Vips::Image.pdfload(file_name, access: :sequential, page: page_index)
  pdf.write_to_file("page_#{page_index}.png", Q: 100)
end
Also, it seems weird that we have to initialize the object again to get to a particular page. Shouldn't there be an API like pdf.get_page(index) which is equivalent to Vips::Image.pdfload(file_name, access: :sequential, page: page_index)?
Hi @kimyu92,
Unfortunately PDFs put a lot of document information at the end of the file, so you usually need to scan the whole thing before starting. As soon as you call poppler_document_new_from_stream(), the first thing it does is read the whole file.
Perhaps pdfium is less greedy? I've not tested it for this.
libvips will render a page at a time, so the actual rendering process shouldn't need that much memory.
Also, it seems weird that we have to initialize the object again to get to a particular page.
This is a consequence of the way that libvips handles multipage images -- it represents them as a single very tall, thin image, with the pages joined together vertically (a "toilet-roll" image, sorry). If your PDF has pages that are all the same size (for example, it has no pages in landscape), then you can load the whole PDF in one go and loop over pages without reinitialisation.
Sadly many PDFs are not like this, so to work for all PDFs, where each page can be a different size, you need to reinitialise.
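A minimal sketch of that per-page reinitialisation, wrapped in a helper so the loop reads more cleanly. The name each_pdf_page and its loader: keyword are hypothetical and not part of ruby-vips; the only library calls used are pdfload with the page: option and get('n-pages'), both taken from the question above.

```ruby
# Hypothetical helper, not part of ruby-vips.
# The default loader is Vips::Image, so `require "vips"` must have been
# run before calling this without an explicit loader.
def each_pdf_page(file_name, loader: Vips::Image)
  # The first load is only used to read the page count.
  n_pages = loader.pdfload(file_name, access: :sequential).get("n-pages")

  (0...n_pages).each do |page_index|
    # Reinitialise for every page so pages of different sizes work.
    page = loader.pdfload(file_name, access: :sequential, page: page_index)
    yield page, page_index
  end
end

# Usage (assumes the vips gem is installed and a.pdf exists):
#
#   require "vips"
#   each_pdf_page("a.pdf") do |page, i|
#     page.write_to_file("page_#{i}.png")
#   end
```

Only one page image is handed to the block at a time, although the PDF library may still read the whole file when the document is opened, as noted above.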
With a PDF where all pages are the same size you can do:
$ irb
irb(main):001:0> require 'vips'
=> true
irb(main):002:0> x = Vips::Image.new_from_file "nipguide.pdf", n: -1
=> #<Image 595x48836 uchar, 4 bands, srgb>
irb(main):003:0> x.get "page-height"
=> 842
irb(main):004:0>
Then you can use crop to pull out pages and libvips will render them to bitmaps on demand.
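As a rough illustration of the crop approach, the sketch below cuts a tall, thin image into one crop per page. split_pages is a hypothetical name, and the code assumes every page has the same height, as reported by the "page-height" field shown in the session above.

```ruby
# Hypothetical helper, not part of ruby-vips.
# Accepts any object that responds to width, height, get and crop,
# for example a Vips::Image loaded with n: -1.
def split_pages(image)
  page_height = image.get("page-height")
  n_pages = image.height / page_height

  (0...n_pages).map do |page_index|
    # crop(left, top, width, height): one page-sized strip per page
    image.crop(0, page_index * page_height, image.width, page_height)
  end
end

# Usage (assumes the vips gem is installed and that all pages of
# nipguide.pdf are the same size):
#
#   require "vips"
#   x = Vips::Image.new_from_file "nipguide.pdf", n: -1
#   split_pages(x).each_with_index do |page, i|
#     page.write_to_file "page_#{i}.png"
#   end
```

For the 595x48836 image above with a page height of 842, this gives 58 crops.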
pyvips has pagesplit() and pagejoin() convenience methods to turn these tall, thin images into arrays of page images. We should probably add them to ruby-vips as well.
pagesplit() and pagejoin() would definitely be great additions for PDFs whose pages are all the same size. However, I think adding Vips::Image#get_page to abstract away the reinitialization would also be a great addition.
def get_page(index)
  Vips::Image.pdfload(file_name, access: :sequential, page: index)
end
Also, is there a way to get file_size from a PDF? I couldn't find one 🤦‍♂️; at least it's not listed in get_fields.
is there a way to get file_size from pdf
You can use any file size API, I don't think ruby-vips needs to duplicate this, does it?
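For instance, Ruby's core File.size returns the size of a file on disk in bytes without involving libvips. The wrapper name below is only illustrative.

```ruby
# Illustrative wrapper; File.size is part of Ruby core.
def pdf_size_in_bytes(path)
  File.size(path)
end

# Usage (assumes a.pdf exists in the current directory):
#
#   puts pdf_size_in_bytes("a.pdf")
```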
Maybe. 😅 A #get_size method would probably be convenient for both images and PDFs.
I also think we may want to consider moving PDF instantiation into its own class. Vips::Pdf.load seems more fluent, and instance methods like #pages or #meta would make it more intuitive to use from an OOP standpoint.
Ah maybe. Though this page stuff works for any multipage format, so it would need to include GIF, WEBP, TIFF, HEIC, AVIF, etc. etc. And reinit is only necessary if the page size changes, so having two APIs is useful.