I am looking for feedback on the function below, which loads a PNG image stored as bytes in MongoDB into NumPy arrays.
import io

from PIL import Image
import numpy as np


def bytes_to_matricies(image_bytes):
    """Decode PNG image bytes (as read from MongoDB) into greyscale and
    colour NumPy matrices, plus the image dimensions.
    """
    raw_image = Image.open(io.BytesIO(image_bytes))
    greyscale_matrix = np.array(raw_image.convert("L"))
    color_matrix = np.array(raw_image.convert("RGB"))
    n = greyscale_matrix.shape[0]
    m = greyscale_matrix.shape[1]
    return greyscale_matrix, color_matrix, n, m
I have profiled my code with cProfile and found this function to be a big bottleneck. Any suggestions for optimising it would be great. Note that I have compiled most of the project with Cython, which is why you'll see .pyx files; this hasn't made much of a difference.
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
72 331.537 4.605 338.226 4.698 cleaner.pyx:154(clean_image)
1 139.401 139.401 139.401 139.401 {built-in method builtins.input}
356 31.144 0.087 31.144 0.087 {method 'recv_into' of '_socket.socket' objects}
11253 15.421 0.001 15.421 0.001 {method 'encode' of 'ImagingEncoder' objects}
706 10.561 0.015 10.561 0.015 {method 'decode' of 'ImagingDecoder' objects}
72 5.044 0.070 5.047 0.070 {built-in method scipy.ndimage._ni_label._label}
7853 0.881 0.000 0.881 0.000 cleaner.pyx:216(is_period)
72 0.844 0.012 1.266 0.018 cleaner.pyx:349(get_binarized_matrix)
72 0.802 0.011 0.802 0.011 {method 'convert' of 'ImagingCore' objects}
72 0.786 0.011 13.167 0.183 cleaner.pyx:57(bytes_to_matricies)
If you are wondering how the images are encoded before being written to MongoDB, here is that code:
def get_encoded_image(filename: str):
    """Binary encodes image."""
    image = filesystem_io.read_as_pillow(filename)  # Just reads the file on disk into a Pillow Image object
    stream = io.BytesIO()
    image.save(stream, format='PNG')
    encoded_string = stream.getvalue()
    return encoded_string  # This will be written to MongoDB
Things I have tried:
- As mentioned above I tried compiling with Cython
- I have tried to use the lycon library but could not see how to load from bytes.
- I have tried using Pillow SIMD. It made things slower.
- I am able to use multiprocessing, but I want to optimise the function before I parallelise it.
Thank you!
UPDATE: Answers to questions from Reinderien: the images are photographs of documents that will eventually be OCR'd. I'm not sure how lossy compression would affect the OCR quality. DPI is 320; size on disk is ~800 KB each.
1 Answer
matricies is not a word.
A crucial step in your pipeline, and one you have only implied, is the actual blob-loading from MongoDB. Let's assume that you use pymongo. You should be using bson.binary and not some intermediate representation like base64. The binary subtype should probably be byte.
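As a rough sketch (the client setup, the database and collection names, and the payload field name are assumptions rather than anything from the question), writing and reading the PNG bytes as a BSON Binary might look like:

from bson.binary import Binary
from pymongo import MongoClient

client = MongoClient()                        # assumes a locally reachable MongoDB
collection = client["documents"]["images"]    # hypothetical database/collection names

def store_image(filename: str) -> None:
    # Raw PNG bytes go in as a Binary value, with no base64 detour.
    png_bytes = get_encoded_image(filename)
    collection.insert_one({"name": filename, "payload": Binary(png_bytes)})

def load_image_bytes(filename: str) -> bytes:
    # pymongo hands the blob back as bytes, ready for bytes_to_matricies().
    doc = collection.find_one({"name": filename})
    return bytes(doc["payload"])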
To make a reference image, I repeatedly copy-pasted screenshots of your question text into GIMP and exported the result as an RGB PNG.
At your stated 320 DPI, and assuming 8.5"x11", this produces
$ exiftool document.png
ExifTool Version Number : 12.40
File Name : document.png
Directory : .
File Size : 718 KiB
File Modification Date/Time : 2024:11:24 11:16:27-05:00
File Access Date/Time : 2024:11:24 11:17:06-05:00
File Inode Change Date/Time : 2024:11:24 11:16:27-05:00
File Permissions : -rw-rw-r--
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 2720
Image Height : 3520
Bit Depth : 8
Color Type : RGB
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Background Color : 0 0 0
Pixels Per Unit X : 12598
Pixels Per Unit Y : 12598
Pixel Units : meters
Image Size : 2720x3520
Megapixels : 9.6
with a similar size to your ~800 KiB.
Since you care about mode L, the first and most obvious optimisation is to actually use that in your database. Again using the reference image I made, but this time exporting it as an 8-bit greyscale PNG, we get
$ exiftool document.png
ExifTool Version Number : 12.40
File Name : document.png
Directory : .
File Size : 291 KiB
File Modification Date/Time : 2024:11:24 11:25:41-05:00
File Access Date/Time : 2024:11:24 11:17:06-05:00
File Inode Change Date/Time : 2024:11:24 11:25:41-05:00
File Permissions : -rw-rw-r--
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 2720
Image Height : 3520
Bit Depth : 8
Color Type : Grayscale
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Background Color : 0
Pixels Per Unit X : 12598
Pixels Per Unit Y : 12598
Pixel Units : meters
Image Size : 2720x3520
Megapixels : 9.6
This should fully obviate the first call to convert() and takes 59% less space.
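As an illustration, a greyscale variant of the question's encoder (a sketch only, not a drop-in replacement; it reuses the filesystem_io helper from the question and does the conversion once, at write time) might look like:

def get_encoded_greyscale_image(filename: str) -> bytes:
    """Hypothetical variant of get_encoded_image that stores a mode-"L" PNG,
    so the reader no longer needs raw_image.convert("L")."""
    image = filesystem_io.read_as_pillow(filename)
    stream = io.BytesIO()
    image.convert("L").save(stream, format='PNG')   # greyscale once, at write time
    return stream.getvalue()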
I find it strange that your bytes_to_matricies returns both colour and greyscale images. If you really, really need the colour image as well (OCR benefiting from that is dubious) - and if the conversion is the bottleneck - then you can pursue a similar strategy where you save a second copy of the image in RGB8 format. The benefit may be blunted if e.g. there's a network hop to your database or your database hard drive is slow.
Another strategy to benchmark is to remove compression altogether. This trades space for time: the image will take more database space but will hopefully be faster for PIL to load. Try BMP.
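One way to benchmark that idea without touching your write path is to re-encode the stored PNG bytes and time the decode; the helper functions below are hypothetical, purely for illustration:

import io
import time
from PIL import Image

def reencode(png_bytes: bytes, fmt: str) -> bytes:
    """Re-encode existing PNG bytes into another container, e.g. 'BMP'."""
    stream = io.BytesIO()
    Image.open(io.BytesIO(png_bytes)).save(stream, format=fmt)
    return stream.getvalue()

def mean_decode_time(blob: bytes, repeats: int = 20) -> float:
    """Average wall-clock time to fully decode the blob."""
    start = time.perf_counter()
    for _ in range(repeats):
        Image.open(io.BytesIO(blob)).load()   # .load() forces the full decode
    return (time.perf_counter() - start) / repeats

Compare mean_decode_time(png_bytes) against mean_decode_time(reencode(png_bytes, 'BMP')) and weigh any speed-up against the larger blobs in the database.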
raw_image.convert("L")). But I believe that test is itself misleading - it is as if subsequent calls to .convert in Pillow were benefiting from some sort of cache. It's also possible the picture I used does not yield representative results. NumPy might be used for greyscale conversion instead of PIL - see: e2eml.school/convert_rgb_to_grayscale.html
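Should you benchmark that NumPy route, a minimal sketch of the weighted-sum conversion (using the ITU-R 601-2 luma coefficients, which Pillow's "L" conversion also uses) would be:

import numpy as np

def rgb_to_greyscale(color_matrix: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) greyscale array
    via a weighted sum with the ITU-R 601-2 luma coefficients."""
    weights = np.array([0.299, 0.587, 0.114])
    return (color_matrix @ weights).astype(np.uint8)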