I am looking for feedback on the function below, which loads a PNG image stored as bytes in MongoDB into NumPy arrays.
import io

from PIL import Image
import numpy as np


def bytes_to_matricies(image_bytes):
    """Decode PNG image bytes (as read from MongoDB) into greyscale and
    colour NumPy matrices, plus the image dimensions.
    """
    raw_image = Image.open(io.BytesIO(image_bytes))
    greyscale_matrix = np.array(raw_image.convert("L"))
    color_matrix = np.array(raw_image.convert("RGB"))
    n = greyscale_matrix.shape[0]
    m = greyscale_matrix.shape[1]
    return greyscale_matrix, color_matrix, n, m
I have profiled my code with cProfile and found this function to be a big bottleneck. Any suggestions for optimising it would be great. Note that I have compiled most of the project with Cython, which is why you'll see .pyx files; this hasn't made much of a difference.
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
72 331.537 4.605 338.226 4.698 cleaner.pyx:154(clean_image)
1 139.401 139.401 139.401 139.401 {built-in method builtins.input}
356 31.144 0.087 31.144 0.087 {method 'recv_into' of '_socket.socket' objects}
11253 15.421 0.001 15.421 0.001 {method 'encode' of 'ImagingEncoder' objects}
706 10.561 0.015 10.561 0.015 {method 'decode' of 'ImagingDecoder' objects}
72 5.044 0.070 5.047 0.070 {built-in method scipy.ndimage._ni_label._label}
7853 0.881 0.000 0.881 0.000 cleaner.pyx:216(is_period)
72 0.844 0.012 1.266 0.018 cleaner.pyx:349(get_binarized_matrix)
72 0.802 0.011 0.802 0.011 {method 'convert' of 'ImagingCore' objects}
72 0.786 0.011 13.167 0.183 cleaner.pyx:57(bytes_to_matricies)
If you are wondering how the images are encoded before being written to MongoDB, here is that code:
def get_encoded_image(filename: str):
    """Binary encodes image."""
    image = filesystem_io.read_as_pillow(filename)  # Just reads the file on disk into a Pillow Image object
    stream = io.BytesIO()
    image.save(stream, format='PNG')
    encoded_string = stream.getvalue()
    return encoded_string  # This will be written to MongoDB
Things I have tried:
- As mentioned above I tried compiling with Cython
- I have tried to use the lycon library but could not see how to load from bytes.
- I have tried using Pillow SIMD. It made things slower.
- I am able to use multiprocessing, but I want to optimise the function before I parallelise it.
Thank you!
UPDATE: Answers to questions from Reinderien: the images are photographs of documents that will eventually be OCR'd. I'm not sure how lossy compression would affect the OCR quality. DPI is 320; size on disk is ~800 KB each.
1 Answer
matricies is not a word.
A crucial step in your pipeline, and one you have only implied, is the actual blob-loading from MongoDB. Let's assume that you use pymongo. You should be using bson.binary and not some intermediate representation like base64. The binary subtype should probably be byte.
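As a rough sketch (the client setup, the database and collection names, and the payload field name are assumptions rather than anything from the question), writing and reading the PNG bytes as a BSON Binary might look like:

from bson.binary import Binary
from pymongo import MongoClient

client = MongoClient()                        # assumes a locally reachable MongoDB
collection = client["documents"]["images"]    # hypothetical database/collection names

def store_image(filename: str) -> None:
    # Raw PNG bytes go in as a Binary value, with no base64 detour.
    png_bytes = get_encoded_image(filename)
    collection.insert_one({"name": filename, "payload": Binary(png_bytes)})

def load_image_bytes(filename: str) -> bytes:
    # pymongo hands the blob back as bytes, ready for bytes_to_matricies().
    doc = collection.find_one({"name": filename})
    return bytes(doc["payload"])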
To make a reference image, I repeatedly copy-pasted screenshots of your question text into GIMP and exported the result as an RGB PNG.
At your stated 320 DPI, and assuming 8.5"x11", this produces
$ exiftool document.png
ExifTool Version Number : 12.40
File Name : document.png
Directory : .
File Size : 718 KiB
File Modification Date/Time : 2024:11:24 11:16:27-05:00
File Access Date/Time : 2024:11:24 11:17:06-05:00
File Inode Change Date/Time : 2024:11:24 11:16:27-05:00
File Permissions : -rw-rw-r--
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 2720
Image Height : 3520
Bit Depth : 8
Color Type : RGB
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Background Color : 0 0 0
Pixels Per Unit X : 12598
Pixels Per Unit Y : 12598
Pixel Units : meters
Image Size : 2720x3520
Megapixels : 9.6
with a similar size to your ~800 KiB.
Since you care about mode L, the first and most obvious optimisation is to actually use that in your database. Again using the reference image I made, but this time exporting it as an 8-bit greyscale PNG, we get
$ exiftool document.png
ExifTool Version Number : 12.40
File Name : document.png
Directory : .
File Size : 291 KiB
File Modification Date/Time : 2024:11:24 11:25:41-05:00
File Access Date/Time : 2024:11:24 11:17:06-05:00
File Inode Change Date/Time : 2024:11:24 11:25:41-05:00
File Permissions : -rw-rw-r--
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 2720
Image Height : 3520
Bit Depth : 8
Color Type : Grayscale
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Background Color : 0
Pixels Per Unit X : 12598
Pixels Per Unit Y : 12598
Pixel Units : meters
Image Size : 2720x3520
Megapixels : 9.6
This should fully obviate the first call to convert() and takes 59% less space.
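As an illustration, a greyscale variant of the question's encoder (a sketch only, not a drop-in replacement; it reuses the filesystem_io helper from the question and does the conversion once, at write time) might look like:

def get_encoded_greyscale_image(filename: str) -> bytes:
    """Hypothetical variant of get_encoded_image that stores a mode-"L" PNG,
    so the reader no longer needs raw_image.convert("L")."""
    image = filesystem_io.read_as_pillow(filename)
    stream = io.BytesIO()
    image.convert("L").save(stream, format='PNG')   # greyscale once, at write time
    return stream.getvalue()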
I find it strange that your bytes_to_matricies returns both colour and greyscale images. If you really, really need the colour image as well (OCR benefiting from that is dubious) - and if the conversion is the bottleneck - then you can pursue a similar strategy where you save a second copy of the image in RGB8 format. The benefit may be blunted if e.g. there's a network hop to your database or your database hard drive is slow.
Another strategy to benchmark is to remove compression altogether. This trades space for time: the image will take more database space but will hopefully be faster for PIL to load. Try BMP.
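One way to benchmark that idea without touching your write path is to re-encode the stored PNG bytes and time the decode; the helper functions below are hypothetical, purely for illustration:

import io
import time
from PIL import Image

def reencode(png_bytes: bytes, fmt: str) -> bytes:
    """Re-encode existing PNG bytes into another container, e.g. 'BMP'."""
    stream = io.BytesIO()
    Image.open(io.BytesIO(png_bytes)).save(stream, format=fmt)
    return stream.getvalue()

def mean_decode_time(blob: bytes, repeats: int = 20) -> float:
    """Average wall-clock time to fully decode the blob."""
    start = time.perf_counter()
    for _ in range(repeats):
        Image.open(io.BytesIO(blob)).load()   # .load() forces the full decode
    return (time.perf_counter() - start) / repeats

Compare mean_decode_time(png_bytes) against mean_decode_time(reencode(png_bytes, 'BMP')) and weigh any speed-up against the larger blobs in the database.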
raw_image.convert("L")). But I believe that test is itself misleading - it is as if subsequent calls to .convert in Pillow were benefiting from some sort of cache. It's also possible the picture I used does not yield representative results. NumPy might be used for greyscale conversion instead of PIL - see: e2eml.school/convert_rgb_to_grayscale.html
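Should you benchmark that NumPy route, a minimal sketch of the weighted-sum conversion (using the ITU-R 601-2 luma coefficients, which Pillow's "L" conversion also uses) would be:

import numpy as np

def rgb_to_greyscale(color_matrix: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) greyscale array
    via a weighted sum with the ITU-R 601-2 luma coefficients."""
    weights = np.array([0.299, 0.587, 0.114])
    return (color_matrix @ weights).astype(np.uint8)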