Extracted image from pdf is completely black #1407
-
I am working on image extraction from a PDF. The library detects the image on the PDF page correctly, but when I save or display it I get a completely black image.
For your reference, I am attaching a file that contains the image byte stream extracted from the PDF.
byte_stream.txt
-
The attachment doesn't help - please provide the document and the code you used for extraction.
-
Test.pdf
Code:
import io
import fitz
from PIL import Image

path = 'Test.pdf'
doc = fitz.open(path, filetype="pdf")
page_count = doc.page_count
if page_count:
    for page_no in range(page_count):
        blocks = doc[page_no].getText('dict')['blocks']
        for ind, block in enumerate(blocks):
            if block['type'] == 1:
                try:
                    image = Image.open(io.BytesIO(block['image']))
                    image.save(open(f"test.{block['ext']}", "wb"))
                except Exception as e:
                    print(e)
-
Your way of extracting images cannot deal with images that have an image mask.
Your PDF, however, has 2 images, each with an image mask:
>>> from pprint import pprint
>>> pprint(page.get_images(True))
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>>
To extract such images, special code is required, e.g. for the first one (xref 19, mask xref 25):
pix19 = fitz.Pixmap(doc, 19)
mask = fitz.Pixmap(doc, 25)
pix = fitz.Pixmap(pix19, mask)
pix.save("test.png")  # fully recovered image
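The same steps can be generalized to any page. What follows is only a sketch: `masked_images` and `recover_image` are hypothetical helper names, not part of PyMuPDF. It assumes that item 1 of each `page.get_images()` entry is the mask xref (0 when there is no mask), as in the output above, and that the installed PyMuPDF supports the `fitz.Pixmap(pixmap, mask)` constructor (v1.19 or later).

```python
def masked_images(image_list):
    """Return (xref, smask) pairs for entries that reference a mask.

    Each entry is a tuple as returned by page.get_images(): item 0 is the
    image xref, item 1 the xref of its mask (0 means "no mask").
    """
    return [(item[0], item[1]) for item in image_list if item[1] > 0]


def recover_image(doc, xref, smask):
    """Combine a base image with its mask into one pixmap (sketch)."""
    import fitz  # imported lazily so masked_images() works without PyMuPDF

    base = fitz.Pixmap(doc, xref)
    if smask == 0:
        return base  # nothing to combine
    if base.alpha:
        # The mask constructor rejects a base image that already has alpha,
        # so make a copy without the alpha channel first.
        base = fitz.Pixmap(base, 0)
    mask = fitz.Pixmap(doc, smask)
    return fitz.Pixmap(base, mask)
```

Possible usage: `for xref, smask in masked_images(page.get_images(True)): recover_image(doc, xref, smask).save(f"img{xref}.png")`.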
-
Hey, thank you for the code snippet, but I am getting an error at this line: pix = fitz.Pixmap(pix19, mask)
Full code:
>>> import fitz
>>> path = './Test.pdf'
>>> doc = fitz.open(path, filetype='pdf')
>>> from pprint import pprint
>>> for page in doc:
...     pprint(page.get_images(True))
...
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>> pix19 = fitz.Pixmap(doc, 19)
>>> mask = fitz.Pixmap(doc, 25)
>>> pix = fitz.Pixmap(pix19, mask)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/yash/git/virtual_enviroments/test/lib/python3.8/site-packages/fitz/fitz.py", line 6467, in __init__
    _fitz.Pixmap_swiginit(self, _fitz.new_Pixmap(*args))
TypeError: Wrong number or type of arguments for overloaded function 'new_Pixmap'.
  Possible C/C++ prototypes are:
    Pixmap::Pixmap(struct Colorspace *,PyObject *,int)
    Pixmap::Pixmap(struct Colorspace *,struct Pixmap *)
    Pixmap::Pixmap(struct Pixmap *,float,float,PyObject *)
    Pixmap::Pixmap(struct Pixmap *,int)
    Pixmap::Pixmap(struct Colorspace *,int,int,PyObject *,int)
    Pixmap::Pixmap(PyObject *)
    Pixmap::Pixmap(struct Document *,int)
What am I doing wrong?
System Specification:
Ubuntu 20.04.3 LTS
Python 3.8.10
PyMuPDF 1.18.17
-
Sorry, I forgot to mention that you need to upgrade to v1.19.x for this to work.
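A quick way to check the version before relying on the mask constructor is sketched below. `supports_mask_constructor` is a hypothetical helper, not a PyMuPDF function; it only compares the first two components of a version string such as "1.18.17".

```python
def supports_mask_constructor(version: str) -> bool:
    """Return True if a PyMuPDF version string is 1.19 or newer."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 19)
```

For example, it could be called as `supports_mask_constructor(fitz.VersionBind)`, assuming `fitz.VersionBind` holds the version string of the installed package.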
-
Hi, I'm facing the same issue: I am trying to extract an image from a PDF file, but the extracted image has a black background.
I tried this code:
page.get_images(True)
One of the images it returns is:
[(1088, 6181, 1010, 485, 8, 'DeviceRGB', '', 'Image158', 'FlateDecode', 0)
Then I copied your example:
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")
But executing fitz.Pixmap(pix1088, mask) raises an error:
RuntimeError: color pixmap must not have an alpha channel
I found that this is because pix1088 contains transparency information: pix1088.alpha = 1 and mask.alpha = 0.
Could you please help me extract the image in this case so that there is no black background? Thanks.
My environment:
Windows 10
Python 3.9
PyMuPDF 1.19.3
-
That looks odd. Please let me have your file and the page number.
-
The image is on page 4.
-
Here is my test code:
import fitz
from pprint import pprint
doc = fitz.open("test.pdf")
page = doc.load_page(3)
pprint(page.get_images(True))
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")
-
Thanks for the file.
I have looked into it: this case is unsupported by (Py-) MuPDF, sorry. If you look at the mask xref, you will find the key /Matte, which means that a special color premultiplication must take place with the entries of this parameter.
This does not work currently.
-
The following may give you a somewhat better result:
pix = fitz.Pixmap(doc, 1088)
mask = fitz.Pixmap(doc, 6181)
pix.set_alpha(mask.samples)
-
The Pixmap.set_alpha() method does the same (or similar) thing as the approach that you used. The difference is that it requires a pixmap with an alpha channel, so it is appropriate in your situation.
Method .set_alpha() is my own making, so there may be a way to build logic that can cope with masks having a /Matte definition ...
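Conceptually, replacing the alpha channel of RGBA sample data means overwriting every fourth byte with the corresponding mask value. The pure-Python sketch below illustrates only that idea; `replace_alpha` is a hypothetical helper, not PyMuPDF code, and it does not model any premultiplication that the real method may perform.

```python
def replace_alpha(rgba_samples: bytes, alpha_values: bytes) -> bytes:
    """Return a copy of RGBA samples with the alpha bytes replaced.

    rgba_samples holds 4 bytes per pixel (R, G, B, A); alpha_values holds
    one byte per pixel, e.g. the samples of a grayscale mask.
    """
    if len(rgba_samples) != 4 * len(alpha_values):
        raise ValueError("need exactly one alpha value per RGBA pixel")
    out = bytearray(rgba_samples)
    out[3::4] = alpha_values  # every 4th byte is the alpha component
    return bytes(out)
```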
-
@SummerXXXX - in the meantime I also tested yet another approach:
The only problem in your case is that the base image has an alpha channel. This prevents applying the mask directly.
But if we first remove that alpha channel, the method does work with the modified base image.
So if you do the following, everything works fine:
pix1088 = fitz.Pixmap(doc, 1088)
mask = fitz.Pixmap(doc, 6181)
if pix1088.alpha:
    temp = fitz.Pixmap(pix1088, 0)  # make temp pixmap w/o the alpha
    pix1088 = None  # release storage
    pix1088 = temp
pix = fitz.Pixmap(pix1088, mask)  # now compose final pixmap
pix.save("image1088.png")
This method works with the example file because all the /Matte (background color) keys have the value [0 0 0], which has zero effect: a normal premultiply will work in this case.
For the next version, I plan a modification which hopefully supports more of these cases.
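To illustrate why a /Matte of [0 0 0] is harmless: as far as I read the PDF specification, a matted image stores each color component preblended as c' = m + α·(c − m), where m is the matte component and α the mask value. With m = 0 this reduces to the ordinary premultiplied form c' = α·c. The sketch below inverts that formula for one 8-bit component; `unmatte` is a hypothetical helper, not something PyMuPDF provides.

```python
def unmatte(c_pre: int, alpha: int, matte: int = 0) -> int:
    """Recover an 8-bit color component from its matte-preblended value.

    Inverts c' = m + a * (c - m), with a = alpha / 255.
    """
    if alpha == 0:
        # Fully transparent pixel: the original color cannot be recovered
        # (and is invisible anyway), so return the stored value unchanged.
        return c_pre
    c = matte + (c_pre - matte) * 255 / alpha
    return max(0, min(255, round(c)))  # clamp to the 8-bit range
```

With `matte=0` this is a plain un-premultiply; a non-zero matte shifts the result, which is the case the thread describes as unsupported.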
-
Thanks for your help, I will try this.
-
I am also having the same issue with my code: the images extracted from the PDF have a black background, but I need them to look as they do in the PDF. Code used:
from io import BytesIO

import fitz
import numpy as np
from PIL import Image, ImageStat


def extract_and_save(input_pdf_path, output_pdf_path):
    doc = fitz.open(input_pdf_path)
    image_list = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)
        print(f"Page {page_num + 1}: Found {len(images)} images")
        for img_idx, img in enumerate(images):
            try:
                xref = img[0]
                # Get the image XObject dictionary
                img_dict = doc.xref_object(xref, compressed=True)
                pix = fitz.Pixmap(doc, xref)
                # Try to get the soft mask (transparency mask)
                smask_xref = None
                for line in img_dict.splitlines():
                    if "/SMask" in line:
                        # Extract the xref number after /SMask
                        # Example line: "/SMask 123 0 R"
                        parts = line.strip().split()
                        if len(parts) >= 2 and parts[0] == "/SMask":
                            try:
                                smask_xref = int(parts[1])
                                break
                            except:
                                pass
                if smask_xref:
                    # Extract main image
                    img_pix = fitz.Pixmap(doc, xref)
                    img_np = np.frombuffer(img_pix.samples, dtype=np.uint8)
                    img_np = img_np.reshape((img_pix.height, img_pix.width, img_pix.n))
                    img_pix = None
                    # Extract mask image
                    mask_pix = fitz.Pixmap(doc, smask_xref)
                    mask_np = np.frombuffer(mask_pix.samples, dtype=np.uint8)
                    mask_np = mask_np.reshape((mask_pix.height, mask_pix.width))
                    mask_pix = None
                    # Combine image + alpha mask into RGBA
                    if img_np.shape[2] == 3:
                        rgba_np = np.dstack((img_np, mask_np))
                    else:
                        rgba_np = img_np  # fallback
                    pil_img = Image.fromarray(rgba_np, mode="RGBA")
                else:
                    # Convert CMYK or grayscale to RGB if needed
                    if pix.n >= 4 or pix.alpha or pix.colorspace != fitz.csRGB:
                        pix = fitz.Pixmap(fitz.csRGB, pix)
                    img_bytes = pix.tobytes("png")
                    #pix = None
                    pil_img = Image.open(BytesIO(img_bytes))
                # Flatten alpha if present
                if pil_img.mode in ("RGBA", "LA"):
                    background = Image.new("RGB", pil_img.size, (255, 255, 255))
                    background.paste(pil_img, mask=pil_img.getchannel("A"))  # Use the last channel as alpha mask
                    pil_img = background
                else:
                    pil_img = pil_img.convert("RGB")
                # Skip small images
                if pil_img.width < 300 or pil_img.height < 300:
                    continue
                # Skip near blank images
                stat = ImageStat.Stat(pil_img)
                if max(stat.stddev) < 1.0:
                    continue
                image_list.append(pil_img)
            except Exception as e:
                print(f"Error processing image {img_idx + 1} on page {page_num + 1}: {e}")
    if image_list:
        image_list[0].save(
            output_pdf_path,
            save_all=True,
            append_images=image_list[1:],
            resolution=100.0
        )
        print(f"Saved {len(image_list)} images into '{output_pdf_path}'")
    else:
        print("No valid images found to save.")
I would appreciate a quick answer.
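One thing that may be worth checking, offered as an assumption rather than a diagnosis: the line-based parse above only finds the mask when /SMask starts a line of its own. If the object source arrives on a single line, `smask_xref` stays None and the mask is never applied. A regular expression is less sensitive to layout. `find_smask_xref` below is a hypothetical helper; alternatively, item 1 of each `page.get_images()` entry appears to carry the mask xref already, as in the (19, 25, ...) output earlier in this thread.

```python
import re

# Matches e.g. "/SMask 25 0 R" anywhere in an object definition.
_SMASK_RE = re.compile(r"/SMask\s+(\d+)\s+\d+\s+R")


def find_smask_xref(obj_source: str):
    """Return the xref number referenced by /SMask, or None if absent."""
    match = _SMASK_RE.search(obj_source)
    return int(match.group(1)) if match else None
```

It could replace the splitlines loop, e.g. `smask_xref = find_smask_xref(img_dict)`.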