Extracted image from pdf is completely black #1407


I am working on image extraction from a PDF. The library detects the image on the PDF page correctly, but when I save or display it, I get a completely black image.
For reference, I am attaching a file containing the image byte stream extracted from the PDF.
Attachments: image_screenshot_of_pdf, test, byte_stream.txt


Your extraction method cannot handle images that have an image mask.
Your PDF, however, contains 2 images, each with an image mask:

>>> from pprint import pprint
>>> 
>>> pprint(page.get_images(True))
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>> 

The second item of each tuple is the xref of the image's mask. To extract such images, the image and its mask must be combined explicitly, e.g. for the first one (xref 19, mask xref 25):

pix19 = fitz.Pixmap(doc, 19)
mask = fitz.Pixmap(doc, 25)
pix = fitz.Pixmap(pix19, mask)
pix.save("test.png") # fully recovered image
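
The same two-step recipe can be applied to every image on a page. The sketch below is one way to do that, assuming PyMuPDF v1.19 or later, RGB or grayscale images, and that the second item of each `get_images(full=True)` tuple is the mask xref (0 when there is no mask), as in the output above. The `ImageEntry` field names and the helpers `recover_pixmap` and `save_page_images` are illustrative labels, not part of PyMuPDF.

```python
from typing import NamedTuple


class ImageEntry(NamedTuple):
    """One item of page.get_images(full=True); field names are illustrative."""
    xref: int
    smask: int
    width: int
    height: int
    bpc: int
    colorspace: str
    alt_colorspace: str
    name: str
    filter_name: str
    referencer: int


def recover_pixmap(doc, entry):
    """Return the image as a pixmap, with its mask applied if it has one."""
    import fitz  # PyMuPDF; imported here so ImageEntry is usable without it

    base = fitz.Pixmap(doc, entry.xref)
    if entry.smask == 0:  # no mask: the base image is already complete
        return base
    if base.alpha:  # the mask can only be applied to a pixmap without alpha
        base = fitz.Pixmap(base, 0)
    mask = fitz.Pixmap(doc, entry.smask)
    return fitz.Pixmap(base, mask)


def save_page_images(doc, page, prefix="img"):
    """Save every image referenced by the page as <prefix>-<xref>.png."""
    for item in page.get_images(full=True):
        entry = ImageEntry(*item)
        recover_pixmap(doc, entry).save(f"{prefix}-{entry.xref}.png")
```

Calling `save_page_images(doc, doc[0])` on the example file would then be expected to write one PNG per image, each with its transparency restored.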


The attachment doesn't help - please provide the document and the code you used for extraction.


Test.pdf
Code:

import io
import fitz  # PyMuPDF
from PIL import Image

path = 'Test.pdf'
doc = fitz.open(path, filetype="pdf")
page_count = doc.page_count
if page_count:
    for page_no in range(page_count):
        # getText() is the older camelCase name of get_text()
        blocks = doc[page_no].getText('dict')['blocks']
        for ind, block in enumerate(blocks):
            if block['type'] == 1:  # image block
                try:
                    image = Image.open(io.BytesIO(block['image']))
                    image.save(open(f"test.{block['ext']}", "wb"))
                except Exception as e:
                    print(e)

Hey, thank you for the snippet, but I get an error when executing pix = fitz.Pixmap(pix19, mask).

Full code:

>>> import fitz
>>> path = './Test.pdf'
>>> doc = fitz.open(path, filetype='pdf')
>>> from pprint import pprint
>>> for page in doc:
...     pprint(page.get_images(True))
...
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>> pix19 = fitz.Pixmap(doc, 19)
>>> mask = fitz.Pixmap(doc, 25)
>>> pix = fitz.Pixmap(pix19, mask)
Traceback (most recent call last):
 File "<input>", line 1, in <module>
 File "/home/yash/git/virtual_enviroments/test/lib/python3.8/site-packages/fitz/fitz.py", line 6467, in __init__
 _fitz.Pixmap_swiginit(self, _fitz.new_Pixmap(*args))
TypeError: Wrong number or type of arguments for overloaded function 'new_Pixmap'.
 Possible C/C++ prototypes are:
 Pixmap::Pixmap(struct Colorspace *,PyObject *,int)
 Pixmap::Pixmap(struct Colorspace *,struct Pixmap *)
 Pixmap::Pixmap(struct Pixmap *,float,float,PyObject *)
 Pixmap::Pixmap(struct Pixmap *,int)
 Pixmap::Pixmap(struct Colorspace *,int,int,PyObject *,int)
 Pixmap::Pixmap(PyObject *)
 Pixmap::Pixmap(struct Document *,int)

What am I doing wrong?

System Specification:
Ubuntu 20.04.3 LTS
Python 3.8.10
PyMuPDF 1.18.17

Answer selected by YashMistry349

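Sorry, I forgot to mention that you need to upgrade to v1.19.x for this to work. As the prototype list in your traceback shows, v1.18.17 has no constructor that takes two pixmaps.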
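
A script that relies on `fitz.Pixmap(pixmap, mask)` could check the installed version up front. The following is a minimal sketch, assuming the version string is available as `fitz.VersionBind`; the helper names `version_tuple`, `supports_mask_constructor` and `require_mask_support` are hypothetical.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.18.17' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def supports_mask_constructor(version):
    """True if this PyMuPDF version should accept fitz.Pixmap(pixmap, mask)."""
    return version_tuple(version)[:2] >= (1, 19)


def require_mask_support():
    """Raise early if the installed PyMuPDF predates v1.19."""
    import fitz  # PyMuPDF

    # Assumption: fitz.VersionBind holds the version of the Python binding.
    if not supports_mask_constructor(fitz.VersionBind):
        raise RuntimeError(
            f"PyMuPDF {fitz.VersionBind} is too old; upgrade to v1.19 or later"
        )
```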


Hi, I'm facing the same issue: I am trying to extract an image from a PDF file, but the extracted image has a black background.
I tried the code:

page.get_images(True)

which returns, among others, this image:
(1088, 6181, 1010, 485, 8, 'DeviceRGB', '', 'Image158', 'FlateDecode', 0)

Then I copied your example:
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")

But executing fitz.Pixmap(pix1088, mask) raises an error:
RuntimeError: color pixmap must not have an alpha channel

I found that this is because pix1088 contains transparency information: pix1088.alpha = 1 and mask.alpha = 0.

Could you please help me extract the image in this case, so that there is no black background? Thanks.

My environment:
Windows 10
Python 3.9
PyMuPDF 1.19.3


That looks odd. Please let me have your file / page.


test.pdf

The image is on page 4.


Here is my test code:

import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
page = doc.load_page(3)
pprint(page.get_images(True))
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")


Thanks for the file.
I have looked into it: this case is unsupported by (Py-) MuPDF, sorry. If you look at the mask xref, you will find the key /Matte, which means that a special color premultiplication using the entries of this parameter must take place.
This is currently not supported.


The following may give you a somewhat better result:

pix = fitz.Pixmap(doc, 1088)
mask = fitz.Pixmap(doc, 6181)
pix.set_alpha(mask.samples)

The Pixmap.set_alpha() method does the same (or a similar) thing as the approach you used. The difference is that it requires a pixmap with an alpha channel, so it is appropriate in your situation.
The .set_alpha() method is my own creation, so there may be a way to build logic that can cope with masks having a /Matte definition ...


@SummerXXXX - in the meantime I also tested yet another approach:
The only problem in your case is that the base image has an alpha channel. This prevents applying the mask directly.
But if we first remove that alpha channel, the method does work with the modified base image.
So if you do the following, everything works fine:

pix1088 = fitz.Pixmap(doc, 1088)
mask = fitz.Pixmap(doc, 6181)
if pix1088.alpha:
    temp = fitz.Pixmap(pix1088, 0)  # make temp pixmap w/o the alpha
    pix1088 = None  # release storage
    pix1088 = temp
pix = fitz.Pixmap(pix1088, mask)  # now compose final pixmap
pix.save("image1088.png")

This method works with the example file, because all the /Matte (background color) keys have the value [0 0 0], which has zero effect: a normal premultiply will work in this case.
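
To illustrate why a /Matte of [0 0 0] is harmless, here is a small sketch of the arithmetic as I understand the PDF specification's description of preblended soft-mask images: a stored component c' relates to the original component c, the matte m and the mask value alpha by c' = m + alpha * (c - m). All values are floats in the range 0..1, and the function names are hypothetical.

```python
import math


def preblend(color, alpha, matte):
    """Stored component of an image preblended with a matte color."""
    return matte + alpha * (color - matte)


def premultiply(color, alpha):
    """Ordinary premultiplied component."""
    return alpha * color


def unblend(stored, alpha, matte):
    """Recover the original component from a preblended one."""
    if alpha == 0:
        # Fully transparent: the original color cannot be recovered and is
        # invisible anyway, so return the matte as a placeholder.
        return matte
    value = matte + (stored - matte) / alpha
    return min(1.0, max(0.0, value))  # clamp rounding overshoot


# With a black matte, preblending is the same as ordinary premultiplication.
assert math.isclose(preblend(0.8, 0.5, 0.0), premultiply(0.8, 0.5))

# With a white matte the stored value differs and must be unblended first.
stored = preblend(0.8, 0.5, 1.0)
assert not math.isclose(stored, premultiply(0.8, 0.5))
assert math.isclose(unblend(stored, 0.5, 1.0), 0.8)
```

With m = 0 the stored samples are already the premultiplied colors, which would explain why attaching the mask directly gives the right result for this file.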

For the next version, I plan a modification which hopefully supports more of these cases.


Thanks for your help, I will try this.


I am having the same issue with my code: the images extracted from the PDF have a black background, but I need them to look as they do in the PDF. Code used:

from io import BytesIO

import fitz  # PyMuPDF
import numpy as np
from PIL import Image, ImageStat


def extract_and_save(input_pdf_path, output_pdf_path):
    doc = fitz.open(input_pdf_path)
    image_list = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)
        print(f"Page {page_num + 1}: Found {len(images)} images")
        for img_idx, img in enumerate(images):
            try:
                xref = img[0]
                # Get the image XObject dictionary
                img_dict = doc.xref_object(xref, compressed=True)
                pix = fitz.Pixmap(doc, xref)
                # Try to get the soft mask (transparency mask)
                smask_xref = None
                for line in img_dict.splitlines():
                    if "/SMask" in line:
                        # Extract the xref number after /SMask
                        # Example line: "/SMask 123 0 R"
                        parts = line.strip().split()
                        if len(parts) >= 2 and parts[0] == "/SMask":
                            try:
                                smask_xref = int(parts[1])
                                break
                            except ValueError:
                                pass

                if smask_xref:
                    # Extract main image
                    img_pix = fitz.Pixmap(doc, xref)
                    img_np = np.frombuffer(img_pix.samples, dtype=np.uint8)
                    img_np = img_np.reshape((img_pix.height, img_pix.width, img_pix.n))
                    img_pix = None
                    # Extract mask image
                    mask_pix = fitz.Pixmap(doc, smask_xref)
                    mask_np = np.frombuffer(mask_pix.samples, dtype=np.uint8)
                    mask_np = mask_np.reshape((mask_pix.height, mask_pix.width))
                    mask_pix = None
                    # Combine image + alpha mask into RGBA
                    if img_np.shape[2] == 3:
                        rgba_np = np.dstack((img_np, mask_np))
                    else:
                        rgba_np = img_np  # fallback
                    pil_img = Image.fromarray(rgba_np, mode="RGBA")
                else:
                    # Convert CMYK or grayscale to RGB if needed
                    if pix.n >= 4 or pix.alpha or pix.colorspace != fitz.csRGB:
                        pix = fitz.Pixmap(fitz.csRGB, pix)
                    img_bytes = pix.tobytes("png")
                    #pix = None
                    pil_img = Image.open(BytesIO(img_bytes))
                # Flatten alpha if present
                if pil_img.mode in ("RGBA", "LA"):
                    background = Image.new("RGB", pil_img.size, (255, 255, 255))
                    background.paste(pil_img, mask=pil_img.getchannel("A"))  # Use the last channel as alpha mask
                    pil_img = background
                else:
                    pil_img = pil_img.convert("RGB")
                # Skip small images
                if pil_img.width < 300 or pil_img.height < 300:
                    continue
                # Skip near blank images
                stat = ImageStat.Stat(pil_img)
                if max(stat.stddev) < 1.0:
                    continue
                image_list.append(pil_img)
            except Exception as e:
                print(f"Error processing image {img_idx + 1} on page {page_num + 1}: {e}")
    if image_list:
        image_list[0].save(
            output_pdf_path,
            save_all=True,
            append_images=image_list[1:],
            resolution=100.0
        )
        print(f"Saved {len(image_list)} images into '{output_pdf_path}'")
    else:
        print("No valid images found to save.")

I would appreciate an answer as soon as possible.
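
One thing that may be worth checking in the code above: with compressed=True, xref_object() is documented to return a compact definition without line breaks, so the line-by-line test parts[0] == "/SMask" may never match and the mask would silently be skipped. The sketch below looks up the mask reference with a regular expression instead, which should work for both the compact and the multi-line form; `find_smask_xref` is a hypothetical helper name. As the get_images() output earlier in this thread shows, the mask xref is also available directly as the second tuple item (img[1]).

```python
import re

# Matches an indirect reference such as "/SMask 123 0 R".
_SMASK_RE = re.compile(r"/SMask\s+(\d+)\s+\d+\s+R")


def find_smask_xref(object_source):
    """Return the xref of the soft mask named in an object definition, or None."""
    match = _SMASK_RE.search(object_source)
    return int(match.group(1)) if match else None


compact = "<</Type/XObject/Subtype/Image/SMask 25 0 R/Width 419/Height 64>>"
multiline = "<<\n  /Type /XObject\n  /SMask 6181 0 R\n  /Width 1010\n>>"
print(find_smask_xref(compact))
print(find_smask_xref(multiline))
```

Once the mask xref is known, composing the two pixmaps with fitz.Pixmap(pix, mask), as shown earlier in the thread, may be simpler than stacking the arrays by hand.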

This discussion was converted from issue #1406 on November 16, 2021 10:53.
