2
\$\begingroup\$

Autodidact and inspired by David Beazley code (to improve my skills in Python), I would like to get your feedback on this parser code.

The lazy property lets the code be computed only one time if used (https://github.com/dabeaz/python-cookbook/blob/master/src/8/lazily_computed_attributes/example1.py)


class lazyproperty:
 def __init__(self, func):
 self.func = func
 def __get__(self, instance, cls):
 if instance is None:
 return self
 else:
 value = self.func(instance)
 setattr(instance, self.func.__name__, value)
 return value
class PDFParser():
 """
 """
 def __init__(self,filepath,page_num=0):
 self.filepath = filepath
 try:
 self._doc = fitz.open(filepath)
 self.page_num = page_num
 self._page = self._doc[page_num]
 except Exception as e: 
 print("Lecture PDF impossible. {}".format(e))
 @lazyproperty
 def text(self):
 return self._page.getText()
 @lazyproperty
 def _pixs(self):
 imgs = self._doc.getPageImageList(self.page_num)
 pixs =[]
 for img in imgs:
 xref = img[0]
 pix = fitz.Pixmap(self._doc, xref)
 pixs.append(pix)
 return pixs
 @lazyproperty
 def _pixpage(self):
 pix = self._page.getPixmap(colorspace=fitz.csGRAY)
 return pix
 
 @property 
 def img(self):
 return self.imgs[0]
 @lazyproperty
 def imgs(self):
 pixs = self._pixs
 imgsarray = []
 for pix in pixs:
 img = self.pix2np(pix)
 imgsarray.append(img)
 return imgsarray
 
 def write(self,outputdir,fullpage=False):
 filename = os.path.basename(self.filepath).split('.pdf')[0]
 def writePNG(pix,filepath):
 # This is GRAY or RGB
 try: 
 pix.writePNG(filepath)
 # CMYK: convert to RGB first
 except: 
 pix = fitz.Pixmap(fitz.csRGB, pix)
 pix.writePNG(filepath)
 pix = None
 if fullpage:
 filepath = os.path.join(outputdir,'{}_p{}.png'.format(filename,self.page_num))
 pix = self._pixpage
 writePNG(pix,filepath)
 return
 pixs = self._pixs
 for i,pix in enumerate(pixs):
 filepath = os.path.join(outputdir,'{}_p{}_i{}.png'.format(filename,self.page_num,i))
 writePNG(pix,filepath)
 return
 @staticmethod
 def pix2np(pix):
 """
 Convert pixmap to image np.ndarray
 https://stackoverflow.com/questions/53059007/python-opencv
 param pix: pixmap
 """
 import numpy as np
 #https://stackoverflow.com/questions/22236749/numpy-what-is-the-difference-between-frombuffer-and-fromstring
 im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
 try:
 im = np.ascontiguousarray(im[..., [2, 1, 0]]) # rgb to bgr
 except IndexError:
 #Trick to convert Gray rto BGR, (im.reshape)
 logger.warning("Shape of image array is {}".format(im.shape)) 
 im = cv2.cvtColor(im,cv2.COLOR_GRAY2RGB)
 im = np.ascontiguousarray(im[..., [2, 1, 0]])
 return im
toolic
14.6k5 gold badges29 silver badges204 bronze badges
asked Oct 24, 2020 at 16:12
\$\endgroup\$
0

2 Answers 2

3
\$\begingroup\$

@cache

Thank you for the "lets the code be computed only one time" context, and for the URL citation.

It's not clear that @lazyproperty does anything for your use case that the familiar and well-tested @cache decorator hasn't already covered. If it is bringing something new to the party, please write a """docstring""" which explains that.

The Beazly examples are for immutable shapes. You seem to invite the caller to update the public .page_num attribute, which can interact with some of those lazy evaluations. It's worth documenting the details. If there's restrictions on what caller should do, then tell us.

swallowed exception

This is bad, very bad:

 try:
 self._doc = fitz.open(filepath)
 ...
 except Exception as e: 
 print("Lecture PDF impossible. {}".format(e))

If you feel that's a better diagnostic, and your user needs to see it, then great! But the next line must be:

 raise

The OP code is willing to construct an unusable object in the case of FileNotFound.

context manager

I saw the call to fitz.open(filepath), but I didn't notice a close() call. Consider turning this class into a with context manager, so caller will automatically close the PDF file once it goes out of scope.

docstring

class PDFParser():
 """
 """

Well, that was a good beginning. Now finish it.

bare except

 except: 

There are good reasons for avoiding this anti-pattern, such as messing up CTRL/C handling.

It's not clear that pix was ever set and that the initial write attempt could ever succeed.

 pix.writePNG(filepath)
 pix = None

If you were hoping to save some RAM there, it's not clear it actually produced any interesting effect, given that pix will go out of scope once we return.

explicit return

 return

We evaluated write() for side effects. Prefer to just fall off the end, rather than explicitly saying return.

The effect is the same: caller will get a None value.

answered Oct 22, 2024 at 23:34
\$\endgroup\$
2
\$\begingroup\$

In addition to the good advice of the previous answer, consider the following as well.

Naming

Many of the functions and variables do not have descriptive names. For example, if img and imgs mean "image" and "images", then use the actual words because they are easier to read.

def imgs(self):

If that function gets or creates images, perhaps rename it as:

def get_images(self):
 pixels = self._pixs
 images = []
 for pixel in pixels:
 images.append(self.pix2np(pixel))
 return images

The above has the words spelled out.

Also, the code is simpler with the temporary img variable removed.

Documentation

Add a docstring at the top of the file to summarize the purpose of the code. Also add docstrings to describe more of the functions.

Commas

The code has inconsistent comma usage. The black program can be used to automatically add consistent space after commas.

Import

import lines are conventionally declared at the top of the code, not inside a class function:

def pix2np(pix):
 """
 Convert pixmap to image np.ndarray
 https://stackoverflow.com/questions/53059007/python-opencv
 param pix: pixmap
 """
 import numpy as np
answered Oct 23, 2024 at 11:00
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.