Autodidact and inspired by David Beazley code (to improve my skills in Python), I would like to get your feedback on this parser code.
The lazy property lets the code be computed only one time if used (https://github.com/dabeaz/python-cookbook/blob/master/src/8/lazily_computed_attributes/example1.py)
class lazyproperty:
def __init__(self, func):
self.func = func
def __get__(self, instance, cls):
if instance is None:
return self
else:
value = self.func(instance)
setattr(instance, self.func.__name__, value)
return value
class PDFParser():
"""
"""
def __init__(self,filepath,page_num=0):
self.filepath = filepath
try:
self._doc = fitz.open(filepath)
self.page_num = page_num
self._page = self._doc[page_num]
except Exception as e:
print("Lecture PDF impossible. {}".format(e))
@lazyproperty
def text(self):
return self._page.getText()
@lazyproperty
def _pixs(self):
imgs = self._doc.getPageImageList(self.page_num)
pixs =[]
for img in imgs:
xref = img[0]
pix = fitz.Pixmap(self._doc, xref)
pixs.append(pix)
return pixs
@lazyproperty
def _pixpage(self):
pix = self._page.getPixmap(colorspace=fitz.csGRAY)
return pix
@property
def img(self):
return self.imgs[0]
@lazyproperty
def imgs(self):
pixs = self._pixs
imgsarray = []
for pix in pixs:
img = self.pix2np(pix)
imgsarray.append(img)
return imgsarray
def write(self,outputdir,fullpage=False):
filename = os.path.basename(self.filepath).split('.pdf')[0]
def writePNG(pix,filepath):
# This is GRAY or RGB
try:
pix.writePNG(filepath)
# CMYK: convert to RGB first
except:
pix = fitz.Pixmap(fitz.csRGB, pix)
pix.writePNG(filepath)
pix = None
if fullpage:
filepath = os.path.join(outputdir,'{}_p{}.png'.format(filename,self.page_num))
pix = self._pixpage
writePNG(pix,filepath)
return
pixs = self._pixs
for i,pix in enumerate(pixs):
filepath = os.path.join(outputdir,'{}_p{}_i{}.png'.format(filename,self.page_num,i))
writePNG(pix,filepath)
return
@staticmethod
def pix2np(pix):
"""
Convert pixmap to image np.ndarray
https://stackoverflow.com/questions/53059007/python-opencv
param pix: pixmap
"""
import numpy as np
#https://stackoverflow.com/questions/22236749/numpy-what-is-the-difference-between-frombuffer-and-fromstring
im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
try:
im = np.ascontiguousarray(im[..., [2, 1, 0]]) # rgb to bgr
except IndexError:
#Trick to convert Gray rto BGR, (im.reshape)
logger.warning("Shape of image array is {}".format(im.shape))
im = cv2.cvtColor(im,cv2.COLOR_GRAY2RGB)
im = np.ascontiguousarray(im[..., [2, 1, 0]])
return im
2 Answers 2
@cache
Thank you for the "lets the code be computed only one time" context, and for the URL citation.
It's not clear that @lazyproperty does anything for your use case that the familiar and well-tested @cache decorator hasn't already covered. If it is bringing something new to the party, please write a """docstring""" which explains that.
The Beazly examples are for immutable shapes.
You seem to invite the caller to update the public
.page_num
attribute, which can interact with some
of those lazy evaluations.
It's worth documenting the details.
If there's restrictions on what caller should do,
then tell us.
swallowed exception
This is bad, very bad:
try:
self._doc = fitz.open(filepath)
...
except Exception as e:
print("Lecture PDF impossible. {}".format(e))
If you feel that's a better diagnostic, and your user needs to see it, then great! But the next line must be:
raise
The OP code is willing to construct an unusable object in the case of FileNotFound.
context manager
I saw the call to fitz.open(filepath)
,
but I didn't notice a close() call.
Consider turning this class into a with
context manager, so caller will automatically
close the PDF file once it goes out of scope.
docstring
class PDFParser():
"""
"""
Well, that was a good beginning. Now finish it.
bare except
except:
There are good reasons for avoiding this anti-pattern, such as messing up CTRL/C handling.
It's not clear that pix
was ever set and
that the initial write attempt could ever succeed.
pix.writePNG(filepath)
pix = None
If you were hoping to save some RAM there,
it's not clear it actually produced any
interesting effect,
given that pix
will go out of scope
once we return.
explicit return
return
We evaluated write() for side effects.
Prefer to just fall off the end,
rather than explicitly saying return
.
The effect is the same: caller will get a None
value.
In addition to the good advice of the previous answer, consider the following as well.
Naming
Many of the functions and variables do not have descriptive names. For example,
if img
and imgs
mean "image" and "images", then use the actual words
because they are easier to read.
def imgs(self):
If that function gets or creates images, perhaps rename it as:
def get_images(self):
pixels = self._pixs
images = []
for pixel in pixels:
images.append(self.pix2np(pixel))
return images
The above has the words spelled out.
Also, the code is simpler with the temporary img
variable removed.
Documentation
Add a docstring at the top of the file to summarize the purpose of the code. Also add docstrings to describe more of the functions.
Commas
The code has inconsistent comma usage. The black program can be used to automatically add consistent space after commas.
Import
import
lines are conventionally declared at the top of the code,
not inside a class function:
def pix2np(pix):
"""
Convert pixmap to image np.ndarray
https://stackoverflow.com/questions/53059007/python-opencv
param pix: pixmap
"""
import numpy as np