This is meant to be a performance-centric question as this type of conversion is obviously very common. I'm wondering about the possibilities for making this process faster.
I have a program that creates several thousand QR codes from a list, embeds them in an MS Word docx template, and then converts the docx files to pdf. The problem is that what I've designed is very slow: creating several thousand pdf files takes hours on a local machine.
What can I do to speed this program up? Is there a way to multithread it? (I'm a total newb to that topic.) Or is something about my program design inherently flawed?
Reproducible program below, meant to be run in a local Windows 10 directory:
import pyqrcode
import pandas as pd
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
import glob
import os
from docx2pdf import convert

def make_folder():
    os.mkdir("codes")
    os.mkdir("docs")

def create_qr_code():
    for index, values in df.iterrows():
        data = barcode = values["barcode"]
        image = pyqrcode.create(data)
        image.png("codes\\"+f"{barcode}.png", scale=3)

def embed_qr_code():
    qr_images = glob.glob("codes\\"+"*.png")
    for image in qr_images:
        image_name = os.path.basename(image)
        doc = Document()
        doc.add_picture(image)
        last_paragraph = doc.paragraphs[-1]
        last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
        doc.save("docs\\"+f"{image_name}.docx")
        convert("docs\\"+f"{image_name}.docx")

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    for path in paths:
        newname = path.replace(".png", "")
        if newname != path:
            os.rename(path, newname)

data = {'barcode': ['teconec', 'tegovec', 'teconvec', 'wettrot', 'wetocen']}
df = pd.DataFrame(data)

make_folder()
create_qr_code()
embed_qr_code()
clean_file_names()
Thank you!
1 Answer
I recently learned how easy it is to incorporate multiprocessing and multithreading into a Python program, and would like to share it with you.
Python's built-in multiprocessing module offers a simple way to add multiprocessing to a program. However, since your program performs a lot of writes, I think it may be better to use the multiprocessing.dummy module instead. This module offers the same API as the multiprocessing module, but is used for multithreading instead of multiprocessing, and is thus better suited to IO-intensive programs.
First, import the Pool class from the multiprocessing.dummy module:
from multiprocessing.dummy import Pool as ThreadPool
I aliased it as ThreadPool just for added clarity that we are using multithreading and not multiprocessing. Next, take a look at all the for loops in your code. To add multithreading, we will have to change those for loops to call a function on each iteration instead of performing a series of steps:
def create_qr_code(values):
    data = barcode = values["barcode"]
    image = pyqrcode.create(data)
    image.png("codes\\"+f"{barcode}.png", scale=3)

def create_qr_codes():
    for index, values in df.iterrows():
        create_qr_code(values)

def embed_qr_code(image):
    image_name = os.path.basename(image)
    doc = Document()
    doc.add_picture(image)
    last_paragraph = doc.paragraphs[-1]
    last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
    doc.save("docs\\"+f"{image_name}.docx")
    convert("docs\\"+f"{image_name}.docx")

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    for image in qr_images:
        embed_qr_code(image)

def clean_file_name(path):
    newname = path.replace(".png", "")
    if newname != path:
        os.rename(path, newname)

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    for path in paths:
        clean_file_name(path)
Now for the fun part. In each of the functions that use for loops, we will replace the for loops with ThreadPools used as context managers. We can then call each ThreadPool's map method to perform the actions using multithreading:
def create_qr_codes():
    rows = (values for _, values in df.iterrows())
    with ThreadPool() as pool:
        pool.map(create_qr_code, rows)

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    with ThreadPool() as pool:
        pool.map(embed_qr_code, qr_images)

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    with ThreadPool() as pool:
        pool.map(clean_file_name, paths)
Note that when instantiating a Pool, you can pass a value to it to specify the number of processes (or in our case, threads) to use. According to the Python docs:
If processes is None then the number returned by os.cpu_count() is used.
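For example, if the default turns out to be too aggressive (Word automation in particular may struggle with many simultaneous conversions), you can cap the pool size explicitly. The 4 below is just an illustrative value to tune for your machine:

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    with ThreadPool(4) as pool:  # at most 4 worker threads
        pool.map(embed_qr_code, qr_images)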
Also, don't forget to change these two function calls...
create_qr_code()
embed_qr_code()
...to these:
create_qr_codes()
embed_qr_codes()
This should speed up your program significantly. If not, try using the multiprocessing module instead of multiprocessing.dummy.
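That swap can be as small as changing the import, keeping the ThreadPool alias so the pool-using functions above stay untouched; the sketch below assumes exactly that. One caveat: on Windows, the multiprocessing module re-imports your script in each worker process, so the top-level calls must be moved under an if __name__ == "__main__": guard:

from multiprocessing import Pool as ThreadPool  # process pool, same API as the thread pool

# ... function definitions from above, unchanged ...

if __name__ == "__main__":  # required on Windows: worker processes re-import this module
    make_folder()
    create_qr_codes()
    embed_qr_codes()
    clean_file_names()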
Finally, one extra tip. You may want to refactor this function:
def make_folder():
    os.mkdir("codes")
    os.mkdir("docs")
to use the os.makedirs function instead, to avoid raising an error if the directories already exist:
def make_folder():
    os.makedirs("codes", exist_ok=True)
    os.makedirs("docs", exist_ok=True)
Comment: regarding the docx step, why not go directly to PDF, using something like reportlab?
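For reference, a minimal sketch of that suggestion, assuming reportlab's QrCodeWidget (the helper name qr_to_pdf is just for illustration); it skips both the PNG files and the Word round trip entirely:

from reportlab.graphics import renderPDF
from reportlab.graphics.barcode.qr import QrCodeWidget
from reportlab.graphics.shapes import Drawing

def qr_to_pdf(data, path):
    widget = QrCodeWidget(data)          # the QR code as a vector graphic
    x0, y0, x1, y1 = widget.getBounds()
    drawing = Drawing(x1 - x0, y1 - y0)
    drawing.add(widget)
    renderPDF.drawToFile(drawing, path)  # writes a one-page PDF

qr_to_pdf("teconec", "teconec.pdf")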