This is meant to be a performance-centric question as this type of conversion is obviously very common. I'm wondering about the possibilities for making this process faster.
I have a program that creates several thousand QR codes from a list, embeds them in an MS Word docx template, and then converts the docx files to pdf. The problem is that what I've designed is very slow: creating several thousand pdf files takes hours on a local machine.
What can I do to speed this program up? Is there a way to multithread it? (I'm a total newb to that topic.) Or is something about my program design inherently flawed?
Reproducible program below, meant to be run in a local Windows 10 directory:
import pyqrcode
import pandas as pd
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
import glob
import os
from docx2pdf import convert

def make_folder():
    os.mkdir("codes")
    os.mkdir("docs")

def create_qr_code():
    for index, values in df.iterrows():
        data = barcode = values["barcode"]
        image = pyqrcode.create(data)
        image.png("codes\\"+f"{barcode}.png", scale=3)

def embed_qr_code():
    qr_images = glob.glob("codes\\"+"*.png")
    for image in qr_images:
        image_name = os.path.basename(image)
        doc = Document()
        doc.add_picture(image)
        last_paragraph = doc.paragraphs[-1]
        last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
        doc.save("docs\\"+f"{image_name}.docx")
        convert("docs\\"+f"{image_name}.docx")

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    for path in paths:
        newname = path.replace(".png", "")
        if newname != path:
            os.rename(path, newname)

data = {'barcode': ['teconec', 'tegovec', 'teconvec', 'wettrot', 'wetocen']}
df = pd.DataFrame(data)

make_folder()
create_qr_code()
embed_qr_code()
clean_file_names()
Thank you!
1 Answer
I recently learned how easy it is to incorporate multiprocessing and multithreading into a Python program, and would like to share it with you.
Python's built-in multiprocessing module offers a simple way to add multiprocessing to a program. However, since your program performs a lot of writes, I think it may be better to use the multiprocessing.dummy module instead. This module offers the same API as the multiprocessing module, but is used for multithreading instead of multiprocessing, and is thus better suited to IO-intensive programs.
First, import the Pool class from the multiprocessing.dummy module:
from multiprocessing.dummy import Pool as ThreadPool
I aliased it as ThreadPool just for added clarity that we are using multithreading and not multiprocessing. Next, take a look at all the for loops in your code. To add multithreading, we will have to change those for loops to call a function on each iteration instead of performing a series of steps:
def create_qr_code(values):
    data = barcode = values["barcode"]
    image = pyqrcode.create(data)
    image.png("codes\\"+f"{barcode}.png", scale=3)

def create_qr_codes():
    for index, values in df.iterrows():
        create_qr_code(values)

def embed_qr_code(image):
    image_name = os.path.basename(image)
    doc = Document()
    doc.add_picture(image)
    last_paragraph = doc.paragraphs[-1]
    last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
    doc.save("docs\\"+f"{image_name}.docx")
    convert("docs\\"+f"{image_name}.docx")

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    for image in qr_images:
        embed_qr_code(image)

def clean_file_name(path):
    newname = path.replace(".png", "")
    if newname != path:
        os.rename(path, newname)

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    for path in paths:
        clean_file_name(path)
Now for the fun part. In each of the functions that use for loops, we will replace the for loops with ThreadPools used as context managers. We can then call each ThreadPool's map method to perform the actions using multithreading:
def create_qr_codes():
    rows = (values for _, values in df.iterrows())
    with ThreadPool() as pool:
        pool.map(create_qr_code, rows)

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    with ThreadPool() as pool:
        pool.map(embed_qr_code, qr_images)

def clean_file_names():
    paths = (os.path.join(root, filename)
             for root, _, filenames in os.walk("docs\\")
             for filename in filenames)
    with ThreadPool() as pool:
        pool.map(clean_file_name, paths)
Note that when instantiating a Pool, you can pass a value to it to specify the number of processes (or in our case, threads) to use. According to the Python docs:
If processes is None then the number returned by os.cpu_count() is used.
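For example, if the default turns out to be too aggressive (Word automation in particular may struggle with many simultaneous conversions), you can cap the pool size explicitly. The 4 below is just an illustrative value to tune for your machine:

def embed_qr_codes():
    qr_images = glob.glob("codes\\"+"*.png")
    with ThreadPool(4) as pool:  # at most 4 worker threads
        pool.map(embed_qr_code, qr_images)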
Also, don't forget to change these two function calls...
create_qr_code()
embed_qr_code()
...to these:
create_qr_codes()
embed_qr_codes()
This should speed up your program significantly. If not, try using the multiprocessing module instead of multiprocessing.dummy.
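That swap can be as small as changing the import, keeping the ThreadPool alias so the pool-using functions above stay untouched; the sketch below assumes exactly that. One caveat: on Windows, the multiprocessing module re-imports your script in each worker process, so the top-level calls must be moved under an if __name__ == "__main__": guard:

from multiprocessing import Pool as ThreadPool  # process pool, same API as the thread pool

# ... function definitions from above, unchanged ...

if __name__ == "__main__":  # required on Windows: worker processes re-import this module
    make_folder()
    create_qr_codes()
    embed_qr_codes()
    clean_file_names()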
Finally, one extra tip. You may want to refactor this function:
def make_folder():
    os.mkdir("codes")
    os.mkdir("docs")
to use the os.makedirs function instead, to avoid raising an error if the directories already exist:
def make_folder():
    os.makedirs("codes", exist_ok=True)
    os.makedirs("docs", exist_ok=True)
Comment: regarding the docx step, why not go directly to PDF, using something like reportlab?
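For reference, a minimal sketch of that suggestion, assuming reportlab's QrCodeWidget (the helper name qr_to_pdf is just for illustration); it skips both the PNG files and the Word round trip entirely:

from reportlab.graphics import renderPDF
from reportlab.graphics.barcode.qr import QrCodeWidget
from reportlab.graphics.shapes import Drawing

def qr_to_pdf(data, path):
    widget = QrCodeWidget(data)          # the QR code as a vector graphic
    x0, y0, x1, y1 = widget.getBounds()
    drawing = Drawing(x1 - x0, y1 - y0)
    drawing.add(widget)
    renderPDF.drawToFile(drawing, path)  # writes a one-page PDF

qr_to_pdf("teconec", "teconec.pdf")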