I am trying to convert a 111 MB text file to PDF quickly. I'm currently using the FPDF library, and large files take about 40 minutes to process. The text file is an IBM carriage control (ANSI) file that contains these characters (link below). This file in particular repeats these ANSI control characters far more often than any other file I've seen. I am also looking for specific text that repeats on every page break and removing it. With that said, I'm looking to get a second opinion on possibly a more powerful library that will allow me to quickly convert these text files to PDF.
I've optimized my code as much as possible to speed things up, but big files still take too long. I added ThreadPoolExecutor for parallel processing to handle multiple files at once, and I'm reading files in 20 MB chunks instead of all at once to reduce memory usage. I also precompiled the regex to make text replacements faster and streamlined the control-character handling so it happens in a single pass. To further improve performance, I introduced batch processing, setting a batch size of 700 to manage workload distribution more efficiently. Smaller files process instantly, but large ones are still a bottleneck.
Text Sample:
1 Send Inquiries To:
-
ACCOUNT NUMBER: 123456789
0 YTD DIV RECEIVED: 789.01
0 PAGE NUMBER: 1 of 1
1
0 www.spoonmuseum.com
-
HARRY POTTINGTON
PROFESSIONAL JUGGLER
456 PINEAPPLE RD
UNICORN CITY, UC 98765
-
Visit our "Spoon Museum" to see the world’s largest collection of spoons!
Get 10% off when you mention the phrase "I love spoons!"
0 SUMMARY OF YOUR ACCOUNTS
0 ______________________________________________________________________________________________________________________
| | | |
| SUFFIX 007 BANANA FUND | | |
| JOINT: HARRY POTTINGTON | | |
| STATEMENT PERIOD 01/15/25 - 01/15/25 | | |
| BEGINNING BALANCE 3,000.00 | | |
| DEPOSITS 100.00 | | |
| WITHDRAWALS 50.00 | | |
| BANANAS CLEARED 0.00 | | |
| ENDING BALANCE 3,050.00 | | |
| | | |
| PIE YEAR-TO-DATE 20.00 | | |
| PIE THIS PERIOD 5.00 | | |
| | | |
______________________________________________________________________________________________________________________
0 SUFFIX 007 BANANA FUND
______________________________________________________________________________________________________________________
0 DEPOSITS
--------
DATE DESCRIPTION TRANSACTION AMOUNT LOCATION
---- ----------- ------------------ --------
01/15/25 DEPOSIT FROM GIGANTIC PIZZA PARTY 100.00 PIZZA WORLD
______________________________________________________________________________________________________________________
1 Send Inquiries To:
-
ACCOUNT NUMBER: 987654321
0 YTD DIV RECEIVED: 5,000.00
0 PAGE NUMBER: 2 of 2
2
0 www.cactuslovers.com
-
WALTER GUMMY
PROFESSIONAL ICE CREAM TASTER
123 FROSTY LN
ICECREAMVILLE, IV 54321
-
Join the "Cactus Lovers Club" for exclusive cactus-themed merchandise and discounts.
Visit our website to see the world's largest cactus collection!
0 SUMMARY OF YOUR ACCOUNTS
0 ______________________________________________________________________________________________________________________
| | | |
| SUFFIX 002 MYSTERY COINS | | |
| JOINT: WALTER GUMMY | | |
| STATEMENT PERIOD 02/20/25 - 02/20/25 | | |
| BEGINNING BALANCE 2,500.00 | | |
| DEPOSITS 200.00 | | |
| WITHDRAWALS 100.00 | | |
| ICE CREAM CLEARED 0.00 | | |
| ENDING BALANCE 2,600.00 | | |
| | | |
| CUPCAKE YEAR-TO-DATE 50.00 | | |
| CUPCAKE THIS PERIOD 10.00 | | |
| AVERAGE CHOCOLATE COIN BALANCE 1,000.00 | | |
| DAYS ICE CREAM TASTED 5 | | |
| ANNUAL ICE CREAM TASTER REWARD 10.00% | | |
| | | |
______________________________________________________________________________________________________________________
0 SUFFIX 002 MYSTERY COINS
______________________________________________________________________________________________________________________
0 HISTORY
-------
DATE DESCRIPTION TRANSACTION AMOUNT ACCOUNT BALANCE
---- ----------- ------------------ ---------------
02/20/25 DEPOSIT FROM MARSHMALLOW FACTORY 200.00 2,600.00
A GUMMY REWARD OF 10.00 WILL BE POSTED TO YOUR ACCOUNT ON 02/20/25
______________________________________________________________________________________________________________________
Required Libraries:
pip install PyQt6 fpdf
Note:
When creating a profile you can set the font to 8.0 and cell height to 4.0.
Code:
import sys
import os
import re
import shutil
import sqlite3
from pathlib import Path
from datetime import datetime
from fpdf import FPDF
from PyQt6 import QtCore, QtWidgets
from PyQt6.QtGui import QIcon
from PyQt6.QtCore import QThread, pyqtSignal, Qt
from PyQt6.QtWidgets import (
    QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout,
    QPushButton, QLabel, QComboBox, QStatusBar, QMessageBox,
    QInputDialog, QListWidget, QProgressBar, QApplication, QMainWindow
)
# Define Directories
base_dir = Path(r'C:\path\to\base\dir')
input_dir = base_dir / '01-Input'
output_dir = base_dir / '02-Output'
processed_dir = base_dir / '03-Processed'
db_path = base_dir / 'profiles.db'
# Initialize the database
def init_db(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            name TEXT PRIMARY KEY,
            font_size REAL,
            cell_height REAL
        )
    """)
    conn.commit()
    conn.close()

def fetch_profiles(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT name, font_size, cell_height FROM profiles")
    profiles = {row[0]: {"font_size": row[1], "cell_height": row[2]} for row in cursor.fetchall()}
    conn.close()
    return profiles

def add_profile(db_path, name, font_size, cell_height):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        INSERT INTO profiles (name, font_size, cell_height)
        VALUES (?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            font_size = excluded.font_size,
            cell_height = excluded.cell_height
    """, (name, font_size, cell_height))
    conn.commit()
    conn.close()

def delete_profile_from_db(db_path, name):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("DELETE FROM profiles WHERE name = ?", (name,))
    conn.commit()
    conn.close()
# IBM Data Processing Logic
# Precompile the regular expressions for better performance
form_feed_pattern = re.compile(r'\$DJDE FORMS=NONE,FEED=(MAIN|AUX),FORMAT=(PAGE1|PAGE2),END;')
def replacement(content):
    # Remove the specific forms and feeds using a single regex
    content = form_feed_pattern.sub('', content)
    # Remove the first character if it starts with '2' or '3'
    if content.startswith(('2', '3')):
        content = content[1:]
    # Remove the period if it ends with ' .'
    if content.endswith(' .'):
        content = content.rstrip(' .')
    return content
def process_ibm_data(file_path):
    formatted_lines = []
    page_started = False
    page_count = 0
    # Open the file in read mode
    with open(file_path, 'r', encoding='cp1252') as file:
        # Read file in chunks
        chunk_size = 20 * 1024 * 1024  # 20MB chunk size
        while chunk := file.read(chunk_size):
            # Process each chunk
            lines = chunk.splitlines()
            for line in lines:
                control_char = line[0]
                content = line[1:].rstrip()
                content = replacement(content)  # Apply all replacements in one pass
                # Process the line based on control character
                if control_char == '1':
                    if page_started:
                        formatted_lines.append("\f")  # Add page break for each page
                    page_started = True
                    page_count += 1
                    formatted_lines.append(content)
                elif control_char == '0':
                    formatted_lines.append("")  # Add an empty line
                    formatted_lines.append(content)
                elif control_char == '-':
                    formatted_lines.append("")  # Add two empty lines for spacing
                    formatted_lines.append("")
                    formatted_lines.append(content)
                elif control_char == '+':
                    if formatted_lines:
                        formatted_lines[-1] += content  # Append to the last line
                    else:
                        formatted_lines.append(content)
                elif control_char == ' ':
                    if not page_started:
                        page_started = True
                        page_count += 1
                    formatted_lines.append(content)
    return formatted_lines, page_count
# Create PDF logic
def create_pdf(output_path, lines, profile):
    pdf = FPDF(format='letter')
    pdf.set_auto_page_break(auto=True, margin=0.5)
    pdf.set_margins(left=5, top=0.5, right=0.5)
    pdf.add_page()
    pdf.set_font("Courier", size=profile["font_size"])
    for line in lines:
        if line == "\f":  # Add a new page when page break is encountered
            pdf.add_page()
        else:
            pdf.multi_cell(0, profile["cell_height"], line, align="L")  # Aligns text to the left
    pdf.output(output_path)
# Check if a page is blank or only contains the number '1'
def is_blank_or_single_number_page(lines):
    return all(line.strip() == "" or line.strip() == "1" for line in lines)
from concurrent.futures import ThreadPoolExecutor, as_completed
class FileProcessorThread(QThread):
    processing_done = pyqtSignal()
    processing_error = pyqtSignal(str)
    processed_files = pyqtSignal(str, str, str, int)
    progress_updated = pyqtSignal(int)

    def __init__(self, input_files, output_dir, processed_subdir, profile, batch_size=500, parent=None):
        super().__init__(parent)
        self.input_files = input_files
        self.output_dir = output_dir
        self.processed_subdir = processed_subdir
        self.profile = profile
        self.batch_size = batch_size

    def process_file(self, file_path):
        current_date = datetime.now().strftime('%Y%m%d')
        blank_page_count = 0
        file_stem = file_path.stem
        output_file_name = f'{file_stem}_{current_date}_Cleansed.PDF'
        output_file_path = self.output_dir / output_file_name
        # Process the IBM data and get the formatted lines and page count
        formatted_lines, page_count = process_ibm_data(file_path)
        # Count the total number of pages
        total_pdf_pages = sum(1 for line in formatted_lines if line == "\f")
        non_blank_lines = []
        current_page_lines = []
        current_page_num = 0
        for line in formatted_lines:
            if line == "\f":
                if is_blank_or_single_number_page(current_page_lines):
                    blank_page_count += 1
                else:
                    non_blank_lines.extend(current_page_lines)
                    non_blank_lines.append("\f")
                current_page_num += 1
                progress = int((current_page_num / total_pdf_pages) * 100)
                self.progress_updated.emit(progress)
                current_page_lines = []
            else:
                current_page_lines.append(line)
        if not is_blank_or_single_number_page(current_page_lines):
            non_blank_lines.extend(current_page_lines)
        if current_page_lines:
            current_page_num += 1
            progress = int((current_page_num / total_pdf_pages) * 100)
            self.progress_updated.emit(progress)
        create_pdf(output_file_path, non_blank_lines, self.profile)
        shutil.move(str(file_path), self.processed_subdir / file_path.name)
        self.processed_files.emit(file_path.name, str(page_count), output_file_name, blank_page_count)
        return blank_page_count

    def run(self):
        try:
            blank_page_count = 0
            total_files = len(self.input_files)
            # Using ThreadPoolExecutor for parallel processing of files
            with ThreadPoolExecutor() as executor:
                futures = [executor.submit(self.process_file, file) for file in self.input_files]
                for future in as_completed(futures):
                    blank_page_count += future.result()
            # print(f"Total blank pages removed: {blank_page_count}\n")
            self.processing_done.emit()
        except Exception as e:
            self.processing_error.emit(str(e))
class IBMFileProcessorApp(QMainWindow):
    def __init__(self):
        super().__init__()
        # Set up main window
        self.setWindowTitle("test")
        self.setWindowIcon(QIcon(r"python_scripts\ibm carriage control\assets\icons\letter-r.ico"))
        self.setGeometry(450, 250, 968, 394)
        # Initialize database
        init_db(db_path)
        # Set up central widget and layout
        self.central_widget = QWidget(self)
        self.setCentralWidget(self.central_widget)
        self.layout = QVBoxLayout(self.central_widget)
        # Set up the splitter for the two sections (left for file explorer, right for dropped files and progress)
        self.splitter = QtWidgets.QSplitter(self)
        self.splitter.setOrientation(QtCore.Qt.Orientation.Horizontal)
        self.layout.addWidget(self.splitter)
        # Create UI Elements
        self.create_left_side()  # Left side: drag-and-drop file explorer
        self.create_right_side()  # Right side: dropped files list and progress bar
        self.create_control_area()
        self.create_file_list_area()
        self.create_status_bar()
    def create_left_side(self):
        # Left side: Original file explorer logic
        self.left_widget = QWidget(self.splitter)
        self.left_layout = QVBoxLayout(self.left_widget)
        self.drop_area_label = QLabel("Drag and Drop Files Here", self)
        self.drop_area_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
        self.drop_area_label.setStyleSheet("background-color: #1988ea; font: bold 12pt Arial; padding: 20px;")
        self.left_layout.addWidget(self.drop_area_label)
        self.drop_area_label.setAcceptDrops(True)
        self.drop_area_label.dragEnterEvent = self.drag_enter_event
        self.drop_area_label.dragMoveEvent = self.drag_move_event
        self.drop_area_label.dropEvent = self.drop_event

    def create_right_side(self):
        # Right side: Display dropped file name(s) and progress bar
        self.right_widget = QWidget(self.splitter)
        self.right_layout = QVBoxLayout(self.right_widget)
        # Label for the dropped files
        self.dropped_files_label = QLabel("Dropped Files", self)
        self.dropped_files_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
        self.right_layout.addWidget(self.dropped_files_label)
        # List widget to display the dropped file names
        self.dropped_files_list = QListWidget(self)
        self.right_layout.addWidget(self.dropped_files_list)
        # Progress bar for showing the file processing progress
        self.progress_bar = QProgressBar(self)
        self.progress_bar.setRange(0, 100)
        self.right_layout.addWidget(self.progress_bar)
    def drag_enter_event(self, event):
        # Only accept text files for drag-and-drop
        if event.mimeData().hasUrls():
            urls = event.mimeData().urls()
            for url in urls:
                if url.toLocalFile().lower().endswith('.txt'):
                    self.drop_area_label.setStyleSheet("background-color: #44b8ff; font: bold 12pt Arial; padding: 20px; border: 3px solid #005c99;")
                    event.acceptProposedAction()
                    return
        event.ignore()  # Ignore if it's not a text file

    def drag_move_event(self, event):
        # This keeps the hover effect while moving the file over the drop area
        if event.mimeData().hasUrls():
            urls = event.mimeData().urls()
            for url in urls:
                if url.toLocalFile().lower().endswith('.txt'):
                    self.drop_area_label.setStyleSheet("background-color: #44b8ff; font: bold 12pt Arial; padding: 20px; border: 3px solid #005c99;")
                    event.accept()
                    return
        event.ignore()

    def drop_event(self, event):
        # Reset style and handle file drop
        self.drop_area_label.setStyleSheet("background-color: #1988ea; font: bold 12pt Arial; padding: 20px;")
        files = event.mimeData().urls()
        for url in files:
            file_path = Path(url.toLocalFile())
            shutil.move(file_path, input_dir / file_path.name)
        # Update the dropped files list in the right panel
        self.update_dropped_files_list()

    def update_dropped_files_list(self):
        # This will update the dropped files list on the right panel
        self.dropped_files_list.clear()
        for file in os.listdir(input_dir):
            self.dropped_files_list.addItem(file)
    def create_control_area(self):
        self.control_area = QWidget(self)
        self.control_layout = QHBoxLayout(self.control_area)
        self.layout.addWidget(self.control_area)
        # Profile Dropdown
        self.profile_dropdown = QComboBox(self)
        self.profile_dropdown.addItems(self.get_profile_names())
        self.control_layout.addWidget(self.profile_dropdown)
        # Buttons
        self.create_profile_button = QPushButton("New Profile", self)
        self.create_profile_button.clicked.connect(self.create_new_profile)
        self.control_layout.addWidget(self.create_profile_button)
        self.edit_profile_button = QPushButton("Edit Profile", self)
        self.edit_profile_button.clicked.connect(self.edit_profile)
        self.control_layout.addWidget(self.edit_profile_button)
        self.delete_profile_button = QPushButton("Delete Profile", self)
        self.delete_profile_button.clicked.connect(self.delete_profile)
        self.control_layout.addWidget(self.delete_profile_button)
        self.process_button = QPushButton("Process Files", self)
        self.process_button.clicked.connect(self.process_files)
        self.control_layout.addWidget(self.process_button)

    def create_file_list_area(self):
        self.file_list_label = QLabel("Processed Files", self)
        self.file_list_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
        self.layout.addWidget(self.file_list_label)
        self.files_list = QListWidget(self)
        self.files_list.itemDoubleClicked.connect(self.open_file_or_directory)
        self.layout.addWidget(self.files_list)
        self.refresh_button = QPushButton("Refresh", self)
        self.refresh_button.clicked.connect(self.update_files_list)
        self.layout.addWidget(self.refresh_button)

    def create_status_bar(self):
        self.status_bar = QStatusBar(self)
        self.setStatusBar(self.status_bar)

    def create_progress_bar(self):
        self.progress_bar = QProgressBar(self)
        self.progress_bar.setRange(0, 100)
        self.layout.addWidget(self.progress_bar)

    def get_profile_names(self):
        profiles = fetch_profiles(db_path)
        return list(profiles.keys())

    def update_files_list(self):
        self.files_list.clear()
        for file in os.listdir(output_dir):
            self.files_list.addItem(file)
    def create_new_profile(self):
        name, ok = QInputDialog.getText(self, "New Profile", "Enter the profile name:")
        if ok and name:
            font_size, ok = QInputDialog.getDouble(self, "Font Size", "Enter font size:", min=1)
            if ok:
                cell_height, ok = QInputDialog.getDouble(self, "Cell Height", "Enter cell height:", min=1)
                if ok:
                    add_profile(db_path, name, font_size, cell_height)
                    self.profile_dropdown.addItem(name)
                    QMessageBox.information(self, "Success", f"Profile '{name}' created.")
                else:
                    QMessageBox.warning(self, "Invalid Input", "Please enter a valid cell height.")
            else:
                QMessageBox.warning(self, "Invalid Input", "Please enter a valid font size.")
        else:
            QMessageBox.warning(self, "Invalid Input", "Profile name cannot be empty.")

    def edit_profile(self):
        name = self.profile_dropdown.currentText()
        if name:
            profiles = fetch_profiles(db_path)
            font_size, ok = QInputDialog.getDouble(self, "Font Size", "Enter the new font size:", value=profiles[name]["font_size"], min=1)
            if ok:
                cell_height, ok = QInputDialog.getDouble(self, "Cell Height", "Enter the new cell height:", value=profiles[name]["cell_height"], min=1)
                if ok:
                    add_profile(db_path, name, font_size, cell_height)
                    self.profile_dropdown.setItemText(self.profile_dropdown.currentIndex(), name)
                    QMessageBox.information(self, "Success", f"Profile '{name}' updated.")
                else:
                    QMessageBox.warning(self, "Invalid Input", "Please enter a valid cell height.")
            else:
                QMessageBox.warning(self, "Invalid Input", "Please enter a valid font size.")
        else:
            QMessageBox.warning(self, "Select Profile", "Please select a profile to edit.")

    def delete_profile(self):
        name = self.profile_dropdown.currentText()
        if name:
            reply = QMessageBox.question(self, "Delete Profile", f"Are you sure you want to delete profile '{name}'?",
                                         QMessageBox.StandardButton.Yes | QMessageBox.StandardButton.No)
            if reply == QMessageBox.StandardButton.Yes:
                delete_profile_from_db(db_path, name)
                self.profile_dropdown.removeItem(self.profile_dropdown.currentIndex())
                QMessageBox.information(self, "Success", f"Profile '{name}' deleted.")
                self.profile_dropdown.clear()
                self.profile_dropdown.addItems(self.get_profile_names())  # Refresh profile list
        else:
            QMessageBox.warning(self, "Select Profile", "Please select a profile to delete.")
    def process_files(self):
        # Get selected profile
        selected_profile_name = self.profile_dropdown.currentText()
        if not selected_profile_name:
            QMessageBox.warning(self, "No Profile Selected", "Please select a profile before processing files.")
            return
        profiles = fetch_profiles(db_path)
        profile = profiles[selected_profile_name]
        input_files = list(input_dir.glob("*.txt"))
        if not input_files:
            QMessageBox.warning(self, "No Files", "No files to process.")
            return
        # Disable the process button while processing
        self.process_button.setDisabled(True)
        # Start the file processing in a separate thread with a specified batch size (e.g., 5 files per batch)
        self.processor_thread = FileProcessorThread(input_files, output_dir, processed_dir, profile, batch_size=700)
        self.processor_thread.progress_updated.connect(self.update_progress_bar)
        self.processor_thread.processing_done.connect(self.processing_done)
        self.processor_thread.processing_error.connect(self.processing_error)
        self.processor_thread.processed_files.connect(self.file_processed)
        self.processor_thread.start()

    def update_progress_bar(self, progress):
        self.progress_bar.setValue(progress)

    def processing_done(self):
        QMessageBox.information(self, "Processing Complete", "All files have been processed.")
        self.process_button.setDisabled(False)
        self.progress_bar.setValue(0)

    def processing_error(self, error_message):
        QMessageBox.critical(self, "Error", f"An error occurred during processing: {error_message}")
        self.process_button.setDisabled(False)

    def file_processed(self, file_name, page_count, output_file_name, blank_page_count):
        # Add info to status bar about the processed file
        self.status_bar.showMessage(f"Processed: {file_name}, Pages: {page_count}, Output: {output_file_name}, Blank Pages Removed: {blank_page_count}")

    def open_file_or_directory(self, item):
        # Open the file or directory when clicked
        file_path = output_dir / item.text()
        if file_path.is_dir():
            os.startfile(file_path)
        else:
            os.startfile(file_path)
if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = IBMFileProcessorApp()
    window.show()
    sys.exit(app.exec())
1 Answer
You asked to "get a second opinion on possibly a more powerful library that will allow me to quickly convert these text files to PDF."
I don't have any advice, but perhaps others will. In the meantime, keep researching.
There may be a few other things you can try to isolate the speed problem. For example, it is not clear why you are using SQL to translate a plain text file to PDF. If SQL is not essential to the conversion, remove it temporarily to see if anything improves.
Also, it is not clear why you need the PyQt6 GUI for the conversion to PDF. Again, remove the GUI temporarily. That is unlikely to speed anything up, but at least it isolates the problem.
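One way to do this isolation is to run the parse and render steps from a plain script, with no GUI and no database, and time each step separately. The sketch below is only an illustration: `time_stages`, `parse_stage`, and `render_stage` are hypothetical names, and the two stage functions are trivial stand-ins that you would replace with `process_ibm_data` and a small wrapper around `create_pdf`.

```python
import time


def time_stages(stages, data):
    """Run (name, func) stages in order, passing each result to the next.

    Returns the final result and a dict of elapsed seconds per stage.
    """
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical stand-in for process_ibm_data: split raw text into lines.
def parse_stage(text):
    return text.splitlines()


# Hypothetical stand-in for create_pdf: join the lines back together.
def render_stage(lines):
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "1Send Inquiries To:\n0SUMMARY OF YOUR ACCOUNTS\n"
    _, timings = time_stages(
        [("parse", parse_stage), ("render", render_stage)], sample
    )
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f}s")
```

If nearly all of the time lands in the render stage, the PDF library is the bottleneck. If it lands in the parse stage, swapping libraries will not help much.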
Finally, use a profiling tool to see if anything is taking longer than expected.
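As a minimal sketch of what profiling could look like, the standard library's `cProfile` and `pstats` modules can report where the time goes. The helper name `profile_call` and the sample function `slow_join` are made up for illustration; in practice you would pass your own function, such as `process_ibm_data` or `create_pdf`, with its arguments.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Profile one call to func and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Hypothetical workload used only to have something to profile.
def slow_join(n):
    out = ""
    for i in range(n):
        out += str(i)
    return out


if __name__ == "__main__":
    _, report = profile_call(slow_join, 10_000)
    print(report)
```

The rows with the largest cumulative time are the ones worth optimizing first.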
The remaining suggestions are purely for code style.
Layout
Move the class to the top after the import lines. Move the other functions after the class. Having them in the middle of the code interrupts the natural flow of the code (from a human readability standpoint).
Also, move this import to the top with all the other import lines:
from concurrent.futures import ThreadPoolExecutor, as_completed
Documentation
The PEP 8 style guide recommends adding docstrings for classes and functions. The class docstring should summarize the purpose of the code.
For functions, you can convert comments like this:
# Initialize the database
def init_db(db_path):
into docstrings:
def init_db(db_path):
    """ Initialize the database """
You should add details regarding what kind of database you are using (what does it store) and what you are initializing it to.
Other function docstrings should describe input types and return types.
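For example, a docstring for `fetch_profiles` that describes its input and return types might look like the sketch below. The `Args:`/`Returns:` layout is one common convention that I am assuming here, not something your code or PEP 8 requires. The body behaves the same as your function, apart from closing the connection in a `finally` block.

```python
import sqlite3


def fetch_profiles(db_path):
    """Return all saved formatting profiles from the SQLite database.

    Args:
        db_path (str | pathlib.Path): Path to the SQLite file that
            holds the ``profiles`` table.

    Returns:
        dict[str, dict[str, float]]: Maps each profile name to a dict
        with the keys ``font_size`` and ``cell_height``.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, font_size, cell_height FROM profiles"
        ).fetchall()
    finally:
        conn.close()
    return {
        name: {"font_size": font_size, "cell_height": cell_height}
        for name, font_size, cell_height in rows
    }
```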
DRY
There are duplicate lines in the open_file_or_directory function. The startfile call is the same in both branches of the if/else:
if file_path.is_dir():
    os.startfile(file_path)
else:
    os.startfile(file_path)
Unless that is a bug, this code does the same thing without the repetition:
def open_file_or_directory(self, item):
    # Open the file or directory when clicked
    file_path = output_dir / item.text()
    os.startfile(file_path)
Tools
You could run code development tools to automatically find some style issues with your code.
ruff finds things like:
F811 [*] Redefinition of unused `QApplication` from line
|
| QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout,
| QPushButton, QLabel, QComboBox, QStatusBar, QMessageBox,
| QInputDialog, QListWidget, QProgressBar, QApplication, QMainWindow
| ^^^^^^^^^^^^ F811
| )
|
= help: Remove definition: `QApplication`
Also:
= help: Remove definition: `QMainWindow`