The code below copies the second column from the right of every table in all the .docx files in the folder where the .py file is saved, and does some editing on the content in specific cases. The problem is that it is painfully slow: I tested it on a folder with 38 files and about 22,000 rows in total (a fairly typical daily workload for me) and it took 2.5 hours to finish.
```python
import os
from docx import Document
from openpyxl import Workbook

folder = os.path.dirname(os.path.abspath(__file__))

wb = Workbook()
ws = wb.active

for filename in os.listdir(folder):
    if filename.endswith(".docx"):
        doc = Document(os.path.join(folder, filename))
        for table in doc.tables:
            ws.append(["File name: " + filename])
            for row in table.rows:
                # Copy the second column from the right.
                ws.append([row.cells[-2].text])

wb.save("output.xlsx")

# Prefix an apostrophe to values that Excel would otherwise treat as formulas.
for row in ws.iter_rows():
    for cell in row:
        if cell.value and (cell.value[0] in ["+", "-", "="]):
            cell.value = "'" + cell.value

wb.save("output.xlsx")
```
1 Answer
This question is about elapsed time. It should include observations from cProfile.
More than two hours for a smallish number of rows? Less than three lines per second? Wow, that's impressively slow.
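One easy way to collect those observations, assuming the script body has been wrapped in a main() function (my naming, not the question's):

```python
import cProfile
import pstats

# Profile a full run and dump the stats to a file.
cProfile.run("main()", "profile.out")

# Show the 20 entries with the largest cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)
```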
There's not a lot going on here. So all I can imagine is that this might have O(N^2) quadratic performance:
```python
ws = wb.active
...
    for row in table.rows:
        ws.append([row.cells[-2].text])
```
To verify, simply comment out those two lines (or turn the append line into `pass`) and do a timing run.
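A minimal sketch of such a timing run, using only the standard library:

```python
import time

start = time.perf_counter()
# ... run the table-copying loop here, with ws.append() left in or replaced by pass ...
print(f"elapsed: {time.perf_counter() - start:.1f} s")
```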
If my guess is correct and that is the source of the delay, then consider accumulating rows in a `list` rather than a worksheet, and appending the rows all at once.
Or, almost the same thing: use `len(...)` of that list to pre-extend the worksheet so it has the proper number of rows, and then store each list element in the proper row.
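A rough sketch of that restructuring, reusing the names from the question; whether it actually helps depends on whether `ws.append` is really the bottleneck:

```python
import os

from docx import Document
from openpyxl import Workbook

folder = os.path.dirname(os.path.abspath(__file__))
wb = Workbook()
ws = wb.active

collected = []  # plain Python list; appending here is cheap

for filename in os.listdir(folder):
    if filename.endswith(".docx"):
        doc = Document(os.path.join(folder, filename))
        for table in doc.tables:
            collected.append("File name: " + filename)
            for row in table.rows:
                collected.append(row.cells[-2].text)

# One pass at the end: address cells by index instead of
# growing the sheet one append at a time.
for i, text in enumerate(collected, start=1):
    ws.cell(row=i, column=1, value=text)

wb.save("output.xlsx")
```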
Theory: some APIs suffer from the high cost of repeatedly appending a single element. We saw this, for example, in early versions of the CPython interpreter, which led to this idiom:
```python
def bottles():  # wrapped in a function so the return statement is valid
    lines = []
    for n in range(1_000_000):
        lines.append(f"{n} bottles of beer on the wall")
    return "\n".join(lines)
```
Nowadays the allocation behavior is better, so it is safe to write:

```python
def bottles():
    lines = ""
    for n in range(1_000_000):
        lines += f"{n} bottles of beer on the wall\n"
    return lines
```
What changed? The allocator now "wastes" some memory when extending a string, anticipating that this might not be the last such extension. Crucially, it uses a multiplicative growth factor, such as doubling the allocation or multiplying the current length by, say, 1.3. Any factor greater than one will do.
We still occasionally have to do O(N) linear work to copy the existing data into a newly allocated buffer, but that happens only O(log N) times over the life of the string, so the amortized cost of each extension is effectively constant.
The `list` allocator has always had that behavior.
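That over-allocation is easy to observe in CPython (the exact numbers vary by version; this is only an illustration of the growth pattern):

```python
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # Capacity grows in multiplicative-ish jumps, not one slot at a time.
        print(f"len={len(lst):3d}  bytes={size}")
        last = size
```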
NumPy's `np.append` exhibits similar quadratic badness (it copies the whole array on every call), so there is strong incentive to pre-allocate an appropriate number of rows from the get-go.
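For example (illustrative only, not part of the question's code):

```python
import numpy as np

n = 100_000

# Quadratic: np.append copies the whole array on every call.
slow = np.array([])
for i in range(n):
    slow = np.append(slow, i)

# Linear: allocate the full array once, then fill it in place.
fast = np.empty(n)
for i in range(n):
    fast[i] = i
```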
I don't know about worksheet behavior. But that's my guess. Let us know how close I came to the mark.
As a backup plan, consider using the very fast `csv` module to create a giant output.csv file, and finally turn that file into the desired XLSX format in a single operation.
Guaranteed to go faster than three rows per second.
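A sketch of that backup plan, assuming pandas handles the final CSV-to-XLSX conversion (any other converter would do):

```python
import csv

import pandas as pd  # assumption: pandas is available for the final conversion

# Stream every extracted cell into a CSV; csv.writer is very cheap per row.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for text in collected:  # 'collected' as built in the earlier sketch
        writer.writerow([text])

# Then convert the whole file to XLSX in one shot.
pd.read_csv("output.csv", header=None).to_excel("output.xlsx", index=False, header=False)
```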