The code below copies the second column from the right of every table in all the .docx files in the folder where the .py file is saved, and does some editing on the content in specific cases. The problem is that it is painfully slow: I tested it on a folder with 38 files and about 22,000 rows in total (a fairly typical daily workload for me) and it took 2.5 hours to finish.
```python
import os
from docx import Document
from openpyxl import Workbook

folder = os.path.dirname(os.path.abspath(__file__))

wb = Workbook()
ws = wb.active

for filename in os.listdir(folder):
    if filename.endswith(".docx"):
        doc = Document(os.path.join(folder, filename))
        for table in doc.tables:
            ws.append(["File name: " + filename])
            for row in table.rows:
                # Copy the second column from the right.
                ws.append([row.cells[-2].text])

wb.save("output.xlsx")

# Prefix an apostrophe to values that Excel would otherwise treat as formulas.
for row in ws.iter_rows():
    for cell in row:
        if cell.value and (cell.value[0] in ["+", "-", "="]):
            cell.value = "'" + cell.value

wb.save("output.xlsx")
```
1 Answer
This question is about elapsed time. It should include observations from cProfile.
More than two hours for a smallish number of rows? Less than three lines per second? Wow, that's impressively slow.
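One easy way to collect those observations, assuming the script body has been wrapped in a main() function (my naming, not the question's):

```python
import cProfile
import pstats

# Profile a full run and dump the stats to a file.
cProfile.run("main()", "profile.out")

# Show the 20 entries with the largest cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)
```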
There's not a lot going on here. So all I can imagine is that this might have O(N^2) quadratic performance:
```python
ws = wb.active
...
    for row in table.rows:
        ws.append([row.cells[-2].text])
```
To verify, simply comment out those two lines (or turn the append line into `pass`) and do a timing run.
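A minimal sketch of such a timing run, using only the standard library:

```python
import time

start = time.perf_counter()
# ... run the table-copying loop here, with ws.append() left in or replaced by pass ...
print(f"elapsed: {time.perf_counter() - start:.1f} s")
```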
If my guess is correct and that is the source of the delay, then consider accumulating rows in a `list` rather than a worksheet, and appending the rows all at once.
Or, almost the same thing: use `len(...)` of that list to pre-extend the worksheet so it has the proper number of rows, and then store each list element in the proper row.
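A rough sketch of that restructuring, reusing the names from the question; whether it actually helps depends on whether `ws.append` is really the bottleneck:

```python
import os

from docx import Document
from openpyxl import Workbook

folder = os.path.dirname(os.path.abspath(__file__))
wb = Workbook()
ws = wb.active

collected = []  # plain Python list; appending here is cheap

for filename in os.listdir(folder):
    if filename.endswith(".docx"):
        doc = Document(os.path.join(folder, filename))
        for table in doc.tables:
            collected.append("File name: " + filename)
            for row in table.rows:
                collected.append(row.cells[-2].text)

# One pass at the end: address cells by index instead of
# growing the sheet one append at a time.
for i, text in enumerate(collected, start=1):
    ws.cell(row=i, column=1, value=text)

wb.save("output.xlsx")
```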
Theory: some APIs suffer from the high cost of repeatedly appending a single element. We saw this, for example, in early versions of the CPython interpreter, which led to this idiom:
```python
def bottles():  # wrapped in a function so the return statement is valid
    lines = []
    for n in range(1_000_000):
        lines.append(f"{n} bottles of beer on the wall")
    return "\n".join(lines)
```
Nowadays the allocation behavior is better, so it is safe to write:

```python
def bottles():
    lines = ""
    for n in range(1_000_000):
        lines += f"{n} bottles of beer on the wall\n"
    return lines
```
What changed? The allocator now "wastes" some memory when extending a string, anticipating that this might not be the last such extension. Crucially, it uses a multiplicative growth factor, such as doubling the allocation or multiplying the current length by, say, 1.3. Any factor greater than one will do.
We still occasionally have to do O(N) linear work to copy the existing data into a newly allocated buffer, but that happens only O(log N) times over the life of the string, so the amortized cost of each extension is effectively constant.
The `list` allocator has always had that behavior.
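That over-allocation is easy to observe in CPython (the exact numbers vary by version; this is only an illustration of the growth pattern):

```python
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # Capacity grows in multiplicative-ish jumps, not one slot at a time.
        print(f"len={len(lst):3d}  bytes={size}")
        last = size
```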
NumPy's `np.append` exhibits similar quadratic badness (it copies the whole array on every call), so there is strong incentive to pre-allocate an appropriate number of rows from the get-go.
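For example (illustrative only, not part of the question's code):

```python
import numpy as np

n = 100_000

# Quadratic: np.append copies the whole array on every call.
slow = np.array([])
for i in range(n):
    slow = np.append(slow, i)

# Linear: allocate the full array once, then fill it in place.
fast = np.empty(n)
for i in range(n):
    fast[i] = i
```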
I don't know about worksheet behavior. But that's my guess. Let us know how close I came to the mark.
As a backup plan, consider using the very fast `csv` module to create a giant output.csv file, and finally turn that file into the desired XLSX format in a single operation.
Guaranteed to go faster than three rows per second.
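A sketch of that backup plan, assuming pandas handles the final CSV-to-XLSX conversion (any other converter would do):

```python
import csv

import pandas as pd  # assumption: pandas is available for the final conversion

# Stream every extracted cell into a CSV; csv.writer is very cheap per row.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for text in collected:  # 'collected' as built in the earlier sketch
        writer.writerow([text])

# Then convert the whole file to XLSX in one shot.
pd.read_csv("output.csv", header=None).to_excel("output.xlsx", index=False, header=False)
```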