Python3 - Recursive XML Files to Large Text File
I have a recursive function that looks through a directory (there are 6000+ files). For each XML file found, the XML is converted to a string and appended to an array. At the end of the recursion, all these strings are joined to return a (very) large string.

The error I am getting is `MemoryError` - how can I optimise my code?
```
import os
import xml.etree.ElementTree as ET

# Cycle through a directory and concat files together - recursive dive.
# NOTE:
# Have a folder called 'PLC 2360' in the immediate directory.
# The output is a huge text file

path = 'PLC 2360'
big_text = []


def big_file(path):
    try:
        for i in next(os.walk(path))[2]:  # For each XML file read and convert to string
            xml = ET.parse(path + '/' + i).getroot()
            big_text.append(ET.tostring(xml, encoding='unicode'))  # Append string of XML
    except:
        print('Error on', path)
    if next(os.walk(path))[1]:  # If folders exist - cycle through
        for i in next(os.walk(path))[1]:
            big_file(path + '/' + i)  # Enter next recursion layer
    return ''.join(str(i) for i in big_text)  # Return one big string


lis = big_file(path)
print(len(lis))
print(lis[:500])
```
- "and appended to an array" Why? If you want to append to a file, append to a file. What's the array for? There are only files being read by your code, not written. I don't see a target text file anywhere, so it looks like the code as written does not produce the required result yet. Please take a look at the help center. – Mast, Jun 16, 2020 at 7:07
- I'm assuming you're doing this to create plain-text exports of PLC tag data? – Mast, Jun 16, 2020 at 7:11
- @Mast - I am appending to an array currently in case I need to do some specific preprocessing of certain files - my output is currently a string from the function to which I can write to a file later. Yes, PLC tags, good spot - I am also considering using Threading to shoot off threads but not sure how well that will work with recursion. – leopardxpreload, Jun 16, 2020 at 8:14
- Threading may only make your problem worse at this point, since you'll run out of memory even faster. Keeping the data in memory because you may need to do something sounds like a good reason, but it's really not feasible with the amount of data you're handling. Is keeping one file in memory at once an acceptable alternative? I can't help but feel there must be more code than you're showing. On Code Review, details matter. – Mast, Jun 16, 2020 at 8:29
- There are options for compressing, but as long as you have the storage space, it does not matter initially. Your tag files are not going to fill your entire memory if you load them one-by-one, they're not big enough for that. The problem is you're trying to keep them all in memory at once. That's not going to work if you have thousands of files. – Mast, Jun 16, 2020 at 9:08
1 Answer
functions
Your script does a few things:
- iterate over the files
- recursively iterate over the subdirectories
- parse the files
- concatenate the results
Better would be to separate those into different functions.
The extra advantage is that you can test these functions separately, document them with a docstring, and add type hints.
comments
You comment what the code does. Python is luckily so expressive that almost anyone can understand what a particular line does. What is more difficult is why you do certain steps, and why you do them in a certain order. That is what should be commented.
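For example (the "why" comment below is invented purely to illustrate the distinction, it is not taken from your project):

```
# What the line does - the code already says this, so the comment adds nothing:
big_text.append(ET.tostring(xml, encoding='unicode'))  # Append string of XML

# Why the line is there - this is the part a reader cannot guess:
# keep the raw XML text around so the tag exports can be post-processed
# before the combined file is written out.
big_text.append(ET.tostring(xml, encoding='unicode'))
```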
global variables
Your `big_file` function changes the global state of the program. That makes it more difficult to reason about, and also makes it difficult to use this function on 2 separate directories. Here you append to `big_text`. If you want to keep it like this, I would pass it in as a function parameter instead of a global variable:
```
def big_file(path, big_text=None):
    if big_text is None:
        big_text = []
    ...
    big_file(path + '/' + i, big_text=big_text)
```
pathlib.Path
Most file operations are simpler when using the `pathlib` module. It will be a lot more robust than manually concatenating paths like in `path + '/' + i`.
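A small sketch of the difference (the file and folder names are just placeholders):

```
from pathlib import Path

base = Path('PLC 2360')
xml_file = base / 'subfolder' / 'tags.xml'   # instead of path + '/' + i
xml_file.suffix                              # '.xml'
xml_file.is_file()                           # uses the right separator on any platform
```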
error handling
You have a `try`-`except` block with a bare `except`. Better here would be to catch the errors you expect specifically and handle those, and let other, unexpected errors bubble up. Failing hard and fast, instead of covering up bugs, can help you write more stable and correct software.
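For instance, if the failures you expect are malformed XML and unreadable files (an assumption - adjust this to whatever errors actually show up), the bare `except` in your loop could become:

```
try:
    xml = ET.parse(path + '/' + i).getroot()
    big_text.append(ET.tostring(xml, encoding='unicode'))
except ET.ParseError:
    # The file exists but is not valid XML - report it and move on.
    print('Malformed XML in', path + '/' + i)
except OSError:
    # The file could not be opened at all (permissions, broken link, ...).
    print('Could not read', path + '/' + i)
```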
logging
Instead of using `print`, you can use the `logging` module. That way you can make a distinction between different levels of importance, and filter some out if needed.
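A minimal sketch, carrying the same message as your current `print` call:

```
import logging

logging.basicConfig(level=logging.INFO)

# Same information as the print call, but with a severity level attached,
# so it can be filtered out or redirected later.
logging.warning("Error on %s", path)
```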
generators
To prevent a `MemoryError` you can use generators. These are special functions that do their work piece by piece, and can work without keeping the whole structure in memory.

You can have one generator generate the files:
```
import typing
from pathlib import Path


def iterate_files(path: Path) -> typing.Iterator[Path]:
    """Recursively iterates over `path`, yielding all the correct files"""
    for file in path.glob("*"):
        if file.is_dir():
            yield from iterate_files(file)
        else:
            # or a check that the file has a certain suffix
            yield file
```
Then you feed this iterator to the parser generator
```
def parse_files(files: typing.Iterator[Path]) -> typing.Iterator[str]:
    """Parse the xml files."""
    for file in files:
        try:
            xml = ET.parse(file).getroot()
            yield ET.tostring(xml, encoding='unicode')
        except <stricter exception>:
            logging.warning(f"error in {file}")
            raise
```
Instead of that single last `except`, you can have different `except` blocks with different results, as sketched below.
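For instance, taking `ET.ParseError` for broken files and `OSError` for unreadable ones (just plausible examples - which exceptions deserve their own branch depends on the failures you actually see), the loop inside `parse_files` could become:

```
    for file in files:
        try:
            xml = ET.parse(file).getroot()
            yield ET.tostring(xml, encoding='unicode')
        except ET.ParseError:
            # Malformed XML: note it and skip just this one file.
            logging.warning("malformed XML in %s", file)
        except OSError:
            # Unreadable file: treat it as fatal and let it bubble up.
            logging.error("could not read %s", file)
            raise
```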
You can then feed this to another generator which writes it to a file:
```
def write_to_file(
    text_iterator: typing.Iterable[str], output_filehandle: typing.TextIO
) -> typing.Iterable[str]:
    for chunk in text_iterator:
        output_filehandle.write(chunk)
        yield chunk
```
putting it together
```
if __name__ == "__main__":
    path = Path("PLC 2360")
    files = iterate_files(path)
    parsed_files = parse_files(files)
    with Path(<output_path>).open("w") as output_filehandle:
        parsed_files_after_writing = write_to_file(
            text_iterator=parsed_files, output_filehandle=output_filehandle
        )
        ...
```
In that last part, I've opened the file in the main part of the script, taking into account the principles of clean architecture.
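One caveat worth spelling out: `write_to_file` is itself a generator, so nothing is actually written until something iterates over `parsed_files_after_writing`. If there is no further processing to do, a cheap way to drain the pipeline is:

```
from collections import deque

# Consume the whole iterator without storing anything;
# a deque with maxlen=0 throws every item away as it arrives.
deque(parsed_files_after_writing, maxlen=0)
```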
- Very Much Appreciated - These are all valuable and incredibly neat points - I hope to develop my programming structure skills to this kind of standard - Awesome! Learned tonnes. – leopardxpreload, Jun 16, 2020 at 11:03
- You don't have to accept straight away. You can wait a bit and give other people some time to give their feedback as well. – Maarten Fabré, Jun 16, 2020 at 11:49