Python3 - Recursive XML Files to Large Text File
I have a recursive function that looks through a directory (there are 6000+ files). For each XML file found, the XML is converted to a string and appended to an array. At the end of the recursion, all these strings are joined to return a (very) large string.

The error I am getting is `MemoryError` - how can I optimise my code?
```
import os
import xml.etree.ElementTree as ET

# Cycle through a directory and concat files together - recursive dive.
# NOTE:
# Have a folder called 'PLC 2360' in the immediate directory.
# The output is a huge text file

path = 'PLC 2360'
big_text = []


def big_file(path):
    try:
        for i in next(os.walk(path))[2]:  # For each XML file read and convert to string
            xml = ET.parse(path + '/' + i).getroot()
            big_text.append(ET.tostring(xml, encoding='unicode'))  # Append string of XML
    except:
        print('Error on', path)
    if next(os.walk(path))[1]:  # If folders exist - cycle through
        for i in next(os.walk(path))[1]:
            big_file(path + '/' + i)  # Enter next recursion layer
    return ''.join(str(i) for i in big_text)  # Return one big string


lis = big_file(path)
print(len(lis))
print(lis[:500])
```
- "and appended to an array" Why? If you want to append to a file, append to a file. What's the array for? There are only files being read by your code, not written. I don't see a target text file anywhere, so it looks like the code as written does not produce the required result yet. Please take a look at the help center. – Mast, Jun 16, 2020 at 7:07
- I'm assuming you're doing this to create plain-text exports of PLC tag data? – Mast, Jun 16, 2020 at 7:11
- @Mast - I am appending to an array currently in case I need to do some specific preprocessing of certain files - my output is currently a string from the function to which I can write to a file later. Yes, PLC tags, good spot - I am also considering using Threading to shoot off threads but not sure how well that will work with recursion. – leopardxpreload, Jun 16, 2020 at 8:14
- Threading may only make your problem worse at this point, since you'll run out of memory even faster. Keeping the data in memory because you may need to do something sounds like a good reason, but it's really not feasible with the amount of data you're handling. Is keeping one file in memory at once an acceptable alternative? I can't help but feel there must be more code than you're showing. On Code Review, details matter. – Mast, Jun 16, 2020 at 8:29
- There are options for compressing, but as long as you have the storage space, it does not matter initially. Your tag files are not going to fill your entire memory if you load them one-by-one, they're not big enough for that. The problem is you're trying to keep them all in memory at once. That's not going to work if you have thousands of files. – Mast, Jun 16, 2020 at 9:08
1 Answer
functions
Your script does a few things:
- iterate over the files
- recursively iterate over the subdirectories
- parse the files
- concatenate the results
Better would be to separate those into different functions.
The extra advantage is that you can test these functions separately, document them with a docstring, and add type hints.
comments
You comment what the code does. Python is luckily so expressive that almost anyone can understand what a particular line does. What is more difficult is why you do certain steps, and why you do them in a certain order. That is what should be commented.
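For example (the "why" comment below is invented purely to illustrate the distinction, it is not taken from your project):

```
# What the line does - the code already says this, so the comment adds nothing:
big_text.append(ET.tostring(xml, encoding='unicode'))  # Append string of XML

# Why the line is there - this is the part a reader cannot guess:
# keep the raw XML text around so the tag exports can be post-processed
# before the combined file is written out.
big_text.append(ET.tostring(xml, encoding='unicode'))
```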
global variables
Your `big_file` function changes the global state of the program. That makes it more difficult to reason about, and also makes it difficult to use this function on 2 separate directories. Here you append to `big_text`. If you want to keep it like this, I would pass it in as a function parameter instead of a global variable:
```
def big_file(path, big_text=None):
    if big_text is None:
        big_text = []
    ...
    big_file(path + '/' + i, big_text=big_text)
```
pathlib.Path
Most file operations are simpler when using the `pathlib` module. It will be a lot more robust than manually concatenating paths like in `path + '/' + i`.
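A small sketch of the difference (the file and folder names are just placeholders):

```
from pathlib import Path

base = Path('PLC 2360')
xml_file = base / 'subfolder' / 'tags.xml'   # instead of path + '/' + i
xml_file.suffix                              # '.xml'
xml_file.is_file()                           # uses the right separator on any platform
```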
error handling
You have a `try`-`except` block with a bare `except`. Better here would be to catch the errors you expect specifically and handle those, and let other, unexpected errors bubble up. Failing hard and fast, instead of covering up bugs, can help you write more stable and correct software.
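For instance, if the failures you expect are malformed XML and unreadable files (an assumption - adjust this to whatever errors actually show up), the bare `except` in your loop could become:

```
try:
    xml = ET.parse(path + '/' + i).getroot()
    big_text.append(ET.tostring(xml, encoding='unicode'))
except ET.ParseError:
    # The file exists but is not valid XML - report it and move on.
    print('Malformed XML in', path + '/' + i)
except OSError:
    # The file could not be opened at all (permissions, broken link, ...).
    print('Could not read', path + '/' + i)
```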
logging
Instead of using `print`, you can use the `logging` module. That way you can make a distinction between different levels of importance, and filter some out if needed.
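A minimal sketch, carrying the same message as your current `print` call:

```
import logging

logging.basicConfig(level=logging.INFO)

# Same information as the print call, but with a severity level attached,
# so it can be filtered out or redirected later.
logging.warning("Error on %s", path)
```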
generators
To prevent a `MemoryError` you can use generators. These are special functions that do their work piece by piece, and can work without keeping the whole structure in memory.

You can have one generator generate the files:
```
import typing
from pathlib import Path


def iterate_files(path: Path) -> typing.Iterator[Path]:
    """Recursively iterates over `path`, yielding all the correct files"""
    for file in path.glob("*"):
        if file.is_dir():
            yield from iterate_files(file)
        else:
            # or a check that the file has a certain suffix
            yield file
```
Then you feed this iterator to the parser generator
```
def parse_files(files: typing.Iterator[Path]) -> typing.Iterator[str]:
    """Parse the xml files."""
    for file in files:
        try:
            xml = ET.parse(file).getroot()
            yield ET.tostring(xml, encoding='unicode')
        except <stricter exception>:
            logging.warning(f"error in {file}")
            raise
```
Instead of that single last `except`, you can have different `except` blocks with different results, as sketched below.
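For instance, taking `ET.ParseError` for broken files and `OSError` for unreadable ones (just plausible examples - which exceptions deserve their own branch depends on the failures you actually see), the loop inside `parse_files` could become:

```
    for file in files:
        try:
            xml = ET.parse(file).getroot()
            yield ET.tostring(xml, encoding='unicode')
        except ET.ParseError:
            # Malformed XML: note it and skip just this one file.
            logging.warning("malformed XML in %s", file)
        except OSError:
            # Unreadable file: treat it as fatal and let it bubble up.
            logging.error("could not read %s", file)
            raise
```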
You can then feed this to another generator which writes it to a file:
```
def write_to_file(
    text_iterator: typing.Iterable[str], output_filehandle: typing.TextIO
) -> typing.Iterable[str]:
    for chunk in text_iterator:
        output_filehandle.write(chunk)
        yield chunk
```
putting it together
```
if __name__ == "__main__":
    path = Path("PLC 2360")
    files = iterate_files(path)
    parsed_files = parse_files(files)
    with Path(<output_path>).open("w") as output_filehandle:
        parsed_files_after_writing = write_to_file(
            text_iterator=parsed_files, output_filehandle=output_filehandle
        )
        ...
```
In that last part, I've opened the file in the main part of the script, taking into account the principles of clean architecture.
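One caveat worth spelling out: `write_to_file` is itself a generator, so nothing is actually written until something iterates over `parsed_files_after_writing`. If there is no further processing to do, a cheap way to drain the pipeline is:

```
from collections import deque

# Consume the whole iterator without storing anything;
# a deque with maxlen=0 throws every item away as it arrives.
deque(parsed_files_after_writing, maxlen=0)
```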
- Very Much Appreciated - These are all valuable and incredibly neat points - I hope to develop my programming structure skills to this kind of standard - Awesome! Learned tonnes. – leopardxpreload, Jun 16, 2020 at 11:03
- You don't have to accept straight away. You can wait a bit and give other people some time to give their feedback as well. – Maarten Fabré, Jun 16, 2020 at 11:49