The code below takes a directory of XML files and parses them into a single CSV file. Right now, parsing around 60 XML files is fast, but the resulting CSV file is around 250 MB.
The file is that big because columns are repeated: I repeat them so that every row carries all of its information. For example, the ID Z048 has multiple lines of SetData, so all of the other columns for that ID are duplicated on each of those rows.
I'm planning to increase the number of XML files to about 5,000, which means the CSV file will get considerably larger.
I'm asking this question to find out whether the size of my CSV file can be reduced somehow, even though I tried to write the code with the goal of being fast and not producing overly large CSV files.
from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path

directory = 'path to a folder with xml files'

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num',
               'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value',
               'set_data_x']
    writer.writerow(headers)

    xml_files_list = list(map(str, Path(directory).glob('**/*.xml')))
    print(xml_files_list)

    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()

        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)
            repeated_values = dict()
            for k, v in sn.attrib.items():
                repeated_values[k] = v
            for rn in sn.findall('.//Rational'):
                repeated_values['rational'] = rn.text
            for qu in sn.findall('.//Qualify'):
                repeated_values['qualify'] = qu.text
            for ds in sn.findall('.//Description'):
                repeated_values['description_txt'] = ds.text
                repeated_values['description_num'] = ds.attrib['num']
            for st in sn.findall('.//SetData'):
                for k, v in st.attrib.items():
                    row['set_data_' + str(k)] = v
                for key in repeated_values.keys():
                    row[key] = repeated_values[key]
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)
1 Answer
At this point your code is starting to become hard to follow. If you add any more complicated logic, you will not be able to easily understand it yourself if you return to it a few weeks later.
I would first of all separate out the parsing of an XML file so that you can change it independently of everything else. You can make this a generator, so it doesn't need to know what happens with the data it provides (being written to a CSV file, printed to the terminal, shucked into /dev/null, ...).
Using findall to find one element is not the right approach. Either there is exactly one './/Rational' element, in which case you can just use find, or there is not and you have to store all of them (or rethink your XML design).
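As a quick illustration of the difference, here is a minimal sketch (the element names mirror your code; the content is made up):

from xml.etree import ElementTree as ET

# A tiny, made-up START element with one Rational and two SetData children.
node = ET.fromstring(
    "<START><Rational>abc</Rational><SetData x='1'/><SetData x='2'/></START>"
)

print(node.find('.//Rational').text)    # 'abc' -- the first (and only) match
print(len(node.findall('.//SetData')))  # 2 -- findall returns a list of every match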
You can use a dictionary comprehension and the relatively new f-strings to easily generate your row data (don't worry about missing keys here, we'll take care of them outside).
def parse_file(xml_file):
    tree = ET.parse(xml_file)
    for node in tree.getroot().findall('.//START'):
        repeated_values = dict(node.attrib)
        repeated_values['rational'] = node.find('.//Rational').text
        repeated_values['qualify'] = node.find('.//Qualify').text
        description = node.find('.//Description')
        repeated_values['description_txt'] = description.text
        repeated_values['description_num'] = description.get('num')
        for data in node.findall('.//SetData'):
            row = {f"set_data_{key}": value for key, value in data.attrib.items()}
            row.update(repeated_values)
            yield row
The csv module has a DictReader and, more importantly, a DictWriter class. These are a bit easier to use since you don't need to enforce the correct order in the parsing. DictWriter also has the writerows method (like all other csv writers), which can take an iterable of dictionaries to write. This is slightly faster than writing each row manually, and also more readable.
def join_xml_files(files, output_file, headers):
    with open(output_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
        for xml_file in files:
            writer.writerows(parse_file(xml_file))
Note that restval="" is already set by default and replaces missing keys with an empty string, so there is no need for collections.defaultdict(str) anymore.
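If you want to see that behaviour in isolation, here is a tiny standalone sketch (unrelated to the XML code, just the csv module):

import csv
import io

# The second row is missing the 'b' key; DictWriter fills it with restval,
# which defaults to the empty string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['a', 'b'])
writer.writeheader()
writer.writerows([{'a': 1, 'b': 2}, {'a': 3}])
print(buffer.getvalue())   # a,b / 1,2 / 3,   -- the missing value becomes ""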
You can then put the actual calling code under an if __name__ == "__main__": guard, which allows you to import from this script without running it:
if __name__ == "__main__":
    directory = 'path to a folder with xml files'
    xml_files = map(str, Path(directory).glob('**/*.xml'))
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num',
               'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value',
               'set_data_x']
    # actually call the function
    join_xml_files(xml_files, "output.csv", headers)
As to reducing the file size: that's not possible without removing information. You could zip up the resulting file (it would still be just as large in memory when unzipping it, though), or you can leave out the redundant information (but then all programs reading this file need to fill in the information when necessary).
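If compressing after the fact is good enough, a sketch along these lines would work (assuming the output file is called output.csv as above):

import gzip
import shutil

# Compress the finished CSV; highly repetitive CSVs usually shrink a lot.
with open('output.csv', 'rb') as src, gzip.open('output.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

You could also write the CSV through gzip.open(output_file, 'wt', newline='') directly inside join_xml_files, so the uncompressed file never touches the disk at all.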
- Thank you for your answer. I understand that the size of the file can't be reduced. The idea of breaking the code down into smaller parts so they can be modified more easily later on is really helpful. – ebe, Jan 23, 2020 at 14:22
- Can you give me any advice? Once the CSV file is generated I try to read it and get this error: ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2. – ebe, Jan 23, 2020 at 20:44
- @ebe: That sounds like the CSV is malformed; you should look directly at the file in a text editor to see exactly what the problem is. Maybe it is newlines, maybe quoting, maybe the separator, it could be many things. – Graipher, Jan 24, 2020 at 9:49
- I'm getting a SyntaxError: invalid syntax at the line row = {f"set_data_{key}": value for key, value in data.attrib.items())}. – ebe, Jan 27, 2020 at 16:51
- @ebe: Which version of Python are you using? The f-string I used there is Python 3.6+. If you are on an older version, use row = {"set_data_{}".format(key): value for key, value in data.attrib.items()} instead. – Graipher, Jan 28, 2020 at 9:43