The code below takes a directory of XML files and parses them into a single CSV file. Right now, parsing around 60 XML files is fast, but the resulting CSV file is around 250 MB.
The file is that big because columns are repeated: I repeat them so that every row carries all of its information. For example, the ID Z048 has multiple lines of SetData, so all of the other columns for that ID are duplicated on each of those rows.
I'm planning to increase the number of XML files to about 5,000, which means the CSV file will get considerably larger.
I'm asking this question to find out whether the size of my CSV file can be reduced somehow, even though I tried to write the code with the goal of being fast and not producing overly large CSV files.
from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path

directory = 'path to a folder with xml files'

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num',
               'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value',
               'set_data_x']
    writer.writerow(headers)

    xml_files_list = list(map(str, Path(directory).glob('**/*.xml')))
    print(xml_files_list)

    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()

        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)
            repeated_values = dict()
            for k, v in sn.attrib.items():
                repeated_values[k] = v
            for rn in sn.findall('.//Rational'):
                repeated_values['rational'] = rn.text
            for qu in sn.findall('.//Qualify'):
                repeated_values['qualify'] = qu.text
            for ds in sn.findall('.//Description'):
                repeated_values['description_txt'] = ds.text
                repeated_values['description_num'] = ds.attrib['num']
            for st in sn.findall('.//SetData'):
                for k, v in st.attrib.items():
                    row['set_data_' + str(k)] = v
                for key in repeated_values.keys():
                    row[key] = repeated_values[key]
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)
1 Answer
At this point your code is starting to become hard to follow. If you add any more complicated logic, you will not be able to easily understand it yourself if you return to it a few weeks later.
I would first of all separate out the parsing of an XML file so that you can change it independently of everything else. You can make this a generator, so it doesn't need to know what happens with the data it provides (being written to a CSV file, printed to the terminal, shucked into /dev/null, ...).
Using findall to find one element is not the right approach. Either there is exactly one './/Rational' element, in which case you can just use find, or there is not and you have to store all of them (or rethink your XML design).
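As a quick illustration of the difference, here is a minimal sketch (the element names mirror your code; the content is made up):

from xml.etree import ElementTree as ET

# A tiny, made-up START element with one Rational and two SetData children.
node = ET.fromstring(
    "<START><Rational>abc</Rational><SetData x='1'/><SetData x='2'/></START>"
)

print(node.find('.//Rational').text)    # 'abc' -- the first (and only) match
print(len(node.findall('.//SetData')))  # 2 -- findall returns a list of every match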
You can use a dictionary comprehension and the relatively new f-strings to easily generate your row data (don't worry about missing keys here, we'll take care of them outside).
def parse_file(xml_file):
    tree = ET.parse(xml_file)
    for node in tree.getroot().findall('.//START'):
        repeated_values = dict(node.attrib)
        repeated_values['rational'] = node.find('.//Rational').text
        repeated_values['qualify'] = node.find('.//Qualify').text
        description = node.find('.//Description')
        repeated_values['description_txt'] = description.text
        repeated_values['description_num'] = description.get('num')
        for data in node.findall('.//SetData'):
            row = {f"set_data_{key}": value for key, value in data.attrib.items()}
            row.update(repeated_values)
            yield row
The csv module has a DictReader and, more importantly, a DictWriter class. These are a bit easier to use since you don't need to enforce the correct order in the parsing. DictWriter also has the writerows method (like all other csv writers), which can take an iterable of dictionaries to write. This is slightly faster than writing each row manually, and also more readable.
def join_xml_files(files, output_file, headers):
    with open(output_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
        for xml_file in files:
            writer.writerows(parse_file(xml_file))
Note that restval="" is already set by default and replaces missing keys with an empty string, so there is no need for collections.defaultdict(str) anymore.
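If you want to see that behaviour in isolation, here is a tiny standalone sketch (unrelated to the XML code, just the csv module):

import csv
import io

# The second row is missing the 'b' key; DictWriter fills it with restval,
# which defaults to the empty string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['a', 'b'])
writer.writeheader()
writer.writerows([{'a': 1, 'b': 2}, {'a': 3}])
print(buffer.getvalue())   # a,b / 1,2 / 3,   -- the missing value becomes ""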
You can then put the actual calling code under an if __name__ == "__main__": guard, which allows you to import from this script without running it:
if __name__ == "__main__":
    directory = 'path to a folder with xml files'
    xml_files = map(str, Path(directory).glob('**/*.xml'))
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num',
               'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value',
               'set_data_x']
    # actually call the function
    join_xml_files(xml_files, "output.csv", headers)
As to reducing the file size: that's not possible without removing information. You could zip up the resulting file (it would still be just as large in memory when unzipping it, though), or you can leave out the redundant information (but then all programs reading this file need to fill in the information when necessary).
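If compressing after the fact is good enough, a sketch along these lines would work (assuming the output file is called output.csv as above):

import gzip
import shutil

# Compress the finished CSV; highly repetitive CSVs usually shrink a lot.
with open('output.csv', 'rb') as src, gzip.open('output.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

You could also write the CSV through gzip.open(output_file, 'wt', newline='') directly inside join_xml_files, so the uncompressed file never touches the disk at all.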
- Thank you for your answer. I understand that the size of the file can't be reduced. The idea of breaking the code down into smaller parts so they can be modified more easily later on is really helpful. – ebe, Jan 23, 2020 at 14:22
- Can you give me any advice? Once the CSV file is generated I try to read it and get this error: ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2. – ebe, Jan 23, 2020 at 20:44
- @ebe: That sounds like the CSV is malformed; you should look directly at the file in a text editor to see exactly what the problem is. Maybe it is newlines, maybe quoting, maybe the separator, it could be many things. – Graipher, Jan 24, 2020 at 9:49
- I'm getting a SyntaxError: invalid syntax at the line row = {f"set_data_{key}": value for key, value in data.attrib.items())}. – ebe, Jan 27, 2020 at 16:51
- @ebe: Which version of Python are you using? The f-string I used there is Python 3.6+. If you are on an older version, use row = {"set_data_{}".format(key): value for key, value in data.attrib.items()} instead. – Graipher, Jan 28, 2020 at 9:43