I was faced with a challenge when trying to load XML files from Stack Exchange into a SQL Server database: some of these XML files are extremely large (the largest being a whopping 67 GB in a single XML file), so much so that the database just cannot handle them without throwing a System.OutOfMemoryException.
After some testing, I found out the optimal XML size for the database to handle efficiently (at least on my database) is ~20 MB. So I wrote this Python script to slice up any file larger than 20 MB into more-or-less equal files < 20 MB each.
It works pretty well, but my code looks very procedural, and since I am very inexperienced with Python I am not certain how it could be improved. I have added documentation throughout to, hopefully, make the script easy to follow.
I am primarily interested in improving the performance, although all improvement suggestions are appreciated. Note that I did not use an XML module, due to the very simple structure of these XML files; that is why I decided to just parse them as regular text. I used UTF-8 encoding for the output files to match the input.
The script prints this summary at the end:
Path: D:\Downloads\stackexchange\stackoverflow.com\Badges
Size: 2016042158 bytes
17207172 total lines, split into 101 files = 170369 lines per input_file.
Execution time: 0:01:16.416938
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import math
from datetime import datetime
__author__ = 'https://github.com/Phrancis'
'''
The purpose of this script is to split up very large XML files from the Stack Exchange
public data dump into small files which can then efficiently be loaded into a Microsoft SQL Server
database for further processing.
Testing with SQL Server on the author's database has proven that XML file sizes of ~20 MB
have reasonable performance and processing time. Larger files have led to System.OutOfMemoryException
on the database. The size of output files can be adjusted by changing the value of
the SQL_SERVER_XML_SIZE_LIMIT variable.
NOTE: This script was made using Python 3.5 and is not compatible with Python 2.x.
This script assumes the format of the Stack Exchange XML files to be as following,
and will not work correctly with differently-formatted XML.
<?xml version="1.0" encoding="utf-8"?>
<root_name>
<row Id="1" ... />
<row Id="2" ... />
<row Id="3" ... />
...
<row Id="176685" ... />
</root_name>
'''
# Clock to measure how long the script takes to execute
start = datetime.now()
SQL_SERVER_XML_SIZE_LIMIT = 20000000
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
# get input_file size
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', file_path + file_name + '.xml')
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
else:
    num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
    input_file = open(file_path + file_name + '.xml', 'r', encoding='utf-8')
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    xml_ver = input_file.readline()
    root_open = input_file.readline()
    num_lines_in_file = 2
    for line in input_file:
        root_close = line
        num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to start of input_file
    input_file.seek(0)
    # skip top 2 lines of input_file as they contain the xml_ver and root_open
    input_file.readline()
    input_file.readline()
    for current_file_num in range(1, num_split_files_needed + 1):
        with open(file_path + file_name + str(current_file_num) + '.xml', 'w+b') as output_file:
            print('Writing to:', file_path + file_name + str(current_file_num) + '.xml')
            # write XML header
            output_file.write(xml_ver.encode())
            output_file.write(root_open.encode())
            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num <= num_lines_per_split_file and line != root_close:
                    output_file.write(line.encode())
                    output_line_num += 1
                else:
                    break
            # write the footer as the last line in the file
            output_file.write(root_close.encode())
        # move on to the next output file
        current_file_num += 1
# Clean up and print results
input_file.close()
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
2 Answers
I think most comments are just going to be about intricacies of Python; the general structure and comments are great.
The only other comment would be that you can probably open the file in binary mode, skip any text encoding conversion, and save some processing time that way.
I don't quite get why you have to go through the whole file just to get the closing XML line ... it should be pretty clear what that line is going to be, considering the opening XML line? I'd probably change the whole part to just read from the file, accumulate as much as possible and then open the next file, instead of going through the file twice.
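As a rough illustration of that single-pass idea (just a sketch, not the refactoring shown further down; deriving the closing tag from the root's opening line is an assumption here, not something taken from the original script):
def split_in_one_pass(full_file_path, size_limit):
    # single pass: accumulate rows until the size limit is reached, then roll
    # over to the next output file, so the input is only read once
    with open(full_file_path, 'rb') as input_file:
        xml_ver = input_file.readline()
        root_open = input_file.readline()
        # assumption: '<badges>' closes as '</badges>', i.e. no attributes on the root line
        root_close = b'</' + root_open.strip().lstrip(b'<')
        header = xml_ver + root_open
        part_num = 0
        output_file = None
        for line in input_file:
            if line.strip() == root_close:
                break
            if output_file is None or output_file.tell() >= size_limit:
                if output_file is not None:
                    output_file.write(root_close + b'\n')
                    output_file.close()
                part_num += 1
                output_file = open(full_file_path[:-4] + str(part_num) + '.xml', 'wb')
                output_file.write(header)
            output_file.write(line)
        if output_file is not None:
            output_file.write(root_close + b'\n')
            output_file.close()
    return part_num
That avoids the second pass entirely and sizes the parts by bytes written rather than by line count.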
- Since you're using Python 3, `#!/usr/bin/python` isn't guaranteed to be it on many distributions, so I'd say using `#!/usr/bin/env python3` is a bit safer - that also deals with the binary being in another location.
- The last bit after `# Clean up and print results` should also be indented, no? All the variables are only set in the `else` branch. I'd actually put a `sys.exit()` in the `if` part and not have the indentation at all.
- Take a look at `os.path` for path manipulation; those functions are portable and a bit more structured than concatenating strings (a small sketch follows this list).
- I'd use `with` with the `input_file` too.
- The standard Python interpreter won't extract common subexpressions, so you'll have to (and probably should) do that yourself - the `encode` calls and the repeated occurrences of paths shouldn't be recomputed all the time.
- Getting the last item from an iterator is actually a question on Stack Overflow, c.f. https://stackoverflow.com/a/2138894/2769043 - perhaps use that (also shown in the sketch after this list).
- The comparison `line != root_close` could probably be replaced with a check of the line number against the line number of the last line, which should be much faster than a string comparison.
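To make the `os.path` and last-item points concrete, a minimal sketch (the directory and file name are just the ones from the question, and the `deque` trick is one of the approaches from the linked Stack Overflow answer):
import os
from collections import deque

file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com'
file_name = 'Badges'
# os.path.join assembles the path portably instead of concatenating strings
full_file_path = os.path.join(file_path, file_name + '.xml')

with open(full_file_path, 'r', encoding='utf-8') as input_file:
    # a deque with maxlen=1 only ever keeps the most recent line,
    # so after consuming the file it holds the closing root tag
    root_close = deque(input_file, maxlen=1).pop()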
Something like this, still same general approach though.
#!/usr/bin/env python3
...
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
full_file_path = file_path + file_name + '.xml'
# get input_file size
file_size_bytes = os.path.getsize(full_file_path)
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()
num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
with open(full_file_path, 'rb') as input_file:
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    opening = input_file.readline() + input_file.readline()
    pointer = input_file.tell()
    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to start of input
    input_file.seek(pointer)
    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'
        with open(full_current_file_path, 'w+b') as output_file:
            print('Writing to:', full_current_file_path)
            # write XML header
            output_file.write(opening)
            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num > num_lines_per_split_file or line == root_close:
                    break
                output_file.write(line)
                output_line_num += 1
            # write the footer as the last line in the file
            output_file.write(root_close)
        # move on to the next output file
        current_file_num += 1
# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
- There is a problem in the refactored code: this line `with open(full_current_file_path, 'w+b') as output_file` would overwrite data from the input file. My version uses a different file name with `str(current_file_num)` added (this will now also get a code comment :) – Phrancis, Sep 4, 2016 at 14:45
- I don't quite understand that comment; `full_current_file_path` is set in each iteration based on `current_file_num` as well. I just extracted the duplicated code for it from the given code, so the semantics shouldn't have changed? – ferada, Sep 5, 2016 at 16:50
Consider actually using an XML module, specifically etree's `iterparse()`, which is made for large XML files: it reads the document iteratively instead of loading the whole thing into memory at once. Additionally, you no longer treat the XML as a text file by writing in headers and closing tags yourself, since the dedicated `.append()` and `.write()` methods are used and handle the encoding.
#!/usr/bin/python
import os
import math
from datetime import datetime
import xml.etree.ElementTree as et
# Clock to measure how long the script takes to execute
start = datetime.now()
SQL_SERVER_XML_SIZE_LIMIT = 20000000
file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com\\'
file_name = 'Badges'
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', file_path + file_name + '.xml')
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
else:
    num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
    # determine how many lines to write to each split output input_file
    num_lines_in_file = 2
    with open(file_path + file_name + '.xml', 'r', encoding='utf-8') as f:
        for line in f:
            num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # ITERATIVELY READ LINES IN XML FILE
    i = 0; current_file_num = 0
    root = et.Element('root')
    for (ev, el) in et.iterparse(file_path + file_name + '.xml'):
        i += 1
        if el.tag == 'row':
            root.append(el)
            # LINES PER FILE (BY MULTIPLE OF num_lines_per_split_file)
            if i % num_lines_per_split_file == 0:
                current_file_num += 1
                tree_out = et.ElementTree(root)
                tree_out.write(file_path + file_name + str(current_file_num) + '.xml',
                               encoding='utf-8', xml_declaration=True)
                root = et.Element('root')
            # REMAINING LINES (AFTER LAST MULTIPLE OF num_lines_per_split_file)
            # (5 = XML DECL, ROOT OPEN/CLOSE + num_lines start + i start = 1 + 2 + 1 + 1)
            if i == num_lines_in_file - 5:
                current_file_num += 1
                tree_out = et.ElementTree(root)
                tree_out.write(file_path + file_name + str(current_file_num) + '.xml',
                               encoding='utf-8', xml_declaration=True)
# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =',
num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
- Isn't the issue that the database can't load the XML? Then it doesn't matter much whether the Python script uses an iterative approach - it's unlikely to be faster than not parsing the content at all. – ferada, Sep 5, 2016 at 16:55
- @ferada - I do not understand your comment. This approach is an alternative to yours and the OP's, but uses an XML module to split the large XML. I often advise not treating XML as a text file by writing in tags, as you may disrupt well-formed DOM structures, entities, whitespace, even encoding. Just writing in the header string doesn't make it UTF-8 encoded. – Parfait, Sep 5, 2016 at 17:40
- Of course, I'm assuming @Phrancis made sure that the files are actually formatted in a way that makes the pure binary approach viable, so nothing split across lines etc. – ferada, Sep 5, 2016 at 17:53