I was faced with a challenge when trying to load XML files from Stack Exchange into a SQL Server database: some of these XML files are extremely large (the largest being a whopping 67 GB in a single XML file), so much so that the database just cannot handle them without throwing a System.OutOfMemoryException.
After some testing, I found out the optimal XML size for the database to handle efficiently (at least on my database) is ~20 MB. So I wrote this Python script to slice up any file larger than 20 MB into more-or-less equal files < 20 MB each.
It works pretty well, but my code looks very procedural, and since I am very inexperienced with Python I am not certain how it could be improved. I have added documentation throughout to, hopefully, make the script easy to follow.
I am primarily interested in improving the performance, although all improvement suggestions are appreciated. Note that I did not use an XML module, due to the very simple structure of these XML files; that is why I decided to just parse them as regular text. I used UTF-8 encoding for the output files to match the input.
The script prints this summary at the end:
Path: D:\Downloads\stackexchange\stackoverflow.com\Badges
Size: 2016042158 bytes
17207172 total lines, split into 101 files = 170369 lines per input_file.
Execution time: 0:01:16.416938
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import math
from datetime import datetime
__author__ = 'https://github.com/Phrancis'
'''
The purpose of this script is to split up very large XML files from the Stack Exchange
public data dump into small files which can then efficiently be loaded into a Microsoft SQL Server
database for further processing.
Testing with SQL Server on the author's database has proven that XML file sizes of ~20 MB
have reasonable performance and processing time. Larger files have led to System.OutOfMemoryException
on the database. The size of output files can be adjusted by changing the value of
the SQL_SERVER_XML_SIZE_LIMIT variable.
NOTE: This script was made using Python 3.5 and is not compatible with Python 2.x.
This script assumes the format of the Stack Exchange XML files to be as following,
and will not work correctly with differently-formatted XML.
<?xml version="1.0" encoding="utf-8"?>
<root_name>
<row Id="1" ... />
<row Id="2" ... />
<row Id="3" ... />
...
<row Id="176685" ... />
</root_name>
'''
# Clock to measure how long the script takes to execute
start = datetime.now()
SQL_SERVER_XML_SIZE_LIMIT = 20000000
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
# get input_file size
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', file_path + file_name + '.xml')
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
else:
    num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
    input_file = open(file_path + file_name + '.xml', 'r', encoding='utf-8')
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    xml_ver = input_file.readline()
    root_open = input_file.readline()
    num_lines_in_file = 2
    for line in input_file:
        root_close = line
        num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to start of input_file
    input_file.seek(0)
    # skip top 2 lines of input_file as they contain the xml_ver and root_open
    input_file.readline()
    input_file.readline()
    for current_file_num in range(1, num_split_files_needed + 1):
        with open(file_path + file_name + str(current_file_num) + '.xml', 'w+b') as output_file:
            print('Writing to:', file_path + file_name + str(current_file_num) + '.xml')
            # write XML header
            output_file.write(xml_ver.encode())
            output_file.write(root_open.encode())
            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num <= num_lines_per_split_file and line != root_close:
                    output_file.write(line.encode())
                    output_line_num += 1
                else:
                    break
            # write the footer as the last line in the file
            output_file.write(root_close.encode())
        # move on to the next output file
        current_file_num += 1
# Clean up and print results
input_file.close()
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
2 Answers
I think most comments are just going to be about intricacies of Python; the general structure and comments are great.
The only other comment would be that you can probably open the file in binary mode, skip any text encoding conversion, and save some processing time that way.
I don't quite get why you have to go through the whole file just to get the closing XML line ... it should be pretty clear what that line is going to be, considering the opening XML line? I'd probably change the whole part to just read from the file, accumulate as much as possible and then open the next file, instead of going through the file twice.
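As a rough illustration of that single-pass idea (just a sketch, not the refactoring shown further down; deriving the closing tag from the root's opening line is an assumption here, not something taken from the original script):
def split_in_one_pass(full_file_path, size_limit):
    # single pass: accumulate rows until the size limit is reached, then roll
    # over to the next output file, so the input is only read once
    with open(full_file_path, 'rb') as input_file:
        xml_ver = input_file.readline()
        root_open = input_file.readline()
        # assumption: '<badges>' closes as '</badges>', i.e. no attributes on the root line
        root_close = b'</' + root_open.strip().lstrip(b'<')
        header = xml_ver + root_open
        part_num = 0
        output_file = None
        for line in input_file:
            if line.strip() == root_close:
                break
            if output_file is None or output_file.tell() >= size_limit:
                if output_file is not None:
                    output_file.write(root_close + b'\n')
                    output_file.close()
                part_num += 1
                output_file = open(full_file_path[:-4] + str(part_num) + '.xml', 'wb')
                output_file.write(header)
            output_file.write(line)
        if output_file is not None:
            output_file.write(root_close + b'\n')
            output_file.close()
    return part_num
That avoids the second pass entirely and sizes the parts by bytes written rather than by line count.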
- Since you're using Python 3, `#!/usr/bin/python` isn't guaranteed to be it on many distributions, so I'd say using `#!/usr/bin/env python3` is a bit safer - that also deals with the binary being in another location.
- The last bit after `# Clean up and print results` should also be indented, no? All the variables are only set in the `else` branch. I'd actually put a `sys.exit()` in the `if` part and not have the indentation at all.
- Take a look at `os.path` for path manipulation; those functions are portable and a bit more structured than concatenating strings (a small sketch follows this list).
- I'd use `with` with the `input_file` too.
- The standard Python interpreter won't extract common subexpressions, so you'll have to (and probably should) do that yourself - the `encode` calls and the repeated occurrences of paths shouldn't be recomputed all the time.
- Getting the last item from an iterator is actually a question on Stack Overflow, c.f. https://stackoverflow.com/a/2138894/2769043 - perhaps use that (also shown in the sketch after this list).
- The comparison `line != root_close` could probably be replaced with a check of the line number against the line number of the last line, which should be much faster than a string comparison.
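To make the `os.path` and last-item points concrete, a minimal sketch (the directory and file name are just the ones from the question, and the `deque` trick is one of the approaches from the linked Stack Overflow answer):
import os
from collections import deque

file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com'
file_name = 'Badges'
# os.path.join assembles the path portably instead of concatenating strings
full_file_path = os.path.join(file_path, file_name + '.xml')

with open(full_file_path, 'r', encoding='utf-8') as input_file:
    # a deque with maxlen=1 only ever keeps the most recent line,
    # so after consuming the file it holds the closing root tag
    root_close = deque(input_file, maxlen=1).pop()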
Something like this, still same general approach though.
#!/usr/bin/env python3
...
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
full_file_path = file_path + file_name + '.xml'
# get input_file size
file_size_bytes = os.path.getsize(full_file_path)
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()
num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
with open(full_file_path, 'rb') as input_file:
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    opening = input_file.readline() + input_file.readline()
    pointer = input_file.tell()
    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to start of input
    input_file.seek(pointer)
    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'
        with open(full_current_file_path, 'w+b') as output_file:
            print('Writing to:', full_current_file_path)
            # write XML header
            output_file.write(opening)
            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num > num_lines_per_split_file or line == root_close:
                    break
                output_file.write(line)
                output_line_num += 1
            # write the footer as the last line in the file
            output_file.write(root_close)
        # move on to the next output file
        current_file_num += 1
# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
- There is a problem in the refactored code: this line `with open(full_current_file_path, 'w+b') as output_file` would overwrite data from the input file. My version uses a different file name with `str(current_file_num)` added (this will now also get a code comment :) – Phrancis, Sep 4, 2016 at 14:45
- I don't quite understand that comment; `full_current_file_path` is set in each iteration based on `current_file_num` as well. I just extracted the duplicated code for it from the given code, so the semantics shouldn't have changed? – ferada, Sep 5, 2016 at 16:50
Consider actually using an XML module, specifically etree's `iterparse()`, which is made for large XML files: it reads the document iteratively instead of loading the whole thing into memory at once. Additionally, you no longer treat the XML as a text file by writing in headers and closing tags yourself, since the dedicated `.append()` and `.write()` methods are used and handle the encoding.
#!/usr/bin/python
import os
import math
from datetime import datetime
import xml.etree.ElementTree as et
# Clock to measure how long the script takes to execute
start = datetime.now()
SQL_SERVER_XML_SIZE_LIMIT = 20000000
file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com\\'
file_name = 'Badges'
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', file_path + file_name + '.xml')
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
else:
    num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
    # determine how many lines to write to each split output input_file
    num_lines_in_file = 2
    with open(file_path + file_name + '.xml', 'r', encoding='utf-8') as f:
        for line in f:
            num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # ITERATIVELY READ LINES IN XML FILE
    i = 0; current_file_num = 0
    root = et.Element('root')
    for (ev, el) in et.iterparse(file_path + file_name + '.xml'):
        i += 1
        if el.tag == 'row':
            root.append(el)
            # LINES PER FILE (BY MULTIPLE OF num_lines_per_split_file)
            if i % num_lines_per_split_file == 0:
                current_file_num += 1
                tree_out = et.ElementTree(root)
                tree_out.write(file_path + file_name + str(current_file_num) + '.xml',
                               encoding='utf-8', xml_declaration=True)
                root = et.Element('root')
            # REMAINING LINES (AFTER LAST MULTIPLE OF num_lines_per_split_file)
            # (5 = XML DECL, ROOT OPEN/CLOSE + num_lines start + i start = 1 + 2 + 1 + 1)
            if i == num_lines_in_file - 5:
                current_file_num += 1
                tree_out = et.ElementTree(root)
                tree_out.write(file_path + file_name + str(current_file_num) + '.xml',
                               encoding='utf-8', xml_declaration=True)
# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =',
num_lines_per_split_file, 'lines per input_file.')
print('Execution time:', datetime.now() - start)
- Isn't the issue that the database can't load the XML? Then it doesn't matter much whether the Python script uses an iterative approach - it's unlikely to be faster than not parsing the content at all. – ferada, Sep 5, 2016 at 16:55
- @ferada - I do not understand your comment. This approach is an alternative to yours and the OP's, but uses an XML module to split the large XML. I often advise not treating XML as a text file by writing in tags, as you may disrupt well-formed DOM structures, entities, whitespace, even encoding. Just writing in the header string doesn't make it UTF-8 encoded. – Parfait, Sep 5, 2016 at 17:40
- Of course, I'm assuming @Phrancis made sure that the files are actually formatted in a way that makes the pure binary approach viable, so nothing split across lines etc. – ferada, Sep 5, 2016 at 17:53