- Since you're using Python 3, `#!/usr/bin/python` isn't guaranteed to point at it on many distributions, so I'd say `#!/usr/bin/env python3` is a bit safer - that also deals with the binary being in another location.
- The last bit after `# Clean up and print results` should also be indented, no? All the variables are only set in the `else` branch. I'd actually put a `sys.exit()` in the `if` part and not have the indentation at all.
- Take a look at `os.path` for path manipulation; those functions are portable and a bit more structured than concatenating strings.
- I'd use `with` with the `input_file` too.
- The standard Python interpreter won't extract common subexpressions, so you'll have to (and probably should) do that yourself - the `encode` calls and the repeated path expressions shouldn't be recomputed all the time.
- Getting the last item from an iterator is actually a question on Stack Overflow, cf. https://stackoverflow.com/a/2138894/2769043 - perhaps use that.
- The comparison `line != root_close` could probably be replaced with a check of the current line number against the line number of the last line, which should be much faster than string comparison.
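To illustrate the `os.path` and last-item suggestions, here's a minimal sketch; the path components are hypothetical stand-ins for the ones in your script, and the deque trick is the one from the linked Stack Overflow answer:

```python
import os
import collections

# os.path.join handles separators portably instead of manual concatenation;
# these path components are placeholders, not your actual directories.
file_path = os.path.join('stackexchange', 'stackoverflow.com')
full_file_path = os.path.join(file_path, 'Badges' + '.xml')

# Last item of any iterator without holding everything in memory:
# a deque with maxlen=1 keeps only the most recent element it has seen.
last = collections.deque(['<root>', '<row/>', '</root>'], maxlen=1)
root_close = last[0]
```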
I think most comments are just going to be about intricacies of Python, the general structure and comments are great.
The only other comment would be that you can probably open the file in binary mode, not do any text encoding conversion and save some processing time that way.
I don't quite get why you have to go through the whole file just to get the closing XML line; it should be pretty clear what that line is going to be given the opening XML line. I'd probably change the whole part to read from the file once, accumulate as much as possible, and then open the next output file, instead of going through the input twice.
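A minimal sketch of that idea, assuming the file layout is an XML declaration, an opening root tag, row lines, and a closing root tag (the sample data below is made up): the closing tag is derived from the opening tag, so only one pass is needed.

```python
import io

# Made-up in-memory stand-in for the input file, opened in binary mode.
data = b'<?xml version="1.0"?>\n<badges>\n<row Id="1"/>\n<row Id="2"/>\n</badges>\n'
input_file = io.BytesIO(data)

header = input_file.readline()               # XML declaration
root_open = input_file.readline()            # e.g. b'<badges>\n'
root_close = b'</' + root_open.lstrip(b'<')  # derived: b'</badges>\n'

# Single pass: accumulate row lines until the closing tag appears.
chunks = []
for line in input_file:
    if line == root_close:
        break
    chunks.append(line)
```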
Something like this, still same general approach though.
```python
#!/usr/bin/env python3
...

# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com\\'
file_name = 'Badges'
full_file_path = file_path + file_name + '.xml'

# get input file size
file_size_bytes = os.path.getsize(full_file_path)

# no splitting needed if the XML file already fits within the SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('File size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()

num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)

with open(full_file_path, 'rb') as input_file:
    # get the XML declaration and opening root node, then count the remaining
    # lines to determine how many lines to write to each split output file
    opening = input_file.readline() + input_file.readline()
    pointer = input_file.tell()
    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1
    # after the loop, root_close holds the last line, i.e. the closing root node
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)

    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to the first line after the opening root node
    input_file.seek(pointer)
    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'
        with open(full_current_file_path, 'wb') as output_file:
            print('Writing to:', full_current_file_path)
            # write the XML header
            output_file.write(opening)
            # copy lines until we reach num_lines_per_split_file or the closing
            # root node; the line is written *before* the count check so that no
            # line is silently dropped between output files
            for output_line_num, line in enumerate(input_file, start=1):
                if line == root_close:
                    break
                output_file.write(line)
                if output_line_num >= num_lines_per_split_file:
                    break
            # write the footer as the last line in the file
            output_file.write(root_close)

# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed,
      'files =', num_lines_per_split_file, 'lines per file.')
print('Execution time:', datetime.now() - start)
```