- Since you're using Python 3, `#!/usr/bin/python` isn't guaranteed to point at it on many distributions, so I'd say `#!/usr/bin/env python3` is a bit safer - that also deals with the binary being in another location.
- The last bit after `# Clean up and print results` should also be indented, no? All the variables are only set in the `else` branch. I'd actually put a `sys.exit()` in the `if` part and not have the indentation at all.
- Take a look at `os.path` for path manipulation; those functions are portable and a bit more structured than concatenating strings.
- I'd use `with` with the `input_file` too.
- The standard Python interpreter won't extract common subexpressions, so you'll have to (and probably should) do that yourself - the `encode` calls and the repeated path expressions shouldn't be recomputed all the time.
- Getting the last item from an iterator is actually a question on Stack Overflow, cf. https://stackoverflow.com/a/2138894/2769043 - perhaps use that.
- The comparison `line != root_close` could probably be replaced with a check of the current line number against the line number of the last line, which should be much faster than string comparison.
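To illustrate the `os.path` and last-item suggestions, here's a minimal sketch; the path components are hypothetical stand-ins for the ones in your script, and the deque trick is the one from the linked Stack Overflow answer:

```python
import os
import collections

# os.path.join handles separators portably instead of manual concatenation;
# these path components are placeholders, not your actual directories.
file_path = os.path.join('stackexchange', 'stackoverflow.com')
full_file_path = os.path.join(file_path, 'Badges' + '.xml')

# Last item of any iterator without holding everything in memory:
# a deque with maxlen=1 keeps only the most recent element it has seen.
last = collections.deque(['<root>', '<row/>', '</root>'], maxlen=1)
root_close = last[0]
```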
I think most comments are just going to be about intricacies of Python, the general structure and comments are great.
The only other comment would be that you can probably open the file in binary mode, not do any text encoding conversion and save some processing time that way.
I don't quite get why you have to go through the whole file just to get the closing XML line; it should be pretty clear what that line is going to be given the opening XML line. I'd probably change the whole part to read from the file once, accumulate as much as possible, and then open the next output file, instead of going through the input twice.
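A minimal sketch of that idea, assuming the file layout is an XML declaration, an opening root tag, row lines, and a closing root tag (the sample data below is made up): the closing tag is derived from the opening tag, so only one pass is needed.

```python
import io

# Made-up in-memory stand-in for the input file, opened in binary mode.
data = b'<?xml version="1.0"?>\n<badges>\n<row Id="1"/>\n<row Id="2"/>\n</badges>\n'
input_file = io.BytesIO(data)

header = input_file.readline()               # XML declaration
root_open = input_file.readline()            # e.g. b'<badges>\n'
root_close = b'</' + root_open.lstrip(b'<')  # derived: b'</badges>\n'

# Single pass: accumulate row lines until the closing tag appears.
chunks = []
for line in input_file:
    if line == root_close:
        break
    chunks.append(line)
```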
Something like this, still same general approach though.
```python
#!/usr/bin/env python3
...

# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\\Downloads\\stackexchange\\stackoverflow.com\\'
file_name = 'Badges'
full_file_path = file_path + file_name + '.xml'

# get input file size
file_size_bytes = os.path.getsize(full_file_path)

# no splitting needed if the XML file already fits within the SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('File size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()

num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)

with open(full_file_path, 'rb') as input_file:
    # get the XML declaration and opening root node, then count the remaining
    # lines to determine how many lines to write to each split output file
    opening = input_file.readline() + input_file.readline()
    pointer = input_file.tell()
    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1
    # after the loop, root_close holds the last line, i.e. the closing root node
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)

    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to the first line after the opening root node
    input_file.seek(pointer)
    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'
        with open(full_current_file_path, 'wb') as output_file:
            print('Writing to:', full_current_file_path)
            # write the XML header
            output_file.write(opening)
            # copy lines until we reach num_lines_per_split_file or the closing
            # root node; the line is written *before* the count check so that no
            # line is silently dropped between output files
            for output_line_num, line in enumerate(input_file, start=1):
                if line == root_close:
                    break
                output_file.write(line)
                if output_line_num >= num_lines_per_split_file:
                    break
            # write the footer as the last line in the file
            output_file.write(root_close)

# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed,
      'files =', num_lines_per_split_file, 'lines per file.')
print('Execution time:', datetime.now() - start)
```