I've written a quick script for a coworker to split a large CAN log into smaller chunks. (If you're not familiar with CAN, it's a communication protocol used by the ECUs in many cars.) I know where to split because I've inserted dummy CAN messages (with ID 0x00) at the start of each section, and one at the end of testing (which may be somewhere in the middle of the log) to tell me when to stop reading.
The log is in .asc or .csv format, and can be several gigabytes in size. Currently I can process a 1.5GB file in about 40 seconds, but I'm sure that can be improved. I'm looking more for advice on how to speed this up than to make it more Pythonic, but of course criticism is welcome in both areas.
Note: titles is a dictionary mapping section numbers to a particular string that needs to be added to the filename before saving. I can add the code for generating these, but I don't believe it's as relevant.
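For illustration only, a titles mapping of the shape the code below expects could look like the following. The keys follow the "AT{req}_{obj}" pattern built in create_title; the prefixes, the helper name title_for and the sample payload are made up for this sketch.

```python
# Hypothetical example data: real keys and prefixes come from the test setup.
titles = {
    "AT1_2": "BrakeTest",
    "AT1_3": "SteeringTest",
}


def title_for(payload, titles):
    """Build a section filename from a dummy message's joined data bytes."""
    req_num = int(payload[0:8])    # first four data bytes
    obj_num = int(payload[8:16])   # last four data bytes
    at = "AT{}_{}".format(req_num, obj_num)
    return "{}_{}.asc".format(titles[at], at)


# Data bytes "00 00 00 01 00 00 00 02" joined into one string
example = title_for("0000000100000002", titles)
```

With this made-up mapping, example is "BrakeTest_AT1_2.asc".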
def split_asc_file(target_file, target_dir, titles):
    import os
    import time
    start_time = time.time()
    if not os.path.isdir(target_dir):
        os.mkdir(target_dir)
    os.chdir(target_dir)
    section = None

    def create_title(message_string):
        req_num = int(message_string[0:8])
        obj_num = int(message_string[8:16])
        if req_num == 0 and obj_num == 0:
            print "Splitting completed in {} seconds".format(time.time() - start_time)
            quit() # final test has been executed
        else:
            at = "AT{}_{}".format(req_num, obj_num)
            title_prefix = titles[at]
            title_string = "{}_{}.asc".format(title_prefix, at)
            return title_string

    def can_traffic_only(f):
        # iterate only over lines that contain messages
        for line in f:
            if len(line.split()) == 14:
                yield line

    with open(target_file) as log:
        print "Opening {}...".format(target_file)
        for message in can_traffic_only(log):
            values = message.split()
            can_id = values[2]
            can_data = "".join(values[6:])
            if can_id == "0":
                if section:
                    section.close()
                title = create_title(can_data)
                if title:
                    print "Creating {}".format(title)
                    section = open(title, "w")
            else:
                if section:
                    section.write(message)
    print "Splitting completed in {} seconds".format(time.time() - start_time)
2 Answers
What I get from your code is that you skip messages until the first dummy one, which indicates the first section, and then you follow this cycle:
- Extract title information out of the dummy message;
- Open a file to write the messages of this section into;
- Write relevant messages until the next dummy one.
Reorganizing your code to follow this layout more closely lets you remove the if section tests, which are executed for every line and may be slowing things down a bit.
You can also remove your if title check, since create_title will never return anything other than a string of more than 5 characters. But I guess it was used before to check for the end of the tests, and I'll reuse it for that.
By combining that with proposals by @ferada, you can end up with:
import os
import time


def create_title(message_string, titles):
    req_num = int(message_string[0:8])
    obj_num = int(message_string[8:16])
    if not req_num and not obj_num:
        return  # end-of-testing marker: no title
    at = "AT{}_{}".format(req_num, obj_num)
    title_prefix = titles[at]
    return "{}_{}.asc".format(title_prefix, at)


def split_asc_file(target_file, target_dir, titles):
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    os.chdir(target_dir)
    with open(target_file) as log:
        print 'Opening', target_file
        # Bootstrap
        for message in log:
            data = message.split()
            if len(data) == 14 and data[2] == "0":
                break
        else:
            return  # no dummy message found: nothing to split
        while True:
            # Using message rather than reusing data here; see next comment
            data = message.split()[6:]
            title = create_title(''.join(data), titles)
            if title is None:
                break
            with open(title, 'w') as section:
                print 'Created', title
                for message in log:
                    # Knowing the input format, you should be able to extract
                    # the same information as the next two ifs by analyzing
                    # message rather than splitting it, as ferada suggested
                    data = message.split()
                    if len(data) == 14:
                        if data[2] == "0":
                            break
                        section.write(message)
                else:
                    return  # log ended without an end-of-testing marker


if __name__ == '__main__':
    start_time = time.time()
    split_asc_file(..,..,..) #Whatever
    print "Splitting completed in {} seconds".format(time.time() - start_time)
The workflow I proposed also lets you open the section file using a with statement, which is preferred in Python. I also changed mkdir to makedirs, just in case: makedirs creates any missing intermediate directories as well.
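A small sketch of the difference between the two calls, using a throwaway temporary directory. The "logs/split" layout is made up for the example.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
nested = os.path.join(base, "logs", "split")  # hypothetical nested target

try:
    os.mkdir(nested)       # fails: the parent "logs" does not exist yet
    mkdir_worked = True
except OSError:
    mkdir_worked = False

os.makedirs(nested)        # creates "logs" and "split" in one call
makedirs_worked = os.path.isdir(nested)

shutil.rmtree(base)        # clean up the temporary tree
```

Here mkdir_worked ends up False and makedirs_worked ends up True.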
I think that's close to the best you're going to get with Python.
I'd suggest taking a profiler and optimising according to that; e.g. I can imagine that doing less work than split, by just counting the number of spaces (instead of allocating all the results), should be a bit faster (in can_traffic_only).
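A rough sketch of the space-counting idea, with a timing helper so the two checks can be compared. It assumes fields are separated by exactly one space and that lines have no leading or trailing spaces; if the log pads columns with several spaces, the count-based check is not equivalent to split. The sample lines and function names are made up.

```python
import timeit

FIELDS = 14  # number of fields in a CAN message line, as in the question


def is_message_split(line):
    # Original approach: allocate the full list of fields, then count them
    return len(line.split()) == FIELDS


def is_message_count(line):
    # Assumption: exactly one space between fields, none at the line ends
    return line.count(" ") == FIELDS - 1


# Made-up sample lines in the assumed single-space layout
sample = "1.000000 1 0 Rx d 8 00 00 00 01 00 00 00 02\n"
header = "date Mon Jan 1 10:00:00 2018\n"


def compare(number=100000):
    """Return (split_seconds, count_seconds) for `number` calls each."""
    t_split = timeit.timeit(lambda: is_message_split(sample), number=number)
    t_count = timeit.timeit(lambda: is_message_count(sample), number=number)
    return t_split, t_count
```

Calling compare() on your own machine, ideally with lines taken from a real log, shows whether the difference is worth it.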
can_data can be delayed till the condition for the if block is true, but again, depends on how often that's the case.
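As a sketch of what delaying can_data could look like, the join below only runs for dummy messages. The function name section_payloads is made up, and it works on any iterable of lines so it can be tried without a real log.

```python
def section_payloads(lines):
    """Return the joined data bytes of every dummy (ID 0) message."""
    payloads = []
    for line in lines:
        values = line.split()
        if len(values) != 14:
            continue  # not a CAN message line
        if values[2] == "0":
            # Only dummy messages need their payload, so join it here
            payloads.append("".join(values[6:]))
    return payloads
```

Ordinary messages then skip the slice and join entirely, which only pays off because dummy messages are rare compared to normal traffic.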
If there's nothing else you could inline can_traffic_only and see if that makes a difference.
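One way to see whether a change like that makes a difference is to run the code under cProfile from the standard library. This is a minimal sketch: count_messages is a made-up stand-in workload, and profile_call is a hypothetical helper that you would point at the real splitting function instead.

```python
import cProfile


def can_traffic_only(lines):
    # Same filter as in the question: keep only 14-field message lines
    for line in lines:
        if len(line.split()) == 14:
            yield line


def count_messages(lines):
    """Stand-in workload; swap in the real splitting function."""
    return sum(1 for _ in can_traffic_only(lines))


def profile_call(func, *args):
    """Run func(*args) under cProfile, print a report, return the result."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    profiler.print_stats("cumulative")  # report goes to stdout
    return result
```

Profiling once with the generator and once with the check inlined gives two reports that can be compared directly.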
In can_traffic_only you split a line and check the number of parts, and in the other part you split that line again. can_traffic_only could return the list of parts so that the second split can be eliminated.
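A sketch of that suggestion, assuming the generator yields both the raw line (still needed for writing) and its parts. The helper count_dummies is made up to show how a caller would unpack the pair.

```python
def can_traffic_only(lines):
    """Yield (line, values) for lines that contain a CAN message."""
    for line in lines:
        values = line.split()
        if len(values) == 14:
            yield line, values


def count_dummies(lines):
    """Count dummy (ID 0) messages without splitting any line twice."""
    return sum(1 for _line, values in can_traffic_only(lines)
               if values[2] == "0")
```

In the main loop this becomes "for message, values in can_traffic_only(log):", with values[2] as the CAN ID and message passed unchanged to section.write.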