6
\$\begingroup\$

I've written a quick script for a coworker to split a large CAN log into smaller chunks. (If you're not familiar with CAN, it's a communication protocol used by the ECUs in many cars.) I know where to split because I've inserted dummy CAN messages (with ID 0x00) at the start of each section, and one at the end of testing (which may be somewhere in the middle of the log) to tell me when to stop reading.

The log is in .asc or .csv format, and can be several gigabytes in size. Currently I can process a 1.5GB file in about 40 seconds, but I'm sure that can be improved. I'm looking more for advice on how to speed this up than to make it more Pythonic, but of course criticism is welcome in both areas.

Note: titles is a dictionary mapping section numbers to a particular string that needs to be added to the filename before saving. I can add the code for generating these, but I don't believe it's as relevant.

def split_asc_file(target_file, target_dir, titles):
    import os
    import time
    start_time = time.time()
    if not os.path.isdir(target_dir):
        os.mkdir(target_dir)
    os.chdir(target_dir)
    section = None

    def create_title(message_string):
        req_num = int(message_string[0:8])
        obj_num = int(message_string[8:16])
        if req_num == 0 and obj_num == 0:
            print "Splitting completed in {} seconds".format(time.time() - start_time)
            quit() # final test has been executed
        else:
            at = "AT{}_{}".format(req_num, obj_num)
            title_prefix = titles[at]
            title_string = "{}_{}.asc".format(title_prefix, at)
            return title_string

    def can_traffic_only(f):
        # iterate only over lines that contain messages
        for line in f:
            if len(line.split()) == 14:
                yield line

    with open(target_file) as log:
        print "Opening {}...".format(target_file)
        for message in can_traffic_only(log):
            values = message.split()
            can_id = values[2]
            can_data = "".join(values[6:])
            if can_id == "0":
                if section:
                    section.close()
                title = create_title(can_data)
                if title:
                    print "Creating {}".format(title)
                    section = open(title, "w")
            else:
                if section:
                    section.write(message)
    print "Splitting completed in {} seconds".format(time.time() - start_time)
asked Jun 23, 2016 at 20:59
\$\endgroup\$
3
  • In can_traffic_only you split a line to check the number of parts, and in the main loop you split that same line again. can_traffic_only could return the list of parts so that the second split can be eliminated. Commented Jun 24, 2016 at 13:32
  • I originally had it this way, but I realized I needed the complete message for the write here: section.write(message) Commented Jun 24, 2016 at 13:50
  • Oops, I missed that, but of course you could return both. Commented Jun 24, 2016 at 14:12
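
The "return both" idea from the comments could look roughly like the sketch below: the generator yields the raw line together with its fields, so the caller can write the line unchanged without splitting it a second time. The sample lines are hypothetical and only mimic the 14-field layout the question's code tests for. The snippet uses syntax that runs on both Python 2 and Python 3.

```python
def can_traffic_only(lines):
    """Yield (raw_line, fields) for lines that look like CAN messages."""
    for line in lines:
        fields = line.split()
        if len(fields) == 14:  # same 14-field test as in the question
            yield line, fields


# Hypothetical sample lines, not taken from a real log
sample = [
    "date Thu Jun 23 2016\n",
    "0.001 1 0 Rx d 8 00 00 00 01 00 00 00 02\n",
    "0.002 1 1A0 Rx d 8 11 22 33 44 55 66 77 88\n",
]

for message, values in can_traffic_only(sample):
    can_id = values[2]  # fields are already split, no second split needed
    print("{} -> {}".format(can_id, message.rstrip()))
```

Whether this is measurably faster depends on the share of message lines in the log, so it is worth timing against the original.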

2 Answers

3
\$\begingroup\$

What I get from your code is that you skip messages until the first dummy one, which indicates the start of the first section, and then you repeat the following cycle:

  • Extract title information out of the dummy message;
  • Open a file to hold the messages of this section;
  • Write relevant messages until the next dummy one.

Reorganizing your code to follow this layout more closely lets you remove your if section tests, which are executed for every line and may be slowing things down a bit.

You can also remove your if title since create_title will never return anything other than a string of more than 5 characters. But I guess that it was used before to check for the end of the tests and I’ll reuse that.

By combining that with proposals by @ferada, you can end up with:

import os
import time


def create_title(message_string, titles):
    req_num = int(message_string[0:8])
    obj_num = int(message_string[8:16])
    if not req_num and not obj_num:
        return
    at = "AT{}_{}".format(req_num, obj_num)
    title_prefix = titles[at]
    return "{}_{}.asc".format(title_prefix, at)


def split_asc_file(target_file, target_dir, titles):
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    os.chdir(target_dir)
    with open(target_file) as log:
        print 'Opening', target_file
        # Bootstrap
        for message in log:
            data = message.split()
            if len(data) == 14 and data[2] == "0":
                break
        while True:
            # Using message rather than reusing data here; see next comment
            data = message.split()[6:]
            title = create_title(''.join(data), titles)
            if title is None:
                break
            with open(title, 'w') as section:
                print 'Created', title
                for message in log:
                    # Knowing the input format, you should be able to extract
                    # the same information than the next two ifs by analyzing
                    # message rather than splitting it, as ferada suggested
                    data = message.split()
                    if len(data) == 14:
                        if data[2] == "0":
                            break
                        section.write(message)


if __name__ == '__main__':
    start_time = time.time()
    split_asc_file(..,..,..) #Whatever
    print "Splitting completed in {} seconds".format(time.time() - start_time)

The workflow I proposed also lets you open the section file using a with statement, which is preferred in Python. I also changed mkdir to makedirs, just in case the parent directories do not exist yet.
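
One case worth guarding against is a log that ends before the final 0/0 marker is seen: the inner for loop then finishes without a break, and while True runs again on the last line that was read. A for ... else clause is one way to handle this. The sketch below is an illustration only: split_sections is a hypothetical helper that collects lines in memory instead of writing files so that it is easy to test, and the sample lines and titles are made up. It runs on both Python 2 and Python 3.

```python
def create_title(message_string, titles):
    req_num = int(message_string[0:8])
    obj_num = int(message_string[8:16])
    if not req_num and not obj_num:
        return None  # end-of-testing marker
    at = "AT{}_{}".format(req_num, obj_num)
    return "{}_{}.asc".format(titles[at], at)


def split_sections(lines, titles):
    """Group message lines by section title (hypothetical in-memory variant)."""
    sections = {}
    lines = iter(lines)  # one shared iterator, so loops resume where others stopped
    # Bootstrap: skip everything before the first marker
    for message in lines:
        data = message.split()
        if len(data) == 14 and data[2] == "0":
            break
    else:
        return sections  # input contained no marker at all
    while True:
        title = create_title("".join(message.split()[6:]), titles)
        if title is None:
            return sections  # explicit end marker reached
        current = sections.setdefault(title, [])
        for message in lines:
            data = message.split()
            if len(data) == 14:
                if data[2] == "0":
                    break
                current.append(message)
        else:
            return sections  # input ended without the final marker


# Hypothetical data: one section, no end marker
titles = {"AT1_2": "Foo"}
log_lines = [
    "date Thu Jun 23 2016\n",
    "0.001 1 0 Rx d 8 00 00 00 01 00 00 00 02\n",
    "0.002 1 1A0 Rx d 8 11 22 33 44 55 66 77 88\n",
]
result = split_sections(log_lines, titles)
print(sorted(result))
```

The same two else clauses can be added to the file-writing version above without changing its behavior for logs that do contain the end marker.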

answered Jun 26, 2016 at 9:48
\$\endgroup\$
4
\$\begingroup\$

I think that's already close to the best you're going to get with Python.

I'd suggest taking a profiler and optimising according to what it shows; e.g. I can imagine that avoiding split in can_traffic_only and just counting the number of spaces (instead of allocating all the resulting substrings) should be a bit faster.
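
For the whole script, the standard library profiler can be run as python -m cProfile -s cumulative split_log.py, where split_log.py is a placeholder name. For a micro-comparison of the two line tests, timeit is usually more reliable. The sketch below assumes fields are separated by exactly one space, which may not hold for real .asc logs (columns are often padded), so treat the counting variant as something to verify against your own data. The sample line is hypothetical.

```python
import timeit

# Hypothetical sample line with single-space separators (an assumption)
LINE = "0.002 1 1A0 Rx d 8 11 22 33 44 55 66 77 88\n"


def is_message_split(line):
    # Original test: allocate the full list of fields, then count them
    return len(line.split()) == 14


def is_message_count(line):
    # Alternative: 14 single-space-separated fields contain 13 spaces
    return line.count(" ") == 13


for func in (is_message_split, is_message_count):
    elapsed = timeit.timeit(lambda: func(LINE), number=100000)
    print("{}: {:.3f} s".format(func.__name__, elapsed))
```

The two tests only agree when the separator assumption holds, so it is worth checking both on a sample of real lines before swapping one for the other.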

Building can_data can be delayed until the condition for the if block is true, but again, the benefit depends on how often that's the case.
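
As a rough sketch of that idea, the join can be moved inside the branch that handles marker lines, so ordinary traffic never pays for it. marker_payload is a hypothetical helper name and the sample lines are made up.

```python
def marker_payload(values):
    """Return the joined data bytes for a marker line, or None for ordinary traffic."""
    if values[2] == "0":
        # Only marker lines reach this join; ordinary traffic skips it
        return "".join(values[6:])
    return None


# Hypothetical sample lines, already split into fields
marker = "0.001 1 0 Rx d 8 00 00 00 01 00 00 00 02".split()
traffic = "0.002 1 1A0 Rx d 8 11 22 33 44 55 66 77 88".split()

print(marker_payload(marker))
print(marker_payload(traffic))
```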

If nothing else helps, you could inline can_traffic_only and see whether that makes a difference.

answered Jun 24, 2016 at 13:53
\$\endgroup\$
