I have just been told that the following code I wrote is extremely poor, and I have no idea why. It is memory efficient and looks clean to me, but the feedback was still very negative. If someone could add some comments explaining the problems, I would be very grateful. The file name has to be passed from the command line to make it work; that is what was asked for.
import re
import numpy as np
from collections import Counter
import sys
def parse_server_log(path_to_file):
    # First check whether it's a legal file name
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    # end of sanity check on file
    # The following regexes extract the values of interest from each line
    # of the file.
    method_regx = re.compile("(method=)+[A-Z]+[\s]+")
    path_regx = re.compile("(path=)+[/\w.\"]+[\s]+")
    dyno_regx = re.compile("(dyno=)+[\w.]+[\s]+")
    connect_regx = re.compile("(connect=)+[\w.]+[\s]+")
    service_regx = re.compile("(service=)+[\w.]+[\s]+")
    # Target values for each url
    url1 = [0, [], []]
    url2 = [0, [], []]
    url3 = [0, [], []]
    url4 = [0, [], []]
    url5 = [0, [], []]
    url6 = [0, [], []]
    # url matcher regex for each url type
    url1_regex = re.compile("(#)+(/api/users/)+[\d]+(/count_pending_messages)+(#)+")
    url2_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_messages)+(#)+")
    url3_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_progress)+(#)+")
    url4_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_score)+(#)+")
    url5_6_regex = re.compile("(#)+(/api/users/)+[\d]+(#)+")
    with open(path_to_file) as lines:
        for my_data in lines:
            # Now let's separate out the method, path, dyno, connect and service times
            k = method_regx.search(my_data)
            try:
                line_method = k.group(0).split("=")[1].strip()
            except:
                line_method = ""
            k = path_regx.search(my_data)
            try:
                # The hashes are added at the start and end to make sure the path
                # is not part of a longer string but is the entire string!
                line_path = "#" + k.group(0).split("=")[1].strip() + "#"
            except:
                line_path = ""
            k = dyno_regx.search(my_data)
            try:
                line_dyno = k.group(0).split("=")[1].strip()
            except:
                line_dyno = ""
            k = connect_regx.search(my_data)
            try:
                line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])
            except:
                line_connect_time = 0
            k = service_regx.search(my_data)
            try:
                line_service_time = float(k.group(0).split("=")[1].split("ms")[0])
            except:
                line_service_time = 0
            # End of getting all the data
            # Now match up the URL, guarded by a sanity check
            if(len(line_method) > 0 and len(line_path) > 0):
                url_denoter = 0
                if url1_regex.match(line_path) is not None:
                    url_denoter = 1
                elif url2_regex.match(line_path) is not None:
                    url_denoter = 2
                elif url3_regex.match(line_path) is not None:
                    url_denoter = 3
                elif url4_regex.match(line_path) is not None:
                    url_denoter = 4
                elif url5_6_regex.match(line_path) is not None:
                    url_denoter = 5
                # OK, so now we know which url pattern the current line matched
                if(url_denoter==1 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/count_pending_messages
                    """
                    url1[0] += 1
                    url1[1].append(line_dyno)
                    url1[2].append(line_connect_time + line_service_time)
                elif(url_denoter==2 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/get_messages
                    """
                    url2[0] += 1
                    url2[1].append(line_dyno)
                    url2[2].append(line_connect_time + line_service_time)
    """
    Now print the results!
    """
    # for GET /api/users/{user_id}/count_pending_messages
    print "\n------GET /api/users/{user_id}/count_pending_messages----\n"
    print "Number of times the url is called: ", url1[0]
    if(url1[0]>0):
        my_num_list = url1[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url1[1].count(x)) for x in set(url1[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1] > swap):
                swap = count[1]
                tar_dyno = count[0]
        print "Dyno that responded the most: ", tar_dyno
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"
    print "\n------GET /api/users/{user_id}/get_messages----\n"
    print "Number of times the url is called: ", url2[0]
    if(url2[0]>0):
        my_num_list = url2[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url2[1].count(x)) for x in set(url2[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1] > swap):
                swap = count[1]
                tar_dyno = count[0]
        print "Dyno that responded the most: ", tar_dyno
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"
    print "\n------GET /api/users/{user_id}/get_friends_progress----\n"
    print "Number of times the url is called: ", url3[0]
    # ... (reporting for the remaining urls continues in the same pattern)

if(__name__=="__main__"):
    parse_server_log(sys.argv[1])
Firstly, Python has a style guide and (unless you're given a different guide, in which case please provide a link to it) you should follow it. Your code generally follows it, but note that the imports are in the wrong order; it should be:

from collections import Counter
import re
import sys

import numpy as np

Note the alphabetical order, and the split between standard-library and third-party modules.
Repetition is a big issue when writing good code, and this was an immediate red flag:
url1 = [0, [], []]
url2 = [0, [], []]
url3 = [0, [], []]
url4 = [0, [], []]
url5 = [0, [], []]
url6 = [0, [], []]
Why not make a list, urls, containing these structures? You can cut this to one line, so if you change the structure later you only do it once:
urls = [[0, [], []] for _ in range(6)]
There are numerous other elements of repetition in your code, which can be reduced in a similar way.
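To make that concrete, here is a minimal sketch of one data-driven layout; the ENDPOINTS table, the record helper and the field names are my own invention for illustration, not part of the original code:

```python
import re

# Hypothetical data-driven replacement for url1..url6: one table maps an
# endpoint name to a compiled pattern, so adding an endpoint is a
# one-line change instead of a new variable plus a new elif branch.
ENDPOINTS = [
    ("count_pending_messages", re.compile(r"^/api/users/\d+/count_pending_messages$")),
    ("get_messages", re.compile(r"^/api/users/\d+/get_messages$")),
    ("get_friends_progress", re.compile(r"^/api/users/\d+/get_friends_progress$")),
    ("get_friends_score", re.compile(r"^/api/users/\d+/get_friends_score$")),
]

# One stats record per endpoint, built in a single comprehension.
stats = {name: {"count": 0, "dynos": [], "times": []} for name, _ in ENDPOINTS}

def record(path, dyno, response_time):
    """Update the stats of the first endpoint whose pattern matches."""
    for name, pattern in ENDPOINTS:
        if pattern.match(path):
            entry = stats[name]
            entry["count"] += 1
            entry["dynos"].append(dyno)
            entry["times"].append(response_time)
            return name
    return None
```

With this shape, the per-endpoint reporting can also become a single loop over stats instead of one copied block per URL.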
parse_server_log is much too long. You should split it into smaller, self-contained functions with sensible parameters and return values, which will make your code much easier to read, understand and maintain. Each should have a docstring explaining exactly what it does. This will also help in reducing repetition.
A bare except is a bad idea: at the very least you should use except Exception, but it is much better to figure out which errors could actually occur and handle them explicitly. See e.g. "The evils of except".
You should use string formatting, e.g.
print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
should be
print "Average response time: {0:.2f} in ms.".format(np.average(my_num_list))
Use tuple packing and unpacking, e.g.
for count in counts:
    if(count[1]> swap):
        swap = count[1]
        tar_dyno = count[0]
could be:
for dyno_, swap_ in counts:
    if swap_ > swap:
        swap, tar_dyno = swap_, dyno_
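Going one step further (this is my own suggestion, not part of the answer above), the tallying and the maximum-finding can both be delegated to collections.Counter; the dyno names below are made up:

```python
from collections import Counter

# Counter.most_common(1) returns a list holding the single most frequent
# (value, count) pair, replacing both the manual counting and the loop.
dynos = ["web.1", "web.2", "web.1", "web.3", "web.1"]
tar_dyno, swap = Counter(dynos).most_common(1)[0]
```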
Your file check

try:
    f = open(path_to_file)
except:
    print "\nInvalid file name and/or path. Please correct!\n"
    return

never closes the file, which stays open through the whole function. You can check whether a file exists using the os module; do that instead. Alternatively, take a look at this SO answer.
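For illustration, a hedged sketch of both styles; the helper names are mine, and which one fits depends on how you want failures reported:

```python
import os

def file_exists(path):
    # "Look before you leap": an explicit existence check with os.path.
    return os.path.isfile(path)

def read_log(path):
    # "Easier to ask forgiveness": just try to open it. The with block
    # guarantees the handle is closed, and IOError/OSError covers a
    # missing or unreadable file.
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError):
        return None
```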
- A small point re your last point: if the file can't be opened, there is nothing to close anyway. And it's usually better to try to open the file and handle the exception than to do an existence check beforehand. Even an existence check can be misleading, because the file might exist at one moment and be deleted the next. – Greg Hewgill, Nov 4, 2014
- @GregHewgill you're right, but if the file isn't there the function ends anyway, and the current arrangement doesn't protect against file deletion either. – jonrsharpe, Nov 4, 2014
- Isn't it better to use with open(path_to_file) as f wrapped in a try...except EnvironmentError, as suggested here? (Although the modified with from another answer there looks neat, too.) – Tobias Kienzler, Nov 5, 2014
- @TobiasKienzler thanks, I've added that into my answer. – jonrsharpe, Nov 5, 2014
- Also: you only need one regex, maybe two. – 300D7309EF17, Nov 7, 2014
+ in a regex means once or more. I think in your case you are mistaking it for a concatenation operator.
I think you could factorize a lot with the help of better regex. For example:
method_regx = re.compile("(method=)+[A-Z]+[\s]+")
# ...
k = method_regx.search(my_data)
try:
    line_method = k.group(0).split("=")[1].strip()
except:
    line_method = ""
This could be rewritten as:
method_regx = re.compile("method=([A-Z]+)[\s]+")
# ...
k = method_regx.search(my_data)
line_method = k.group(1) if k else ''
since k will be None if the search did not work.
Likewise, you can probably find a regex that does all the work and lets you remove the split from:

line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])
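As one possible sketch of such a pattern (it assumes the five fields always appear in this order on a line, which the original code does not require, and the sample line is invented):

```python
import re

# A single pattern with named groups pulls out all five fields at once;
# the numeric groups already exclude the trailing "ms", so no split() is
# needed afterwards.
LINE_RE = re.compile(
    r"method=(?P<method>[A-Z]+).*?"
    r"path=(?P<path>\S+).*?"
    r"dyno=(?P<dyno>\S+).*?"
    r"connect=(?P<connect>[\d.]+)ms.*?"
    r"service=(?P<service>[\d.]+)ms"
)

sample = 'method=GET path="/api/users/42/get_messages" dyno=web.1 connect=2ms service=50ms'
m = LINE_RE.search(sample)
fields = m.groupdict() if m else {}
```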
- Actually, no, I know the meaning of + in a regex context! But yes, I agree that it could have been better as you suggested. Thanks mate! – user3001408, Nov 5, 2014
- @user3001408 you could try using something like regex101.com/#python to help you develop more precise regular expressions. – jonrsharpe, Nov 5, 2014
Great answers already, but I wanted to jump in with one more point about file handling, which actually reflects just as much the point about how parse_server_log is too long. Consider the first bit of your method:
def parse_server_log(path_to_file):
    : : :
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    : : :
When this (or any) method is called, it can do some combination of things:
- read or write global data
- read or write to arguments that are passed to it
- return a single object
- throw an exception
In this case, you've chosen to make parse_server_log read an argument that is passed to it, read global data (the contents of the file), write to global data (via print statements), and return None. In some less common scenarios, or if there are bugs in your code, it may also throw an exception.
But as a consumer of this code, how can I tell what happened? Short of hacks like substituting a StringIO object for sys.stdout and then parsing it, I can't. All that's available is that, as a user invoking the script, I can read the output. This is very limiting. If parse_server_log does everything you want it to do, this is fine. It's a quality one-off script, and it can be modified to handle new needs that come up.
But it may be more useful to separate its concerns. parse_server_log sounds like it should return a data structure that represents the contents of the log. Another function could process that structure, and a third could summarize it with print statements. If it were structured like this, you would need to know whether parse_server_log succeeded or failed. This could be by changing what it returns, or by throwing an exception.
All of that is a very long-winded way of saying this: consider letting open(path_to_file) throw the exception without catching it. Then a client of parse_server_log can see the exception and choose how to handle it. In this case Python would just dump that information to the screen, but even that might tell you more than the message you chose to print in its place.
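A rough sketch of that separation; every name here is hypothetical, and the parsing is stubbed down to a trivial key=value format just to show the shape:

```python
# Hypothetical split: parsing returns data, summarizing computes numbers,
# and only the reporting layer prints, so each piece can be tested alone.
def parse_lines(lines):
    """Turn raw "key=value" log lines into (path, service_time) records."""
    records = []
    for line in lines:
        fields = dict(part.split("=", 1) for part in line.split())
        records.append((fields["path"], float(fields["service"])))
    return records

def summarize(records):
    """Reduce records to a path -> total-time mapping."""
    totals = {}
    for path, t in records:
        totals[path] = totals.get(path, 0.0) + t
    return totals

def report(totals):
    """Only this layer performs output."""
    for path, total in sorted(totals.items()):
        print("{0}: {1:.2f} ms".format(path, total))
```

If parse_lines hits a malformed line or a missing file, the exception now propagates to the caller, which can decide whether to print, skip, or abort.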
Just as a small (but slightly too large for a comment) footnote to the other answers, as convenient as numpy is, it might be kind of overkill just for computing the average and median of a list. Instead, you can easily do it yourself:
from math import fsum
a = sorted(my_num_list)
n = len(a)
average = fsum(a) / n
median = a[n/2]
or, if you want to get fancy for even-length lists (where the median is not always uniquely defined):
median = float(a[(n-1)/2] + a[n/2]) / 2
The main advantage of eliminating this dependency on numpy, besides slightly reduced memory use and faster loading time, is that your code will work even on systems where numpy may not be available for some reason. Also, to me, it just seems cleaner.
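As a sanity check (my own addition), the hand-rolled versions agree with Python 3's standard statistics module; note that the n/2 indexing above relies on Python 2 integer division, so the version below spells it // instead:

```python
from math import fsum
import statistics

# Hand-rolled average/median as in the snippet above, written with //
# so it also runs under Python 3; the sample times are made up.
my_num_list = [4.0, 1.0, 3.0, 2.0]
a = sorted(my_num_list)
n = len(a)
average = fsum(a) / n
median = (a[(n - 1) // 2] + a[n // 2]) / 2.0
```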
- Your regex (#)+(/api/users/)+[\d]+(#)+ indicates that each group can be present several times? I don't think + does what you think it does. Also, later you are not using the fact that you have groups.
- except statements should have explicit exception types. E.g., for catching non-existent files: except IOError:.