I have just been told that the following code I wrote is extremely poor, and I have no idea why. It is memory efficient and looks clean to me, but the feedback was still very negative. If someone could add some comments explaining the problems, I would be very grateful. The file name has to be passed from the command line to make it work; that is what was asked for.
import re
import numpy as np
from collections import Counter
import sys
def parse_server_log(path_to_file):
    # First check whether it's a legal file name
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    # end of sanity check on file
    # The following regexes extract the values of interest from each line
    # of the file.
    method_regx = re.compile("(method=)+[A-Z]+[\s]+")
    path_regx = re.compile("(path=)+[/\w.\"]+[\s]+")
    dyno_regx = re.compile("(dyno=)+[\w.]+[\s]+")
    connect_regx = re.compile("(connect=)+[\w.]+[\s]+")
    service_regx = re.compile("(service=)+[\w.]+[\s]+")
    # Target values for each url
    url1 = [0, [], []]
    url2 = [0, [], []]
    url3 = [0, [], []]
    url4 = [0, [], []]
    url5 = [0, [], []]
    url6 = [0, [], []]
    # url matcher regex for each url type
    url1_regex = re.compile("(#)+(/api/users/)+[\d]+(/count_pending_messages)+(#)+")
    url2_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_messages)+(#)+")
    url3_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_progress)+(#)+")
    url4_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_score)+(#)+")
    url5_6_regex = re.compile("(#)+(/api/users/)+[\d]+(#)+")
    with open(path_to_file) as lines:
        for my_data in lines:
            # Now let's separate out the method, path, dyno, connect and service times
            k = method_regx.search(my_data)
            try:
                line_method = k.group(0).split("=")[1].strip()
            except:
                line_method = ""
            k = path_regx.search(my_data)
            try:
                # The hashes are added at the start and end to make sure the path
                # is not part of a longer string but is the entire string!
                line_path = "#" + k.group(0).split("=")[1].strip() + "#"
            except:
                line_path = ""
            k = dyno_regx.search(my_data)
            try:
                line_dyno = k.group(0).split("=")[1].strip()
            except:
                line_dyno = ""
            k = connect_regx.search(my_data)
            try:
                line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])
            except:
                line_connect_time = 0
            k = service_regx.search(my_data)
            try:
                line_service_time = float(k.group(0).split("=")[1].split("ms")[0])
            except:
                line_service_time = 0
            # End of getting all the data
            # Now match up the URL, guarded by a sanity check
            if(len(line_method) > 0 and len(line_path) > 0):
                url_denoter = 0
                if url1_regex.match(line_path) is not None:
                    url_denoter = 1
                elif url2_regex.match(line_path) is not None:
                    url_denoter = 2
                elif url3_regex.match(line_path) is not None:
                    url_denoter = 3
                elif url4_regex.match(line_path) is not None:
                    url_denoter = 4
                elif url5_6_regex.match(line_path) is not None:
                    url_denoter = 5
                # OK, so now we know which url pattern the current line matched
                if(url_denoter==1 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/count_pending_messages
                    """
                    url1[0] += 1
                    url1[1].append(line_dyno)
                    url1[2].append(line_connect_time + line_service_time)
                elif(url_denoter==2 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/get_messages
                    """
                    url2[0] += 1
                    url2[1].append(line_dyno)
                    url2[2].append(line_connect_time + line_service_time)
    """
    Now print the results!
    """
    # for GET /api/users/{user_id}/count_pending_messages
    print "\n------GET /api/users/{user_id}/count_pending_messages----\n"
    print "Number of times the url is called: ", url1[0]
    if(url1[0]>0):
        my_num_list = url1[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url1[1].count(x)) for x in set(url1[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1] > swap):
                swap = count[1]
                tar_dyno = count[0]
        print "Dyno that responded the most: ", tar_dyno
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"
    print "\n------GET /api/users/{user_id}/get_messages----\n"
    print "Number of times the url is called: ", url2[0]
    if(url2[0]>0):
        my_num_list = url2[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url2[1].count(x)) for x in set(url2[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1] > swap):
                swap = count[1]
                tar_dyno = count[0]
        print "Dyno that responded the most: ", tar_dyno
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"
    print "\n------GET /api/users/{user_id}/get_friends_progress----\n"
    print "Number of times the url is called: ", url3[0]
    # ... (reporting for the remaining urls continues in the same pattern)

if(__name__=="__main__"):
    parse_server_log(sys.argv[1])
Firstly, Python has a style guide and (unless you're given a different guide, in which case please provide a link to it) you should follow it. Your code generally follows it, but note that the imports are in the wrong order; it should be:

from collections import Counter
import re
import sys

import numpy as np

Note the alphabetical order, and the split between standard-library and third-party modules.
Repetition is a big issue when writing good code, and this was an immediate red flag:
url1 = [0, [], []]
url2 = [0, [], []]
url3 = [0, [], []]
url4 = [0, [], []]
url5 = [0, [], []]
url6 = [0, [], []]
Why not make a list, urls, containing these structures? You can cut this to one line, so if you change the structure later you only do it once:
urls = [[0, [], []] for _ in range(6)]
There are numerous other elements of repetition in your code, which can be reduced in a similar way.
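To make that concrete, here is a minimal sketch of one data-driven layout; the ENDPOINTS table, the record helper and the field names are my own invention for illustration, not part of the original code:

```python
import re

# Hypothetical data-driven replacement for url1..url6: one table maps an
# endpoint name to a compiled pattern, so adding an endpoint is a
# one-line change instead of a new variable plus a new elif branch.
ENDPOINTS = [
    ("count_pending_messages", re.compile(r"^/api/users/\d+/count_pending_messages$")),
    ("get_messages", re.compile(r"^/api/users/\d+/get_messages$")),
    ("get_friends_progress", re.compile(r"^/api/users/\d+/get_friends_progress$")),
    ("get_friends_score", re.compile(r"^/api/users/\d+/get_friends_score$")),
]

# One stats record per endpoint, built in a single comprehension.
stats = {name: {"count": 0, "dynos": [], "times": []} for name, _ in ENDPOINTS}

def record(path, dyno, response_time):
    """Update the stats of the first endpoint whose pattern matches."""
    for name, pattern in ENDPOINTS:
        if pattern.match(path):
            entry = stats[name]
            entry["count"] += 1
            entry["dynos"].append(dyno)
            entry["times"].append(response_time)
            return name
    return None
```

With this shape, the per-endpoint reporting can also become a single loop over stats instead of one copied block per URL.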
parse_server_log is much too long. You should split it into smaller, self-contained functions with sensible parameters and return values, which will make your code much easier to read, understand and maintain. Each should have a docstring explaining exactly what it does. This will also help in reducing repetition.
A bare except is a bad idea: at the very least you should use except Exception, but it is much better to figure out which errors could actually occur and handle them explicitly. See e.g. "The evils of except".
You should use string formatting, e.g.
print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
should be
print "Average response time: {0:.2f} in ms.".format(np.average(my_num_list))
Use tuple packing and unpacking, e.g.
for count in counts:
    if(count[1]> swap):
        swap = count[1]
        tar_dyno = count[0]
could be:
for dyno_, swap_ in counts:
    if swap_ > swap:
        swap, tar_dyno = swap_, dyno_
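Going one step further (this is my own suggestion, not part of the answer above), the tallying and the maximum-finding can both be delegated to collections.Counter; the dyno names below are made up:

```python
from collections import Counter

# Counter.most_common(1) returns a list holding the single most frequent
# (value, count) pair, replacing both the manual counting and the loop.
dynos = ["web.1", "web.2", "web.1", "web.3", "web.1"]
tar_dyno, swap = Counter(dynos).most_common(1)[0]
```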
Your file check

try:
    f = open(path_to_file)
except:
    print "\nInvalid file name and/or path. Please correct!\n"
    return

never closes the file, which stays open through the whole function. You can check whether a file exists using the os module; do that instead. Alternatively, take a look at this SO answer.
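For illustration, a hedged sketch of both styles; the helper names are mine, and which one fits depends on how you want failures reported:

```python
import os

def file_exists(path):
    # "Look before you leap": an explicit existence check with os.path.
    return os.path.isfile(path)

def read_log(path):
    # "Easier to ask forgiveness": just try to open it. The with block
    # guarantees the handle is closed, and IOError/OSError covers a
    # missing or unreadable file.
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError):
        return None
```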
- A small point re your last point: if the file can't be opened, there is nothing to close anyway. And it's usually better to try to open the file and handle the exception than to do an existence check beforehand. Even an existence check can be misleading, because the file might exist at one moment and be deleted the next. – Greg Hewgill, Nov 4, 2014
- @GregHewgill you're right, but if the file isn't there the function ends anyway, and the current arrangement doesn't protect against file deletion either. – jonrsharpe, Nov 4, 2014
- Isn't it better to use with open(path_to_file) as f wrapped in a try...except EnvironmentError, as suggested here? (Although the modified with from another answer there looks neat, too.) – Tobias Kienzler, Nov 5, 2014
- @TobiasKienzler thanks, I've added that into my answer. – jonrsharpe, Nov 5, 2014
- Also: you only need one regex, maybe two. – 300D7309EF17, Nov 7, 2014
+ in a regex means once or more. I think in your case you are mistaking it for a concatenation operator.
I think you could factorize a lot with the help of better regex. For example:
method_regx = re.compile("(method=)+[A-Z]+[\s]+")
# ...
k = method_regx.search(my_data)
try:
    line_method = k.group(0).split("=")[1].strip()
except:
    line_method = ""
This could be rewritten as:
method_regx = re.compile("method=([A-Z]+)[\s]+")
# ...
k = method_regx.search(my_data)
line_method = k.group(1) if k else ''
since k will be None if the search did not work.
Likewise, you can probably find a regex that does all the work and lets you remove the split from:

line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])
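As one possible sketch of such a pattern (it assumes the five fields always appear in this order on a line, which the original code does not require, and the sample line is invented):

```python
import re

# A single pattern with named groups pulls out all five fields at once;
# the numeric groups already exclude the trailing "ms", so no split() is
# needed afterwards.
LINE_RE = re.compile(
    r"method=(?P<method>[A-Z]+).*?"
    r"path=(?P<path>\S+).*?"
    r"dyno=(?P<dyno>\S+).*?"
    r"connect=(?P<connect>[\d.]+)ms.*?"
    r"service=(?P<service>[\d.]+)ms"
)

sample = 'method=GET path="/api/users/42/get_messages" dyno=web.1 connect=2ms service=50ms'
m = LINE_RE.search(sample)
fields = m.groupdict() if m else {}
```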
- Actually, no, I know the meaning of + in a regex context! But yes, I agree that it could have been better as you suggested. Thanks mate! – user3001408, Nov 5, 2014
- @user3001408 you could try using something like regex101.com/#python to help you develop more precise regular expressions. – jonrsharpe, Nov 5, 2014
Great answers already, but I wanted to jump in with one more point about file handling, which actually reflects just as much the point about how parse_server_log is too long. Consider the first bit of your method:
def parse_server_log(path_to_file):
    : : :
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    : : :
When this (or any) method is called, it can do some combination of things:
- read or write global data
- read or write to arguments that are passed to it
- return a single object
- throw an exception
In this case, you've chosen to make parse_server_log read an argument that is passed to it, read global data (the contents of the file), write to global data (via print statements), and return None. In some less common scenarios, or if there are bugs in your code, it may also throw an exception.
But as a consumer of this code, how can I tell what happened? Short of hacks like substituting a StringIO object for sys.stdout and then parsing it, I can't. All that's available is that, as a user invoking the script, I can read the output. This is very limiting. If parse_server_log does everything you want it to do, this is fine. It's a quality one-off script, and it can be modified to handle new needs that come up.
But it may be more useful to separate its concerns. parse_server_log sounds like it should return a data structure that represents the contents of the log. Another function could process that structure, and a third could summarize it with print statements. If it were structured like this, you would need to know whether parse_server_log succeeded or failed. This could be by changing what it returns, or by throwing an exception.
All of that is a very long-winded way of saying this: consider letting open(path_to_file) throw the exception without catching it. Then a client of parse_server_log can see the exception and choose how to handle it. In this case Python would just dump that information to the screen, but even that might tell you more than the message you chose to print in its place.
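A rough sketch of that separation; every name here is hypothetical, and the parsing is stubbed down to a trivial key=value format just to show the shape:

```python
# Hypothetical split: parsing returns data, summarizing computes numbers,
# and only the reporting layer prints, so each piece can be tested alone.
def parse_lines(lines):
    """Turn raw "key=value" log lines into (path, service_time) records."""
    records = []
    for line in lines:
        fields = dict(part.split("=", 1) for part in line.split())
        records.append((fields["path"], float(fields["service"])))
    return records

def summarize(records):
    """Reduce records to a path -> total-time mapping."""
    totals = {}
    for path, t in records:
        totals[path] = totals.get(path, 0.0) + t
    return totals

def report(totals):
    """Only this layer performs output."""
    for path, total in sorted(totals.items()):
        print("{0}: {1:.2f} ms".format(path, total))
```

If parse_lines hits a malformed line or a missing file, the exception now propagates to the caller, which can decide whether to print, skip, or abort.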
Just as a small (but slightly too large for a comment) footnote to the other answers, as convenient as numpy is, it might be kind of overkill just for computing the average and median of a list. Instead, you can easily do it yourself:
from math import fsum
a = sorted(my_num_list)
n = len(a)
average = fsum(a) / n
median = a[n/2]
or, if you want to get fancy for even-length lists (where the median is not always uniquely defined):
median = float(a[(n-1)/2] + a[n/2]) / 2
The main advantage of eliminating this dependency on numpy, besides slightly reduced memory use and faster loading time, is that your code will work even on systems where numpy may not be available for some reason. Also, to me, it just seems cleaner.
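As a sanity check (my own addition), the hand-rolled versions agree with Python 3's standard statistics module; note that the n/2 indexing above relies on Python 2 integer division, so the version below spells it // instead:

```python
from math import fsum
import statistics

# Hand-rolled average/median as in the snippet above, written with //
# so it also runs under Python 3; the sample times are made up.
my_num_list = [4.0, 1.0, 3.0, 2.0]
a = sorted(my_num_list)
n = len(a)
average = fsum(a) / n
median = (a[(n - 1) // 2] + a[n // 2]) / 2.0
```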
- Your regex (#)+(/api/users/)+[\d]+(#)+ indicates that each group can be present several times? I don't think + does what you think it does. Also, later you are not using the fact that you have groups.
- except statements should have explicit exception types. E.g., for catching non-existent files: except IOError:.