I have a working solution for the given problem of A/B Optimization, but you as an experienced Python (3) developer will agree with me, that this solution is not very pythonic at all.
My Question here is: How do I make this code more pythonic?
It's a very procedural process that one would expect from a Basic program, but not from a Python program. Please help me to become a better coder and make pythonic code by working through this example with me.
Here's the problem and my solution.
Problem
In this test you will be given a CSV containing the results of a website's A/B homepage test. The site is simultaneously testing 5 design variables (A, B, C, D, and E) and tracking if users clicked on the signup button (Z). Each line of the CSV represents a visitor to the homepage, what they saw, and if they clicked or not. The A-E variables will have values of 1-5, representing which version of that design element was shown to the user. The Z variable will contain either a 1 to represent that the user clicked signup, or a 0 to represent that the user did not.
A B C D E Z 1 3 1 2 4 1 3 2 1 5 5 0 4 3 2 2 5 1 5 5 2 3 4 0 2 4 1 3 4 0 ...
Your task will be to determine the single combination of A-E values that will make users most likely to click the signup button. Assume the effects of each A-E value are mutually exclusive. If two values of a variable are equally optimal, choose the lesser value. You will enter the answer in the form of a 5-digit number, the first digit being the optimal A value, the second being the optimal B value, the third being the optimal C value, etc.
Solution
from urllib.request import urlopen
def getBestDesign(url):
data = urlopen(url).readlines()
A = {}; B = {}; C = {}; D = {}; E = {};
A['1'] = 0
A['2'] = 0
A['3'] = 0
A['4'] = 0
A['5'] = 0
B['1'] = 0
B['2'] = 0
B['3'] = 0
B['4'] = 0
B['5'] = 0
C['1'] = 0
C['2'] = 0
C['3'] = 0
C['4'] = 0
C['5'] = 0
D['1'] = 0
D['2'] = 0
D['3'] = 0
D['4'] = 0
D['5'] = 0
E['1'] = 0
E['2'] = 0
E['3'] = 0
E['4'] = 0
E['5'] = 0
for row in data:
row = row.decode(encoding='UTF-8').split('\t')
if row[5].startswith('1'):
A[str(row[0])] += 1
B[str(row[1])] += 1
C[str(row[2])] += 1
D[str(row[3])] += 1
E[str(row[4])] += 1
maxclicks_A = max(A.values())
maxclicks_B = max(B.values())
maxclicks_C = max(C.values())
maxclicks_D = max(D.values())
maxclicks_E = max(E.values())
for k, v in A.items():
if v == maxclicks_A:
print('A: ', k)
for k, v in B.items():
if v == maxclicks_B:
print('B: ', k)
for k, v in C.items():
if v == maxclicks_C:
print('C: ', k)
for k, v in D.items():
if v == maxclicks_D:
print('D: ', k)
for k, v in E.items():
if v == maxclicks_E:
print('E: ', k)
if __name__ == "__main__":
url = input('Url: ')
getBestDesign(url)
1 Answer 1
There is massive code duplication in your script. Basically you are doing the same things for A
through E
, so you could just put them in a list of dictionaries and use loops. Also, by using defaultdict
you do not have to initialize them per hand (and even if not you could use a loop for this, too). Also, I'd recommend removing the code for transforming the URL to a 2D array from the function, and using just split()
without '\t'
the code works as well, and is somewhat more robust to other input formats.
from urllib.request import urlopen
from collections import defaultdict
def getBestDesign(data):
# this list holds all your A, B, C, ... dicts, with defaultvalues
ABCDE = [defaultdict(int) for i in range(5)]
for row in data:
if row[5] == '1':
for i in range(5):
ABCDE[i][row[i]] += 1
for i in range(5):
maxclicks = max(ABCDE[i].values())
for k, v in ABCDE[i].items():
if v == maxclicks:
print('ABCDE'[i], ": ", k) # note the ''!
# break # optional, stop after first "max" value
if __name__ == "__main__":
url = input('Url: ')
lines = urlopen(url).readlines()
data = [line.decode(encoding='UTF-8').split() for line in lines]
getBestDesign(data)
The algorithms is still basically the same as yours, and I think that it can still be improved. Probably you could use numpy
to treat the input data as a matrix and do all sorts of useful matrix operations on the data. Further, you could probably use collections.Counter
or itertools.groupby
for counting the good combinations, reducing the whole thing to two or three lines.
Here's another version, even shorter, doing some preprocessing and using collections.Counter
:
def getBestDesign(data):
filtered = [d[:-1] for d in data if d[5] == '1']
for i, d in enumerate(zip(*filtered)):
key, maxclicks = Counter(d).most_common(1)[0]
print("{0}: {1} ({2})".format("ABCDE"[i], key, maxclicks))
Let's explain this line by line:
- the
filtered
list comprehension holds just the lines fromdata
where the user has clicked; as this is already filtered, we can drop the last column with[:-1]
- in the loop, we iterate the data columns --
zip(*...)
basically transposes the matrix -- usingenumerate
, which yields the indexi
and the actual datad
at that index - the
Counter
class is used to count each variant and to return themost_common
one - finally, we
print
the result usingstring.format
, using the indexi
to get the respective letter from the string"ABCDE"
-
\$\begingroup\$ +1, but it seems wrong to me that you have one loop over
ABCDE[i].values()
to findmaxclicks
and then a second loop overABCDE[i].items()
to find the key. Better to do this in a single loop:k, maxclicks = max(ABCDE[i].items(), key=operator.itemgetter(1))
\$\endgroup\$Gareth Rees– Gareth Rees2013年11月04日 10:16:16 +00:00Commented Nov 4, 2013 at 10:16 -
\$\begingroup\$ @GarethRees You are right! When I wrote this, I thought "this seems wrong, there has to be a better way" all the time, but did not catch it. Thanks! (Then again, this way there will always be only one max value, whereas there may be multiple.) \$\endgroup\$tobias_k– tobias_k2013年11月04日 10:49:20 +00:00Commented Nov 4, 2013 at 10:49