How to turn this working A/B optimization program into more pythonic code?

Question 1

I have a working solution for the given problem of A/B Optimization, but you as an experienced Python (3) developer will agree with me, that this solution is not very pythonic at all.

My Question here is: How do I make this code more pythonic?

It's a very procedural process that one would expect from a Basic program, but not from a Python program. Please help me to become a better coder and make pythonic code by working through this example with me.

Here's the problem and my solution.

Problem

In this test you will be given a CSV containing the results of a website's A/B homepage test. The site is simultaneously testing 5 design variables (A, B, C, D, and E) and tracking if users clicked on the signup button (Z). Each line of the CSV represents a visitor to the homepage, what they saw, and if they clicked or not. The A-E variables will have values of 1-5, representing which version of that design element was shown to the user. The Z variable will contain either a 1 to represent that the user clicked signup, or a 0 to represent that the user did not.
A B C D E Z
1 3 1 2 4 1
3 2 1 5 5 0
4 3 2 2 5 1
5 5 2 3 4 0
2 4 1 3 4 0
...
Your task will be to determine the single combination of A-E values that will make users most likely to click the signup button. Assume the effects of each A-E value are mutually exclusive. If two values of a variable are equally optimal, choose the lesser value. You will enter the answer in the form of a 5-digit number, the first digit being the optimal A value, the second being the optimal B value, the third being the optimal C value, etc.

Solution

from urllib.request import urlopen
def getBestDesign(url):
 data = urlopen(url).readlines()
 A = {}; B = {}; C = {}; D = {}; E = {};
 A['1'] = 0
 A['2'] = 0
 A['3'] = 0
 A['4'] = 0
 A['5'] = 0
 B['1'] = 0
 B['2'] = 0
 B['3'] = 0
 B['4'] = 0
 B['5'] = 0
 C['1'] = 0
 C['2'] = 0
 C['3'] = 0
 C['4'] = 0
 C['5'] = 0
 D['1'] = 0
 D['2'] = 0
 D['3'] = 0
 D['4'] = 0
 D['5'] = 0
 E['1'] = 0
 E['2'] = 0
 E['3'] = 0
 E['4'] = 0
 E['5'] = 0
 for row in data:
 row = row.decode(encoding='UTF-8').split('\t')
 if row[5].startswith('1'):
 A[str(row[0])] += 1
 B[str(row[1])] += 1
 C[str(row[2])] += 1
 D[str(row[3])] += 1
 E[str(row[4])] += 1
 maxclicks_A = max(A.values())
 maxclicks_B = max(B.values())
 maxclicks_C = max(C.values())
 maxclicks_D = max(D.values())
 maxclicks_E = max(E.values())
 for k, v in A.items():
 if v == maxclicks_A:
 print('A: ', k)
 for k, v in B.items():
 if v == maxclicks_B:
 print('B: ', k)
 for k, v in C.items():
 if v == maxclicks_C:
 print('C: ', k)
 for k, v in D.items():
 if v == maxclicks_D:
 print('D: ', k)
 for k, v in E.items():
 if v == maxclicks_E:
 print('E: ', k)
if __name__ == "__main__":
 url = input('Url: ')
 getBestDesign(url)

Question 2

There is massive code duplication in your script. Basically you are doing the same things for A through E, so you could just put them in a list of dictionaries and use loops. Also, by using defaultdict you do not have to initialize them per hand (and even if not you could use a loop for this, too). Also, I'd recommend removing the code for transforming the URL to a 2D array from the function, and using just split() without '\t' the code works as well, and is somewhat more robust to other input formats.

from urllib.request import urlopen
from collections import defaultdict
def getBestDesign(data):
 # this list holds all your A, B, C, ... dicts, with defaultvalues
 ABCDE = [defaultdict(int) for i in range(5)]
 for row in data:
 if row[5] == '1':
 for i in range(5):
 ABCDE[i][row[i]] += 1
 for i in range(5):
 maxclicks = max(ABCDE[i].values())
 for k, v in ABCDE[i].items():
 if v == maxclicks:
 print('ABCDE'[i], ": ", k) # note the ''!
 # break # optional, stop after first "max" value
if __name__ == "__main__":
 url = input('Url: ')
 lines = urlopen(url).readlines()
 data = [line.decode(encoding='UTF-8').split() for line in lines]
 getBestDesign(data)

The algorithms is still basically the same as yours, and I think that it can still be improved. Probably you could use numpy to treat the input data as a matrix and do all sorts of useful matrix operations on the data. Further, you could probably use collections.Counter or itertools.groupby for counting the good combinations, reducing the whole thing to two or three lines.

Here's another version, even shorter, doing some preprocessing and using collections.Counter:

def getBestDesign(data):
 filtered = [d[:-1] for d in data if d[5] == '1']
 for i, d in enumerate(zip(*filtered)):
 key, maxclicks = Counter(d).most_common(1)[0]
 print("{0}: {1} ({2})".format("ABCDE"[i], key, maxclicks))

Let's explain this line by line:

the filtered list comprehension holds just the lines from data where the user has clicked; as this is already filtered, we can drop the last column with [:-1]
in the loop, we iterate the data columns -- zip(*...) basically transposes the matrix -- using enumerate, which yields the index i and the actual data d at that index
the Counter class is used to count each variant and to return the most_common one
finally, we print the result using string.format, using the index i to get the respective letter from the string "ABCDE"

Question 3

+1, but it seems wrong to me that you have one loop over ABCDE[i].values() to find maxclicks and then a second loop over ABCDE[i].items() to find the key. Better to do this in a single loop: k, maxclicks = max(ABCDE[i].items(), key=operator.itemgetter(1))

Question 4

@GarethRees You are right! When I wrote this, I thought "this seems wrong, there has to be a better way" all the time, but did not catch it. Thanks! (Then again, this way there will always be only one max value, whereas there may be multiple.)

tobias_k tobias_k 1,69914 silver badges18 bronze badges · Answer 1 · 2013-11-02 13:04:19Z

There is massive code duplication in your script. Basically you are doing the same things for A through E, so you could just put them in a list of dictionaries and use loops. Also, by using defaultdict you do not have to initialize them per hand (and even if not you could use a loop for this, too). Also, I'd recommend removing the code for transforming the URL to a 2D array from the function, and using just split() without '\t' the code works as well, and is somewhat more robust to other input formats.

from urllib.request import urlopen
from collections import defaultdict
def getBestDesign(data):
 # this list holds all your A, B, C, ... dicts, with defaultvalues
 ABCDE = [defaultdict(int) for i in range(5)]
 for row in data:
 if row[5] == '1':
 for i in range(5):
 ABCDE[i][row[i]] += 1
 for i in range(5):
 maxclicks = max(ABCDE[i].values())
 for k, v in ABCDE[i].items():
 if v == maxclicks:
 print('ABCDE'[i], ": ", k) # note the ''!
 # break # optional, stop after first "max" value
if __name__ == "__main__":
 url = input('Url: ')
 lines = urlopen(url).readlines()
 data = [line.decode(encoding='UTF-8').split() for line in lines]
 getBestDesign(data)

The algorithms is still basically the same as yours, and I think that it can still be improved. Probably you could use numpy to treat the input data as a matrix and do all sorts of useful matrix operations on the data. Further, you could probably use collections.Counter or itertools.groupby for counting the good combinations, reducing the whole thing to two or three lines.

Here's another version, even shorter, doing some preprocessing and using collections.Counter:

def getBestDesign(data):
 filtered = [d[:-1] for d in data if d[5] == '1']
 for i, d in enumerate(zip(*filtered)):
 key, maxclicks = Counter(d).most_common(1)[0]
 print("{0}: {1} ({2})".format("ABCDE"[i], key, maxclicks))

Let's explain this line by line:

the filtered list comprehension holds just the lines from data where the user has clicked; as this is already filtered, we can drop the last column with [:-1]
in the loop, we iterate the data columns -- zip(*...) basically transposes the matrix -- using enumerate, which yields the index i and the actual data d at that index
the Counter class is used to count each variant and to return the most_common one
finally, we print the result using string.format, using the index i to get the respective letter from the string "ABCDE"

+1, but it seems wrong to me that you have one loop over ABCDE[i].values() to find maxclicks and then a second loop over ABCDE[i].items() to find the key. Better to do this in a single loop: k, maxclicks = max(ABCDE[i].items(), key=operator.itemgetter(1))
@GarethRees You are right! When I wrote this, I thought "this seems wrong, there has to be a better way" all the time, but did not catch it. Thanks! (Then again, this way there will always be only one max value, whereas there may be multiple.)

Stack Exchange Network

How to turn this working A/B optimization program into more pythonic code?

Problem

Solution

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to turn this working A/B optimization program into more pythonic code?

Problem

Solution

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions