\$\begingroup\$

The code below takes a CSV containing age, weight, and height, and writes the betas determined through linear regression to an output CSV. It runs 10 times, using a different alpha (learning rate) for each run, and writes the results of each run on a separate row of the output CSV.

In particular, I'm wondering if there is a cleaner way to scale the data (relevant snippet copied below)? I scale both the features (age, weight) and the label (height), and then create a new list that concatenates the scaled data from the features and the original (non-scaled) data for the label.

def scale_data(self):
    # new array = (x_i - mean) / stdv
    sd = np.std(self.data, axis=0)
    mean = np.mean(self.data, axis=0)
    trans = np.transpose(self.data)
    scaled = [np.divide([np.subtract(trans[i], j)
                         for i, j in enumerate(mean)][n], m)
              for n, m in enumerate(sd)]
    return np.concatenate((scaled[0:2],
                           [np.transpose(self.data)[-1]]), axis=0)

I'm new to numpy and pretty new to programming, so I also welcome general feedback on the program overall!

Full program:

import sys
import numpy as np
import csv
import itertools

CONDITIONS = [(0.001,100),
              (0.005,100),
              (0.01,100),
              (0.05,100),
              (0.1,100),
              (0.5,100),
              (1.,100),
              (5.,100),
              (10.,100),
              (.75,1000)]

class LinearRegression(object):

    def __init__(self, inp, out, iterations, weights=None):
        self.input = inp
        self.output = out
        self.iterations = iterations
        self.weights = weights or [0.0, 0.0, 0.0]
        self.data = None
        self.features = None
        self.labels = None

    def get_data(self):
        self.data = [map(float, i) for i in csv.reader(open(self.input))]

    def scale_data(self):
        # new array = (x_i - mean) / stdv
        sd = np.std(self.data, axis=0)
        mean = np.mean(self.data, axis=0)
        trans = np.transpose(self.data)
        scaled = [np.divide([np.subtract(trans[i], j)
                             for i, j in enumerate(mean)][n], m)
                  for n, m in enumerate(sd)]
        return np.concatenate((scaled[0:2],
                               [np.transpose(self.data)[-1]]), axis=0)

    def set_features_labels(self):
        self.get_data()
        scaled = self.scale_data()
        ones_data = np.append([np.ones(len(self.data))], scaled, axis=0)
        self.features = np.transpose(ones_data[:-1])
        self.labels = np.transpose(ones_data[-1])

    def predict_next(self, rate):
        for i in range(self.iterations):
            error = np.dot(self.features, self.weights) - self.labels
            error_j = np.transpose([[error[i] * self.features[i][j]
                                     for j in range(len(self.weights))]
                                    for i in range(len(self.features))])
            for i in range(len(self.weights)):
                self.weights[i] -= rate*(1./len(self.features))*sum(error_j[i])
        self.outprocess(rate)

    def outprocess(self, rate):
        out = open(self.output, 'a+')
        output_data = map(str,([rate, self.iterations] + self.weights))
        out.write(','.join(output_data) + '\n')

def main(argv):
    try:
        inp, out = argv
    except:
        print 'usage: problem2.py input2.csv output2.csv'
        sys.exit(2)
    open(out, 'w').close()
    for c in CONDITIONS:
        rate, iterations = c
        lr = LinearRegression(inp, out, iterations, [0.0, 0.0, 0.0])
        lr.set_features_labels()
        lr.predict_next(rate)

if __name__ == "__main__":
    main(sys.argv[1:])

Input data:

2,10.21027,0.8052476
2.04,13.07613,0.9194741
2.13,11.44697,0.9083505
2.21,14.43984,0.8037555
2.29,12.59622,0.811357
2.38,10.5199,0.9489974
2.46,12.89517,0.9664505
2.54,12.11692,0.9288403
2.63,16.76085,0.86205
2.71,11.20934,0.9811632
2.79,13.48913,0.9883778
2.88,11.85574,1.004748
2.96,11.54332,1.001915
3.04,12.90222,0.9934899
3.13,13.03542,0.8974875
3.21,11.88202,0.8887256
3.29,11.99685,0.932307
3.38,11.82981,0.937784
3.46,12.70158,1.05032
3.54,19.58748,1.056727
3.63,16.46093,0.9821247
3.71,15.20721,0.91031
3.79,15.37263,1.065316
3.88,14.29485,1.02835
3.96,13.47689,0.9255748
4.04,13.61116,0.9306862
4.13,13.21864,1.101614
4.21,13.02441,0.9921132
4.29,18.04961,1.114548
4.38,18.25533,1.063936
4.46,13.40907,1.008634
4.54,15.51193,1.044635
4.63,15.66975,1.140624
4.71,17.28859,0.9824303
4.79,14.29081,1.062029
4.88,21.63373,1.100134
4.96,14.20687,1.166945
5.04,14.34277,1.161301
5.13,20.16834,1.009289
5.21,25.58315,1.091316
5.29,18.58571,1.097202
5.38,14.8925,1.025529
5.46,16.06749,1.076194
5.54,15.56413,1.114876
5.63,25.83467,1.187193
5.71,17.81035,1.09323
5.79,15.58975,1.069648
5.88,16.83304,1.173466
5.96,15.87089,1.232684
6.04,16.43608,1.057615
6.13,22.90029,1.245461
6.21,28.0358,1.251804
6.29,29.97981,1.167406
6.38,26.52102,1.089877
6.46,21.35797,1.215447
6.54,17.3225,1.118232
6.63,29.70296,1.123483
6.71,30.04782,1.110545
6.79,20.11027,1.309295
6.88,18.73445,1.108909
6.96,22.64564,1.175754
7.04,18.22805,1.149112
7.13,18.38153,1.135427
7.21,32.20984,1.339077
7.29,18.16912,1.330256
7.38,19.73432,1.313696
7.46,19.00511,1.173512
7.54,27.35114,1.159026
7.63,22.04564,1.367356
7.71,19.48502,1.168069
7.79,19.64775,1.263869
7.88,19.23024,1.368565
7.96,20.96755,1.388876
8.04,33.19435,1.318567
8.13,20.31464,1.384097
8.21,29.81185,1.213855
8.29,20.65887,1.218047
8.38,26.82559,1.414521
8.46,40.94614,1.301148
200_success
asked Jun 26, 2017 at 14:19
\$\endgroup\$

1 Answer

\$\begingroup\$

Looking at the scale_data() method:

Use numpy arrays and vectorized operations:

The heart of scale_data() is the double list comprehension:

# new array = (x_i - mean) / stdv
sd = np.std(self.data, axis=0)
mean = np.mean(self.data, axis=0)
trans = np.transpose(self.data)
scaled = [np.divide([np.subtract(trans[i], j)
                     for i, j in enumerate(mean)][n], m)
          for n, m in enumerate(sd)]

If you use the data as a numpy array, you can simply do those operations as:

data = np.array(self.data) 
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

I think this is transposed relative to your result (here each row is a sample, whereas your version returns one row per column of the input), but hopefully it shows the possibilities. Sorry, I didn't take the time to work out what the concatenation was doing, but the heart of any improvement to the method is above.
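From the question's description, the concatenation appears to keep the last column (height) unscaled while standardising the first two. A minimal sketch of that idea with array slicing, assuming the data is an (n, 3) array with the label in the last column. The function name `scale_features` is hypothetical, and the sample rows are the first three lines of the question's input data:

```python
import numpy as np


def scale_features(data):
    """Standardise every column except the last one (the label).

    Hypothetical helper: assumes `data` is array-like of shape (n, k)
    with the label in the final column.
    """
    data = np.asarray(data, dtype=float)
    features = data[:, :-1]
    # (n, k-1) minus (k-1,) broadcasts the column means across all rows.
    scaled = (features - features.mean(axis=0)) / features.std(axis=0)
    # Re-attach the untouched label column on the right.
    return np.column_stack((scaled, data[:, -1]))


# First three rows of the input data from the question.
sample = [[2.00, 10.21027, 0.8052476],
          [2.04, 13.07613, 0.9194741],
          [2.13, 11.44697, 0.9083505]]
result = scale_features(sample)
```

Here `result` keeps one row per sample. Taking `result.T` should give the one-row-per-column orientation that the original scale_data() returns.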

Why numpy arrays?

I started to write something on this, but instead just go here:

https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists

The numpy docs also have a section on broadcasting.
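As a rough illustration of what broadcasting does in the scaling expression (toy numbers, not taken from the question): subtracting a shape-(3,) array of column means from a shape-(4, 3) array applies the subtraction to every row, with no explicit loop.

```python
import numpy as np

# Toy data: 4 rows, 3 columns (values are made up for illustration).
data = np.arange(12, dtype=float).reshape(4, 3)

col_means = data.mean(axis=0)   # shape (3,)
centered = data - col_means     # (4, 3) - (3,) broadcasts across rows

# The same computation written as an explicit loop over rows.
looped = np.array([row - col_means for row in data])
```

Both `centered` and `looped` hold the same values. The broadcast version just leaves the looping to numpy.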

answered Jun 27, 2017 at 4:32
\$\endgroup\$