Using NumPy to scale data in 2 out of 3 columns

Question

The code below takes a CSV containing age, weight, and height, and writes the betas determined through linear regression to an output CSV. It runs 10 times, using a different alpha (learning rate) each time, and writes the results of each run on a separate row of the output CSV.

In particular, I'm wondering if there is a cleaner way to scale the data (relevant snippet copied below)? I scale both the features (age, weight) and the label (height), and then create a new list that concatenates the scaled data from the features and the original (non-scaled) data for the label.

def scale_data(self):
    # new array = (x_i - mean) / stdv
    sd = np.std(self.data, axis=0)
    mean = np.mean(self.data, axis=0)
    trans = np.transpose(self.data)
    scaled = [np.divide([np.subtract(trans[i], j)
                         for i, j in enumerate(mean)][n], m)
              for n, m in enumerate(sd)]
    return np.concatenate((scaled[0:2],
                           [np.transpose(self.data)[-1]]), axis=0)

I'm new to NumPy and pretty new to programming, so I'd also welcome general feedback on the program overall!

Full program:

import sys
import numpy as np
import csv
import itertools

CONDITIONS = [(0.001,100),
              (0.005,100),
              (0.01,100),
              (0.05,100),
              (0.1,100),
              (0.5,100),
              (1.,100),
              (5.,100),
              (10.,100),
              (.75,1000)]

class LinearRegression(object):

    def __init__(self, inp, out, iterations, weights=None):
        self.input = inp
        self.output = out
        self.iterations = iterations
        self.weights = weights or [0.0, 0.0, 0.0]
        self.data = None
        self.features = None
        self.labels = None

    def get_data(self):
        self.data = [map(float, i) for i in csv.reader(open(self.input))]

    def scale_data(self):
        # new array = (x_i - mean) / stdv
        sd = np.std(self.data, axis=0)
        mean = np.mean(self.data, axis=0)
        trans = np.transpose(self.data)
        scaled = [np.divide([np.subtract(trans[i], j)
                             for i, j in enumerate(mean)][n], m)
                  for n, m in enumerate(sd)]
        return np.concatenate((scaled[0:2],
                               [np.transpose(self.data)[-1]]), axis=0)

    def set_features_labels(self):
        self.get_data()
        scaled = self.scale_data()
        ones_data = np.append([np.ones(len(self.data))], scaled, axis=0)
        self.features = np.transpose(ones_data[:-1])
        self.labels = np.transpose(ones_data[-1])

    def predict_next(self, rate):
        for i in range(self.iterations):
            error = np.dot(self.features, self.weights) - self.labels
            error_j = np.transpose([[error[i] * self.features[i][j]
                                     for j in range(len(self.weights))]
                                    for i in range(len(self.features))])
            for i in range(len(self.weights)):
                self.weights[i] -= rate*(1./len(self.features))*sum(error_j[i])
        self.outprocess(rate)

    def outprocess(self, rate):
        out = open(self.output, 'a+')
        output_data = map(str,([rate, self.iterations] + self.weights))
        out.write(','.join(output_data) + '\n')

def main(argv):
    try:
        inp, out = argv
    except:
        print 'useage: problem2.py input2.csv output2.csv'
        sys.exit(2)
    open(out, 'w').close()
    for c in CONDITIONS:
        rate, iterations = c
        lr = LinearRegression(inp, out, iterations, [0.0, 0.0, 0.0])
        lr.set_features_labels()
        lr.predict_next(rate)

if __name__ == "__main__":
    main(sys.argv[1:])

Input data:

2,10.21027,0.8052476
2.04,13.07613,0.9194741
2.13,11.44697,0.9083505
2.21,14.43984,0.8037555
2.29,12.59622,0.811357
2.38,10.5199,0.9489974
2.46,12.89517,0.9664505
2.54,12.11692,0.9288403
2.63,16.76085,0.86205
2.71,11.20934,0.9811632
2.79,13.48913,0.9883778
2.88,11.85574,1.004748
2.96,11.54332,1.001915
3.04,12.90222,0.9934899
3.13,13.03542,0.8974875
3.21,11.88202,0.8887256
3.29,11.99685,0.932307
3.38,11.82981,0.937784
3.46,12.70158,1.05032
3.54,19.58748,1.056727
3.63,16.46093,0.9821247
3.71,15.20721,0.91031
3.79,15.37263,1.065316
3.88,14.29485,1.02835
3.96,13.47689,0.9255748
4.04,13.61116,0.9306862
4.13,13.21864,1.101614
4.21,13.02441,0.9921132
4.29,18.04961,1.114548
4.38,18.25533,1.063936
4.46,13.40907,1.008634
4.54,15.51193,1.044635
4.63,15.66975,1.140624
4.71,17.28859,0.9824303
4.79,14.29081,1.062029
4.88,21.63373,1.100134
4.96,14.20687,1.166945
5.04,14.34277,1.161301
5.13,20.16834,1.009289
5.21,25.58315,1.091316
5.29,18.58571,1.097202
5.38,14.8925,1.025529
5.46,16.06749,1.076194
5.54,15.56413,1.114876
5.63,25.83467,1.187193
5.71,17.81035,1.09323
5.79,15.58975,1.069648
5.88,16.83304,1.173466
5.96,15.87089,1.232684
6.04,16.43608,1.057615
6.13,22.90029,1.245461
6.21,28.0358,1.251804
6.29,29.97981,1.167406
6.38,26.52102,1.089877
6.46,21.35797,1.215447
6.54,17.3225,1.118232
6.63,29.70296,1.123483
6.71,30.04782,1.110545
6.79,20.11027,1.309295
6.88,18.73445,1.108909
6.96,22.64564,1.175754
7.04,18.22805,1.149112
7.13,18.38153,1.135427
7.21,32.20984,1.339077
7.29,18.16912,1.330256
7.38,19.73432,1.313696
7.46,19.00511,1.173512
7.54,27.35114,1.159026
7.63,22.04564,1.367356
7.71,19.48502,1.168069
7.79,19.64775,1.263869
7.88,19.23024,1.368565
7.96,20.96755,1.388876
8.04,33.19435,1.318567
8.13,20.31464,1.384097
8.21,29.81185,1.213855
8.29,20.65887,1.218047
8.38,26.82559,1.414521
8.46,40.94614,1.301148

Answer

Looking at the scale_data() method:

Use NumPy arrays and vectorized operations:

The heart of scale_data() is the double list comprehension:

# new array = (x_i - mean) / stdv
sd = np.std(self.data, axis=0)
mean = np.mean(self.data, axis=0)
trans = np.transpose(self.data)
scaled = [np.divide([np.subtract(trans[i], j)
                     for i, j in enumerate(mean)][n], m)
          for n, m in enumerate(sd)]

If you hold the data in a NumPy array, those operations reduce to:

data = np.array(self.data)
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

This is transposed relative to your result (here rows are samples and columns are age, weight, height), and it scales the label column as well, but hopefully it shows the possibilities. Sorry, I didn't take the time to understand what was going on with the concat, but the heart of any improvement in the method is above.
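For the "2 out of 3 columns" part, here is a minimal sketch, assuming the data is an (n_rows, 3) array with the label in the last column. The function name scale_features and its n_features parameter are made up for illustration; the sample rows are the first three lines of the input data. It standardizes only the feature columns and leaves the label untouched, so no concatenation is needed:

```python
import numpy as np


def scale_features(data, n_features=2):
    """Standardize the first n_features columns; leave the rest unchanged."""
    data = np.asarray(data, dtype=float)
    scaled = data.copy()
    cols = data[:, :n_features]
    # Broadcasting subtracts each column's mean and divides by its std.
    scaled[:, :n_features] = (cols - cols.mean(axis=0)) / cols.std(axis=0)
    return scaled


# First three rows of the input data (age, weight, height).
sample = np.array([[2.00, 10.21027, 0.8052476],
                   [2.04, 13.07613, 0.9194741],
                   [2.13, 11.44697, 0.9083505]])
result = scale_features(sample)
```

If the rest of the program still expects one row per CSV column, as the original scale_data() returns, result.T should give that layout.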

Why numpy arrays?

I started to write something on this, but instead just go here:

https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists
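As one illustration taken from the program itself, the nested comprehensions in predict_next() could probably be replaced by a single matrix product per step. This is only a sketch under assumptions: features is taken to already include the leading column of ones, and the function name gradient_descent and the toy data are invented for the example rather than taken from the question.

```python
import numpy as np


def gradient_descent(features, labels, rate, iterations):
    """Batch gradient descent for least squares, one matrix product per step."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.zeros(features.shape[1])
    n_rows = len(features)
    for _ in range(iterations):
        error = features.dot(weights) - labels
        # features.T.dot(error) sums error[i] * features[i][j] over i for each j.
        weights -= rate * features.T.dot(error) / n_rows
    return weights


# Made-up toy data: labels = 1 + 2 * x, with a leading column of ones.
x = np.array([-1.0, 0.0, 1.0])
features = np.column_stack((np.ones_like(x), x))
labels = 1.0 + 2.0 * x
weights = gradient_descent(features, labels, rate=0.5, iterations=100)
```

The update is the same simultaneous step as in the original loops; the array version just lets NumPy do the summation.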

The NumPy docs also have a section on broadcasting.
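A small sketch of what broadcasting does in the one-liner above, using made-up numbers: a (3, 2) array minus a (2,) array of column means is computed as if the means were repeated on every row.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
col_means = data.mean(axis=0)   # shape (2,): one mean per column
centered = data - col_means     # (3, 2) - (2,) broadcasts to (3, 2)
```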

Stephen Rauch · Accepted Answer · 2017-06-27 04:32:29Z
