How could I improve the following code, which runs a simple linear regression using matrix algebra? I import a .csv file (link here) called 'cdd.ny.csv' and perform the matrix calculations that solve for the coefficients (intercept and slope) of $Y = XB$, i.e., $B = (X'X)^{-1}X'Y$:
import numpy
from numpy import *
import csv
df1 = csv.reader(open('cdd.ny.csv', 'rb'),delimiter=',')
tmp = list(df1)
b = numpy.array(tmp).astype('string')
b1 = b[1:,3:5]
b2 = numpy.array(b1).astype('float')
nrow = b1.shape[0]
intercept = ones( (nrow,1), dtype=int16 )
b3 = empty( (nrow,1), dtype = float )
i = 0
while i < nrow:
    b3[i,0] = b2[i,0]
    i = i + 1
X = numpy.concatenate((intercept, b3), axis=1)
X = matrix(X)
Y = b2[:,1]
Y = matrix(Y).T
m1 = dot(X.T,X).I
m2 = dot(X.T,Y)
beta = m1*m2
print beta
#[[-7.62101913]
# [ 0.5937734 ]]
To check my answer:
numpy.linalg.lstsq(X,Y)
1 Answer
import numpy
from numpy import *
import csv
df1 = csv.reader(open('cdd.ny.csv', 'rb'),delimiter=',')
tmp = list(df1)
b = numpy.array(tmp).astype('string')
b1 = b[1:,3:5]
b2 = numpy.array(b1).astype('float')
Firstly, I'd avoid all these abbreviated variable names; they make your code hard to follow. You can also combine many of these lines:
b2 = numpy.array(list(df1))[1:,3:5].astype('float')
That way we avoid creating so many variables.
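As a rough sketch of what that could look like with more descriptive names: the snippet below reads a small in-memory stand-in for the CSV and pulls columns 3 and 4 out as floats in one step. The column headers and values are invented for illustration (the real file's contents are not shown in the question), and the names sample_file, rows and data are hypothetical.

```python
import csv
import io

import numpy as np

# Invented stand-in for 'cdd.ny.csv': one header row, numeric data in columns 3 and 4.
# The real file's column names and values are not shown in the question.
sample_file = io.StringIO(
    "id,year,month,x,y\n"
    "a,2010,1,10.0,1.5\n"
    "b,2010,2,20.0,2.5\n"
    "c,2010,3,30.0,3.5\n"
)

rows = list(csv.reader(sample_file, delimiter=","))

# Skip the header row, keep columns 3 and 4, and convert the strings to floats.
data = np.array(rows)[1:, 3:5].astype(float)

print(data.shape)
```

With the real file, sample_file would be replaced by the opened 'cdd.ny.csv'.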
nrow = b1.shape[0]
intercept = ones( (nrow,1), dtype=int16 )
b3 = empty( (nrow,1), dtype = float )
i = 0
while i < nrow:
    b3[i,0] = b2[i,0]
    i = i + 1
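The whole while loop (and the empty() call before it) can be replaced by b3 = b2[:,0]. Add .reshape(-1, 1) to keep b3 as an (nrow, 1) column, which the concatenate call requires.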
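One caveat, raised in the comments on this answer: b2[:,0] is one-dimensional, while intercept has shape (nrow, 1), so concatenating them along axis=1 fails. A small sketch with made-up numbers shows two ways to keep the column two-dimensional; the names data, flat, column and reshaped are illustrative only.

```python
import numpy as np

# Made-up values standing in for b2 (first column: regressor, second: response).
data = np.array([[10.0, 1.0],
                 [20.0, 2.0],
                 [30.0, 3.0]])

flat = data[:, 0]                      # integer index drops the axis: shape (3,)
column = data[:, 0:1]                  # slice keeps the axis: shape (3, 1)
reshaped = data[:, 0].reshape(-1, 1)   # same result via reshape: shape (3, 1)

intercept = np.ones((data.shape[0], 1))

# Both inputs are now 2-D, so concatenating along axis=1 works.
X = np.concatenate((intercept, column), axis=1)

print(flat.shape, column.shape, X.shape)
```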
X = numpy.concatenate((intercept, b3), axis=1)
X = matrix(X)
If you really want to use matrix, combine these two lines. But really, it's probably better to use plain arrays rather than matrix.
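For what the array-based version might look like, here is a minimal sketch on made-up, exactly linear data (not the data from the question). With plain arrays, * is element-wise, so the matrix products go through dot, and numpy.linalg.inv takes the place of the matrix-only .I attribute.

```python
import numpy as np

# Made-up data that is exactly linear: intercept 2, slope 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept.
X = np.column_stack((np.ones_like(x), x))

# Normal equations with plain arrays: beta = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(np.dot(X.T, X))
beta = np.dot(XtX_inv, np.dot(X.T, y))

print(beta)
```

Because the data are exactly linear, beta should come out as the intercept and slope used to generate y, up to floating-point rounding.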
Y = b2[:,1]
Y = matrix(Y).T
m1 = dot(X.T,X).I
m2 = dot(X.T,Y)
beta = m1*m2
print beta
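Putting the pieces together, one possible array-only version of the fit is sketched below. It is not the code from the question: the helper name fit_simple_ols and the sample numbers are made up, and it swaps the explicit inverse for numpy.linalg.solve on the normal equations, which gives the same coefficients without forming the inverse. The result is compared against numpy.linalg.lstsq, the same check used in the question.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Return (beta, X) for y = b0 + b1*x, solved via the normal equations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) next to the regressor.
    X = np.column_stack((np.ones(len(x)), x))
    # Solve (X'X) beta = X'y directly instead of inverting X'X.
    beta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
    return beta, X


# Made-up sample data, roughly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta, X = fit_simple_ols(x, y)

# Cross-check against NumPy's least-squares routine.
check = np.linalg.lstsq(X, np.asarray(y), rcond=None)[0]
print(np.allclose(beta, check))
```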
Thanks! However, the line
X = numpy.concatenate((intercept, b3), axis=1)
now gives the error "ValueError: arrays must have same number of dimensions"; this is the reason I added the while loop. Any way around this? (baha-kev, Feb 26, 2012 at 17:50)
@baha-kev, use
b3 = b2[:,0].reshape(-1, 1)
(Winston Ewert, Feb 26, 2012 at 18:39)
Thanks; you mention it's probably better to use arrays. How do you invert an array? The
.I
attribute only works on matrix objects. (baha-kev, Feb 26, 2012 at 19:16)
@baha-kev, use the
numpy.linalg.inv
function. (Winston Ewert, Feb 26, 2012 at 19:27)