I'm working on a simple statistics problem with Pandas and sklearn. I'm aware that my code is ugly, but how can I improve it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv("sphist.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(["Date"], inplace=True)
df["day_5"] = np.nan
df["day_30"] = np.nan
df["std_5"] = np.nan
for i in range(30, len(df)):
    last_5 = df.iloc[i-5:i, 4]
    last_30 = df.iloc[i-30:i, 4]
    df.iloc[i, -3] = last_5.mean()
    df.iloc[i, -2] = last_30.mean()
    df.iloc[i, -1] = last_5.std()
df = df.iloc[30:]
df.dropna(axis=0, inplace=True)
train = df[df["Date"] < datetime(2013, 1, 1)]
test = df[df["Date"] >= datetime(2013, 1, 1)]
# print(train.head(), test.head())
X_cols = ["day_5", "day_30", "std_5"]
y_col = "Close"
lr = LinearRegression()
lr.fit(train[X_cols], train[y_col])
yhat = lr.predict(test[X_cols])
mse = mean_squared_error(yhat, test[y_col])
rmse = mse/len(yhat)
score = lr.score(test[X_cols], test[y_col])
print(rmse, score)
plt.scatter(yhat, test[y_col], c="k", s=1)
plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
plt.show()
- It relies on hard-coded iloc indices, which are hard to read and maintain. How can I change them to column names/row names?
- The code looks messy. Any advice on improving it?
Comment from 301_Moved_Permanently (Jan 23, 2019 at 8:50): It's hard to provide guidance on using column names without knowing what columns your CSV contains. Can you include an example (the first 10 lines, for instance) of your dataset?
1 Answer
functions
This is one long script. Partition the code into logical blocks, each in its own function. For example:
- get the raw data
- summarize the data
- split the data into train and test sets
- get the result from the regression
- plot the results
magical values
There are some magic values in your code, for example 4 as the column index and datetime(2013, 1, 1) as the threshold to split the data. Define them as variables (or as parameters of the functions).
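As a rough sketch of what that could look like: the constant names and the split_by_date helper below are hypothetical, and the column names "Date" and "Close" are assumed from the question's code, since the CSV layout is not shown.

```python
import pandas as pd

# Assumed names: "Close" and "Date" are the columns the question refers to;
# SPLIT_DATE mirrors the datetime(2013, 1, 1) threshold.
CLOSE_COLUMN = "Close"
DATE_COLUMN = "Date"
SPLIT_DATE = pd.Timestamp("2013-01-01")


def split_by_date(df, date_column=DATE_COLUMN, split_date=SPLIT_DATE):
    """Return (train, test): rows before split_date, then all remaining rows."""
    is_train = df[date_column] < split_date
    return df.loc[is_train], df.loc[~is_train]


# Tiny made-up frame, only to show the calls
prices = pd.DataFrame(
    {
        DATE_COLUMN: pd.to_datetime(["2012-12-28", "2012-12-31", "2013-01-02"]),
        CLOSE_COLUMN: [10.0, 11.0, 12.0],
    }
)
train, test = split_by_date(prices)
closes = prices[CLOSE_COLUMN]  # selected by name, instead of prices.iloc[:, 4]
```

Selecting by name keeps working if the column order in the CSV changes, which a positional index does not.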
dummy data
To illustrate this, I use this dummy data:
def generate_dummy_data(
    x_label="x",
    date_label="date",
    y_label="Close",
    size=100,
    seed=0,
    start="20120101",
    freq="7D",
):
    np.random.seed(seed)
    return pd.DataFrame(
        {
            y_label: np.random.randint(100, 200, size=size),
            x_label: np.random.randint(1000, 2000, size=size),
            # pd.date_range is used because newer pandas versions no longer
            # accept start/freq/periods in the DatetimeIndex constructor
            date_label: pd.date_range(start=start, freq=freq, periods=size),
        }
    )
summarize
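The rolling mean and std can be computed with built-in pandas functionality (Series.rolling). Note that rolling(5) includes the current row in each window, whereas your loop uses only the preceding rows; add .shift(1) if you need to reproduce that exactly.
You also change the raw data. It would be better to make this summary a separate DataFrame and leave the original data unaltered.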
def summarize(df, date_label, x_label, y_label="Close"):
    return pd.DataFrame(
        {
            y_label: df[y_label],
            date_label: df[date_label],
            "day_5": df[x_label].rolling(5).mean(),
            "std_5": df[x_label].rolling(5).std(),
            "day_30": df[x_label].rolling(30).mean(),
        }
    ).dropna()
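One detail worth checking when replacing the loop: the question's slice iloc[i-5:i] stops just before row i, while rolling(5) ends on row i. The small comparison below, on made-up data, sketches how shifting the rolling result by one row lines the two up; whether you want the current row included is a modelling choice, so treat this as an illustration rather than a required change.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, dtype=float))  # made-up data
window = 5

# Loop version, as in the question: the window ends just before row i
loop_mean = pd.Series(np.nan, index=values.index)
for i in range(window, len(values)):
    loop_mean.iloc[i] = values.iloc[i - window:i].mean()

# rolling() includes row i itself, so shift by one row to exclude it
rolling_mean = values.rolling(window).mean().shift(1)

# True when both series agree everywhere, treating NaN as equal to NaN
same = np.allclose(loop_mean, rolling_mean, equal_nan=True)
```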
regression
Here I followed PEP 8 and renamed X_cols to x_cols.
def regression(train, test, x_cols, y_col):
    lr = LinearRegression()
    lr.fit(train[x_cols], train[y_col])
    yhat = lr.predict(test[x_cols])
    # mean_squared_error expects (y_true, y_pred)
    mse = mean_squared_error(test[y_col], yhat)
    # RMSE is the square root of the MSE, not the MSE divided by the sample size
    rmse = np.sqrt(mse)
    score = lr.score(test[x_cols], test[y_col])
    return yhat, rmse, score
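For reference, the root mean squared error is the square root of the mean squared error, which keeps it in the same units as the target column. A minimal NumPy-only sketch of that definition follows; the helper name rmse_score is hypothetical and not part of sklearn.

```python
import numpy as np


def rmse_score(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up numbers: the errors are 3 and 4, so MSE = (9 + 16) / 2 = 12.5
example = rmse_score([10.0, 20.0], [13.0, 24.0])
```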
main guard
If you put the calling code behind if __name__ == "__main__":, you can import this script in other code without running the analysis, and reuse the functions.
if __name__ == "__main__":
    x_label = "x"
    date_label = "date"
    y_label = "Close"
    data = generate_dummy_data(
        x_label=x_label, date_label=date_label, y_label=y_label
    )
    summary = summarize(
        data, date_label=date_label, x_label=x_label, y_label=y_label
    )
    threshold = "20130101"
    train = summary.loc[summary[date_label] < threshold]
    test = summary.loc[summary[date_label] >= threshold]
    x_cols = ["day_5", "std_5", "day_30"]
    # y_label is the target column defined above
    yhat, rmse, score = regression(train, test, x_cols, y_label)
    print(x_cols, rmse, score)
    plt.scatter(yhat, test[y_label], c="k", s=1)
    plt.plot(
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        c="r",
    )
    plt.show()
If you want to compare what each of the 3 features (day_5, std_5, day_30) does individually, you can do something like this inside the main guard:
    for x_col in x_cols:
        # fit on a single feature at a time
        yhat, rmse, score = regression(train, test, [x_col], y_label)
        print(x_col, rmse, score)
        plt.scatter(yhat, test[y_label], c="k", s=1)
        plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
        plt.show()