
I am working on a personal Machine-Learning (ML) project to predict the weather. Right now, I am working in a Jupyter Notebook. Eventually, I will turn it into a Flask app.

I have completed my code in the Jupyter Notebook and everything is working, but I am not sure whether I am doing everything the right way. Would you please review my code on GitHub? https://github.com/SteveAustin583/weather-prediction-ml/blob/main/weather-prediction-ml-stackexchange-feedback-implemented.ipynb

Here is my code:

# ## 1. Setup and Load Data
# %%
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import io # To load CSV from string in this environment
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib
# Display plots inline
%matplotlib inline
# Set some display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
def load_and_preprocess_data(file_path):
    """Loads the dataset and performs initial date conversion."""
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully:")
    df['date'] = pd.to_datetime(df['date'])  # Convert date column to datetime
    return df
# --- Load the dataset ---
file_path = 'seattle-weather.csv'
df = load_and_preprocess_data(file_path)
df.head()
# ## 2. Initial Data Exploration & Visualization (Kaggle Style)
# %%
print("\nDataset Info:")
df.info()
# %%
print("\nStatistical Summary:")
print(df.describe())
# %%
print("\nMissing Values Check:")
print(df.isnull().sum()) # Should be 0 for this dataset
print(f"\nAny NA values present: {df.isna().sum().any()}")
# %%
print("\nDuplicate Rows Check:")
print(f"Number of duplicated rows: {df.duplicated().sum()}") # Should be 0 for this dataset
# %%
print("\nDay with Minimum temp_min:")
print(df[df['temp_min']==min(df.temp_min)])
# %%
print("\nDay with Maximum temp_max:")
print(df[df['temp_max']==max(df.temp_max)])
# %%
# Define consistent bin edges for temperature histograms
temp_min_max = df[['temp_min', 'temp_max']].agg(['min', 'max']).values
all_temp_min = temp_min_max[0, 0]
all_temp_max = temp_min_max[0, 1] if temp_min_max[0, 1] > temp_min_max[1, 1] else temp_min_max[1, 1]
# Create bins with a width of 1 degree Celsius
bins = np.arange(np.floor(all_temp_min), np.ceil(all_temp_max) + 1, 1)
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_max', bins=bins, kde=True)
plt.title('Distribution of Maximum Temperature')
plt.xlabel('Max Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max()) # Set x-axis limits
plt.xticks(bins[::2]) # Show fewer ticks for clarity
plt.show()
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_min', bins=bins, kde=True)
plt.title('Distribution of Minimum Temperature')
plt.xlabel('Min Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max()) # Set x-axis limits
plt.xticks(bins[::2]) # Show fewer ticks for clarity
plt.show()
# %% [markdown]
# ### FacetGrid Visualizations (Month vs. Weather Variables by Year)
# First, extract 'year' and 'month' from the 'date' column (already converted to datetime during loading).
# %%
def create_visualization_df(dataframe):
    """Creates a copy of the dataframe for visualization and extracts year/month."""
    df_vis = dataframe.copy()
    df_vis['year'] = df_vis['date'].dt.year
    df_vis['month'] = df_vis['date'].dt.month
    return df_vis
df_vis = create_visualization_df(df)
# %%
# Max Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_max', errorbar=None) # errorbar=None to remove confidence intervals for clarity
g.set_axis_labels('Month', 'Max Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Max Temperature by Month for Each Year', y=1.03) # Add a main title
plt.tight_layout()
plt.show()
# %%
# Min Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_min', errorbar=None)
g.set_axis_labels('Month', 'Min Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Min Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
# Precipitation vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'precipitation', errorbar=None) # Lineplot might be better than scatter for trends
g.set_axis_labels('Month', 'Precipitation (mm)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Precipitation by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
# Wind Speed vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'wind', errorbar=None) # Lineplot for trends
g.set_axis_labels('Month', 'Wind Speed')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Wind Speed by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %% [markdown]
# ### Weather Category Distribution
# %%
print("\nWeather Category Counts:")
weather_counts = df['weather'].value_counts()
print(weather_counts)
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='weather', order=weather_counts.index, hue='weather', palette="viridis", legend=False)
plt.title('Distribution of Weather Types')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
# %%
plt.figure(figsize=(10, 8))
plt.pie(weather_counts, labels=weather_counts.index, autopct='%1.1f%%', startangle=140,
 colors=sns.color_palette("viridis", len(weather_counts)))
plt.title('Distribution of Weather Types (Pie Chart)')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# %% [markdown]
# ## 3. Data Preprocessing for Classification
# The Kaggle notebook drops 'year' and 'month' after visualization and does not use 'date'.
# It then label encodes 'weather' for the target variable.
# %%
# Create a working copy of the dataframe for preprocessing
df_processed = df.copy()
# Drop the 'date' column as it won't be used directly as a feature in this specific approach.
# Note: For more advanced time-series models, date components or the date itself could be crucial.
# The Kaggle example's feature set is ['temp_min', 'temp_max', 'precipitation', 'wind'].
if 'date' in df_processed.columns:
    df_processed = df_processed.drop('date', axis=1)
print("\nDataFrame columns before modeling:", df_processed.columns.tolist())
df_processed.head()
# %%
# Label Encode the target variable 'weather'
le = LabelEncoder()
df_processed['weather_encoded'] = le.fit_transform(df_processed['weather'])
# Display the mapping
print("\nLabel Encoding Mapping for 'weather':")
for i, class_name in enumerate(le.classes_):
    print(f"{class_name} -> {i}")
# %%
# Save the label encoder for use in the Flask app (to decode predictions)
joblib.dump(le, 'weather_label_encoder.joblib')
print("\nSaved weather_label_encoder.joblib")
df_processed.head()
# %% [markdown]
# ### Adding Lagged Time Series Features
# We'll create features based on the previous day's observations to potentially improve model performance.
# %%
# Sort by date before creating lagged features to ensure correct order
df_for_lagged = df.sort_values(by='date').copy()
# Label Encode the target variable 'weather' for the lagged feature
le_lag = LabelEncoder()
df_for_lagged['weather_encoded'] = le_lag.fit_transform(df_for_lagged['weather'])
# Create lagged features
df_for_lagged['precipitation_lag1'] = df_for_lagged['precipitation'].shift(1)
df_for_lagged['temp_max_lag1'] = df_for_lagged['temp_max'].shift(1)
df_for_lagged['temp_min_lag1'] = df_for_lagged['temp_min'].shift(1)
df_for_lagged['wind_lag1'] = df_for_lagged['wind'].shift(1)
df_for_lagged['weather_encoded_lag1'] = df_for_lagged['weather_encoded'].shift(1)
# You can also add delta features
df_for_lagged['delta_max_temp'] = df_for_lagged['temp_max'] - df_for_lagged['temp_max_lag1']
df_for_lagged['delta_min_temp'] = df_for_lagged['temp_min'] - df_for_lagged['temp_min_lag1']
# Drop rows with NaN values introduced by shifting (first row)
df_for_lagged = df_for_lagged.dropna().reset_index(drop=True)
print("\nDataFrame with Lagged Features:")
print(df_for_lagged.head())
# Use this df for training with lagged features
df_processed_lagged = df_for_lagged.drop(columns=['date', 'weather'])
# %% [markdown]
# ## 4. Feature Selection and Train-Test Split
# %%
# Original features based on Kaggle example
original_features = ['temp_min', 'temp_max', 'precipitation', 'wind']
X_original = df_processed[original_features]
y_original = df_processed['weather_encoded']
# Features including lagged data
lagged_features = ['temp_min', 'temp_max', 'precipitation', 'wind',
 'precipitation_lag1', 'temp_max_lag1', 'temp_min_lag1',
 'wind_lag1', 'weather_encoded_lag1',
 'delta_max_temp', 'delta_min_temp']
X_lagged = df_processed_lagged[lagged_features]
y_lagged = df_processed_lagged['weather_encoded'] # Target remains the same
# Store the feature names model will be trained on (for Flask app input)
# We will use the original features for the primary model saved for Flask
feature_names_for_model = X_original.columns.tolist()
joblib.dump(feature_names_for_model, 'classifier_feature_names.joblib')
print(f"Saved classifier_feature_names.joblib with features: {feature_names_for_model}")
# Split data - using random split as per Kaggle example for primary model
# stratify=y is good for imbalanced classes
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(
 X_original, y_original, test_size=0.2, random_state=42, stratify=y_original
)
print(f"\nOriginal X_train shape: {X_train_original.shape}, y_train shape: {y_train_original.shape}")
print(f"Original X_test shape: {X_test_original.shape}, y_test shape: {y_test_original.shape}")
# Split data for lagged features
X_train_lagged, X_test_lagged, y_train_lagged, y_test_lagged = train_test_split(
 X_lagged, y_lagged, test_size=0.2, random_state=42, stratify=y_lagged
)
print(f"\nLagged X_train shape: {X_train_lagged.shape}, y_train shape: {y_train_lagged.shape}")
print(f"Lagged X_test shape: {X_test_lagged.shape}, y_test shape: {y_test_lagged.shape}")
# %% [markdown]
# ## 5. Naïve Model (Climate Prediction)
# A simple baseline model that predicts the most frequent weather type for each month.
# %%
# Extract month from date for the naive model
df_naive = df.copy()
df_naive['month'] = df_naive['date'].dt.month
# Determine the most frequent weather type for each month
monthly_most_frequent_weather = df_naive.groupby('month')['weather'].agg(lambda x: x.mode()[0])
print("\nMost frequent weather type per month (Naïve Model):")
print(monthly_most_frequent_weather)
# Evaluate the naive model
# To do this properly, we'd need to simulate predictions for each day and compare
# For simplicity, let's just see what the overall accuracy would be if we predicted the most
# frequent weather type for each month on the entire dataset.
# This isn't a true test-set evaluation but gives an idea of a very simple baseline.
df_naive['predicted_weather_naive'] = df_naive['month'].map(monthly_most_frequent_weather)
naive_accuracy = accuracy_score(df_naive['weather'], df_naive['predicted_weather_naive'])
print(f"\nNaïve Model Accuracy (Predicting most frequent weather by month): {naive_accuracy:.4f}")
print("This simple model predicts the 'climate' for each month, rather than specific 'weather'.")
# ## 6. Model Training and Evaluation
# %%
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test, target_names, model_name="Model"):
    """Trains a model and prints evaluation metrics."""
    print(f"\n--- {model_name} Training and Evaluation ---")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred, target_names=target_names, zero_division=0)
    print(f"\nAccuracy: {accuracy:.4f}")
    print("\nConfusion Matrix:")
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.show()
    print("\nClassification Report:")
    print(classification_rep)
    return model, accuracy
# --- Gaussian Naive Bayes (Original Features) ---
nb_model_original, nb_accuracy_original = train_and_evaluate_model(
 GaussianNB(), X_train_original, y_train_original, X_test_original, y_test_original,
 le.classes_, "Gaussian Naive Bayes (Original Features)"
)
# --- Gaussian Naive Bayes (Lagged Features) ---
nb_model_lagged, nb_accuracy_lagged = train_and_evaluate_model(
 GaussianNB(), X_train_lagged, y_train_lagged, X_test_lagged, y_test_lagged,
 le_lag.classes_, "Gaussian Naive Bayes (Lagged Features)"
)
# --- Logistic Regression (Original Features) ---
lr_model_original, lr_accuracy_original = train_and_evaluate_model(
 LogisticRegression(max_iter=1000, random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
 le.classes_, "Logistic Regression (Original Features)"
)
# --- Support Vector Machine (Original Features) ---
# SVC defaults to the RBF kernel; a linear kernel (kernel='linear') is simpler and faster but less flexible.
# Adjust 'C' for regularization if needed.
svm_model_original, svm_accuracy_original = train_and_evaluate_model(
 SVC(random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
 le.classes_, "Support Vector Machine (Original Features)"
)
# %% [markdown]
# ## 7. Ablation Study
# Let's see how different features contribute to the model's performance (using Gaussian Naive Bayes with original features).
# %%
features_to_ablate = [
 ['temp_min', 'temp_max', 'precipitation', 'wind'],
 ['temp_min', 'temp_max', 'precipitation'],
 ['temp_min', 'temp_max', 'wind'],
 ['precipitation', 'wind'],
 ['temp_max'],
 ['wind']
]
ablation_results = {}
print("\n--- Ablation Study (Gaussian Naive Bayes with Original Features) ---")
for i, current_features in enumerate(features_to_ablate):
    print(f"\nTraining with features: {current_features}")
    X_ablation = df_processed[current_features]
    y_ablation = df_processed['weather_encoded']
    X_train_ab, X_test_ab, y_train_ab, y_test_ab = train_test_split(
        X_ablation, y_ablation, test_size=0.2, random_state=42, stratify=y_ablation
    )
    model = GaussianNB()
    model.fit(X_train_ab, y_train_ab)
    y_pred_ab = model.predict(X_test_ab)
    accuracy_ab = accuracy_score(y_test_ab, y_pred_ab)
    ablation_results[tuple(current_features)] = accuracy_ab
    print(f"Accuracy: {accuracy_ab:.4f}")
print("\n--- Ablation Study Summary ---")
for features, acc in ablation_results.items():
    print(f"Features: {features} -> Accuracy: {acc:.4f}")
# %% [markdown]
# ## 8. Save the Model for Flask App
# We'll save the best performing model (or the original Gaussian Naive Bayes as initially planned) and the label encoder.
# For demonstration, we'll save the original Gaussian Naive Bayes model.
# %%
# Save the Gaussian Naive Bayes model (using original features)
joblib.dump(nb_model_original, 'weather_prediction_model.joblib')
print("\nSaved weather_prediction_model.joblib (Gaussian Naive Bayes with original features)")
print("\nNotebook execution complete!")

The code that I have produced here is based on feedback that I received from my previous question regarding this project. Here is the link: ML Project on Predicting Weather App

I just want to know whether I have implemented the feedback accurately and whether I am doing everything the right way.

asked May 25 at 16:46

2 Answers


Again, this looks good.

Burying the to_datetime() call down in the loading helper is nice; it keeps things organized.

Nice use of the "viridis" palette.

tuple unpack

This seems slightly inconvenient.

temp_min_max = df[['temp_min', 'temp_max']].agg(['min', 'max']).values
all_temp_min = temp_min_max[0, 0]
all_temp_max = temp_min_max[0, 1] if temp_min_max[0, 1] > temp_min_max[1, 1] else temp_min_max[1, 1]

The {0, 1} subscripts are clear enough, but we could make this easier to read.

High level goal: We generally prefer to name things rather than use cryptic indexes like [1]. For example given a Point p, prefer to unpack x, y = p, or refer to p.x and p.y, instead of p[0] and p[1].
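A tiny illustration of that idea (just an aside; the Point here is hypothetical, built with namedtuple):

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)
x, y = p          # unpack into meaningful names
print(p.x, p.y)   # or use attribute access, instead of p[0] and p[1]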

Here we might assign t_min, t_max = df[ ... ].agg(['min', 'max']).values, except the meaning of min / max has become ambiguous. Probably better to aggregate twice:

t_min = min(pd.concat([df.temp_min, df.temp_max]))
t_max = max(pd.concat([df.temp_min, df.temp_max]))

One could use numpy's min() / max() for the same effect. It's probably clearer if we use two source code lines. More scans, but hey, they're cheap.
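For instance, a sketch of the numpy flavor (same result, operating on both temperature columns at once):

t_min = np.min(df[['temp_min', 'temp_max']].values)
t_max = np.max(df[['temp_min', 'temp_max']].values)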

In any event, computing the overall min and max looks good. It's much better than hardcoded bounds, as now this code can be applied to other datasets.

lagged features

The "Create lagged features" section works, but it includes a tedious amount of copy-n-paste. Bury this logic in a helper function, so temps like the lagged dataframe will go out-of-scope. (Maybe call it lagged_df, or df_lagged?)

The helper's signature can have a keyword default of ... , days_lagged=1):

Those five similar lines are just crying out for a loop:

    cols = [
        'precipitation',
        'temp_max',
        'temp_min',
        'wind',
        'weather_encoded',
    ]
    for col in cols:
        df_for_lagged[f'{col}_lag{days_lagged}'] = df_for_lagged[col].shift(days_lagged)
    return df_for_lagged
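Putting the signature and the loop together, a minimal sketch (the name add_lagged_features is illustrative; it assumes the notebook's column names and that 'weather_encoded' has already been added):

def add_lagged_features(dataframe, days_lagged=1):
    """Return a date-sorted copy with lagged and delta features; drops rows made NaN by shifting."""
    df_lagged = dataframe.sort_values(by='date').copy()
    cols = ['precipitation', 'temp_max', 'temp_min', 'wind', 'weather_encoded']
    for col in cols:
        df_lagged[f'{col}_lag{days_lagged}'] = df_lagged[col].shift(days_lagged)
    df_lagged['delta_max_temp'] = df_lagged['temp_max'] - df_lagged[f'temp_max_lag{days_lagged}']
    df_lagged['delta_min_temp'] = df_lagged['temp_min'] - df_lagged[f'temp_min_lag{days_lagged}']
    return df_lagged.dropna().reset_index(drop=True)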

Prefer an identifier of le_lagged, in the interest of consistency.


Extra credit:

You might consider training a Random Forest classifier on all this data. You've already taken a stab at seeing how e.g. an SVM stacks up against other models, and at determining which features are the most informative.

There are automated techniques for identifying how informative features are, and one of the simplest to use is Random Forest. It is constrained to choose only a subset of the offered features, so it will naturally prune decision trees which rely on uninformative features.

It's a very interpretable modeling technique, easier to explain to stakeholders than e.g. the separating hyperplane of an SVM. Just look at the top decision node: that's the best feature, the one that explains most of the model's performance. Then look at the features used by the child nodes to see what the next most interesting features are. This can help guide feature engineering, such as the temperature deltas you added, and can help prioritize funding sensors that will actually make a difference, such as installing new thermometers or anemometers.
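A minimal sketch of that experiment, assuming scikit-learn and the original train/test split from the notebook (the hyperparameters are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train_original, y_train_original)
print(f"Random Forest accuracy: {rf_model.score(X_test_original, y_test_original):.4f}")

# Impurity-based importances; larger values mean the feature drives more of the splits
importances = pd.Series(rf_model.feature_importances_, index=X_train_original.columns)
print(importances.sort_values(ascending=False))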

answered May 25 at 20:01

DRY

I see this pattern throughout the code, where you print a newline followed by some text, followed by a colon:

print("\nDataset Info:")

You could create a simple function for that:

def print_header(message):
    print(f"\n{message}:")

This allows you flexibility if you want to change the format of these output lines.
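For example, the earlier print would become:

print_header("Dataset Info")  # prints a blank line, then "Dataset Info:"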

Layout

Move the functions to the top after the import lines. Having them in the middle of the code interrupts the natural flow of the code (from a human readability standpoint).

Comments

The comments in the code are helpful. You might consider removing the numbering since it will be hard to maintain if you need to add or remove steps:

# ## 7. Ablation Study
answered May 25 at 19:08
  • I have implemented your feedback, along with J_H's feedback, in a new Jupyter notebook file. Would you please take a look at it? Have I implemented your feedback appropriately? github.com/SteveAustin583/weather-prediction-ml/blob/main/… (Commented May 26 at 8:04)
  • @SteveAustin: I took a look at the code on GitHub, and it looks like you have implemented the feedback properly. I am unfamiliar with Jupyter, so I don't understand how the Jupyter code maps to the code you posted in the question, but it looks fine. (Commented May 26 at 10:26)
