I am working on a basic, personal ML project on predicting weather. I am starting in a Jupyter Notebook; then I will create a Flask app.
I have just completed the notebook code. Would you please review my code on GitHub to see if I am doing everything right?
https://github.com/SteveAustin583/weather-prediction-ml/blob/main/weather-prediction-ml.ipynb
Here is my code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib
# Display plots inline
%matplotlib inline
# Set some display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# %%
# --- Load the dataset ---
file_path = 'seattle-weather.csv'
df = pd.read_csv(file_path)
print("Dataset loaded successfully:")
df.head()
# ## 2. Initial Data Exploration & Visualization (Kaggle Style)
# %%
print("\nDataset Info:")
df.info()
# %%
print("\nStatistical Summary:")
print(df.describe())
# %%
print("\nMissing Values Check:")
print(df.isnull().sum()) # Should be 0 for this dataset
print(f"\nAny NA values present: {df.isna().sum().any()}")
# %%
print("\nDuplicate Rows Check:")
print(f"Number of duplicated rows: {df.duplicated().sum()}") # Should be 0 for this dataset
# %%
print("\nDay with Minimum temp_min:")
print(df[df['temp_min'] == df['temp_min'].min()])
# %%
print("\nDay with Maximum temp_max:")
print(df[df['temp_max'] == df['temp_max'].max()])
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_max', bins=20, kde=True)
plt.title('Distribution of Maximum Temperature')
plt.xlabel('Max Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_min', bins=20, kde=True)
plt.title('Distribution of Minimum Temperature')
plt.xlabel('Min Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
# ### FacetGrid Visualizations (Month vs. Weather Variables by Year)
# First, convert 'date' to datetime and extract 'year' and 'month'.
# %%
df_vis = df.copy() # Create a copy for visualization to keep original df clean for now
df_vis['date'] = pd.to_datetime(df_vis['date'])
df_vis['year'] = df_vis['date'].dt.year
df_vis['month'] = df_vis['date'].dt.month
# Max Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_max', errorbar=None) # errorbar=None to remove confidence intervals for clarity
g.set_axis_labels('Month', 'Max Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Max Temperature by Month for Each Year', y=1.03) # Add a main title
plt.tight_layout()
plt.show()
# Min Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_min', errorbar=None)
g.set_axis_labels('Month', 'Min Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Min Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# Precipitation vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'precipitation', errorbar=None) # Lineplot might be better than scatter for trends
g.set_axis_labels('Month', 'Precipitation (mm)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Precipitation by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# Wind Speed vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'wind', errorbar=None) # Lineplot for trends
g.set_axis_labels('Month', 'Wind Speed')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Wind Speed by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# ### Weather Category Distribution
# %%
print("\nWeather Category Counts:")
weather_counts = df['weather'].value_counts()
print(weather_counts)
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='weather', order=weather_counts.index, palette="viridis")
plt.title('Distribution of Weather Types')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
# %%
plt.figure(figsize=(10, 8))
plt.pie(weather_counts, labels=weather_counts.index, autopct='%1.1f%%', startangle=140,
        colors=sns.color_palette("viridis", len(weather_counts)))
plt.title('Distribution of Weather Types (Pie Chart)')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# Drop the 'date' column as it won't be used directly as a feature in this specific approach.
# Note: For more advanced time-series models, date components or the date itself could be crucial.
# The Kaggle example's feature set is ['temp_min', 'temp_max', 'precipitation', 'wind'].
if 'date' in df.columns:
    df = df.drop('date', axis=1)
print("\nDataFrame columns before modeling:", df.columns.tolist())
df.head()
# %%
# Label Encode the target variable 'weather'
le = LabelEncoder()
df['weather_encoded'] = le.fit_transform(df['weather'])
# Display the mapping
print("\nLabel Encoding Mapping for 'weather':")
for i, class_name in enumerate(le.classes_):
    print(f"{class_name} -> {i}")
# Save the label encoder for use in the Flask app (to decode predictions)
joblib.dump(le, 'weather_label_encoder.joblib')
print("\nSaved weather_label_encoder.joblib")
df.head()
# ## 4. Feature Selection and Train-Test Split
# Based on the Kaggle example, features are 'temp_min', 'temp_max', 'precipitation', 'wind'.
# Target is the encoded 'weather'.
# %%
X = df[['temp_min', 'temp_max', 'precipitation', 'wind']]
y = df['weather_encoded'] # Use the numerically encoded weather column
# Store the feature names model will be trained on (for Flask app input)
feature_names_for_model = X.columns.tolist()
joblib.dump(feature_names_for_model, 'classifier_feature_names.joblib')
print(f"Saved classifier_feature_names.joblib with features: {feature_names_for_model}")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # stratify=y is good for imbalanced classes
print(f"\nX_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
# ## 5. Model Training (Gaussian Naive Bayes)
# %%
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
print("Gaussian Naive Bayes model trained.")
# Save the trained nb_model; this is the most crucial piece for the Flask app.
joblib.dump(nb_model, 'weather_predictor_model.joblib')
print("\nSaved weather_predictor_model.joblib")
# ## 6. Model Evaluation
# %%
y_pred = nb_model.predict(X_test)
y_pred_proba = nb_model.predict_proba(X_test) # Get probabilities for each class
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Use le.classes_ to get original string labels for the classification report
classification_rep = classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0)
print(f"\nAccuracy: {accuracy:.4f}") # Increased precision for accuracy
print("\nConfusion Matrix:")
# For better visualization of confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print("\nClassification Report:")
print(classification_rep)
You can find the dataset of my ML project here on Kaggle: https://www.kaggle.com/datasets/ananthr1/weather-prediction
- Ultra minor quibble: Ask Google the difference between "predicting" and "forecasting". – Fe2O3, May 24 at 22:07
- @Fe2O3 A forecast is like a scientific projection with a solid basis and a recognition of uncertainty. On the other hand, a prediction can be anything from an educated guess to a highly informed statement, but it doesn't always carry the same expectation of rigorous methodology or probabilistic outcome as a forecast. – Steve Austin, May 24 at 23:00
- I'm "way out of my depth" here. It just seemed to this layman that the purpose aligned more with forecasting than "consulting the chicken gizzards" to make a prediction... Cheers! :-) – Fe2O3, May 24 at 23:09
- Here you may find other SE communities where your interests could reach more users whose interests are closely aligned with your pursuits. Cross Validated and/or Data Science? Welcome to the machine! – Fe2O3, May 25 at 0:06
1 Answer
Looks good. I didn't notice any data leakage.
future warning

> Passing `palette` without assigning `hue` is deprecated
That should be an easy one to tidy up; just follow what the diagnostic advises you to do.
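For example (a sketch assuming seaborn 0.13+, where categorical plots accept a `legend=` argument): assign the `x` variable to `hue` and drop the now-redundant legend, as the warning suggests.

```python
# Same chart, warning-free: hue mirrors x, and the legend is suppressed
# since it would only repeat the x-axis labels.
sns.countplot(data=df, x='weather', hue='weather',
              order=weather_counts.index, palette="viridis", legend=False)
```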
evaluation
Consider putting together a naïve model. Yes, yes, I know you have a GaussianNB model. I'm talking about something even simpler, a Farmer's Almanac type of thing, which can be printed years in advance.
Your model accepts today's `wind` and some other observations, and outputs whether today is sunny. What if the model was "blind" to such observations, and could only make a prediction based on today's date? Essentially predicting climate rather than weather. That would give you a better basis for identifying what the GaussianNB model had learned from the weather station observations.
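Here is a minimal sketch of such a climatology baseline. It assumes the `df_vis` frame (with its `month` column) from your visualization cells is still around; `month_mode` and `baseline_pred` are just illustrative names.

```python
from sklearn.model_selection import train_test_split

# Date-blind baseline: predict each calendar month's most common weather,
# learned from the training split only.
clim = df_vis[['month', 'weather']]
train, test = train_test_split(clim, test_size=0.2, random_state=42,
                               stratify=clim['weather'])
month_mode = train.groupby('month')['weather'].agg(lambda s: s.mode().iloc[0])
baseline_pred = test['month'].map(month_mode)
print(f"Climatology baseline accuracy: {(baseline_pred == test['weather']).mean():.4f}")
```

If GaussianNB can't comfortably beat that number, the weather-station features aren't earning their keep.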
ablation study
More generally, you can strip individual features, or groups of features such as temperature, from the model. This lets you identify how adding e.g. `wind` or `temp_max` improves model performance.

Imagine you were going to set up weather station(s) near Vancouver. Would it be worth spending money on an anemometer to gather the `wind` feature? Or is it mostly uninformative, and you're better off spending the money just on temperature sensors?
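A rough pass might look like this, reusing the `X_train`/`X_test` split from your notebook (a sketch, not a full ablation harness):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Retrain with each feature withheld in turn, and compare test accuracy
# against the full four-feature model.
full_features = ['temp_min', 'temp_max', 'precipitation', 'wind']
for dropped in full_features:
    kept = [f for f in full_features if f != dropped]
    acc = accuracy_score(
        y_test, GaussianNB().fit(X_train[kept], y_train).predict(X_test[kept]))
    print(f"without {dropped:>13}: accuracy = {acc:.4f}")
```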
lagged time series
Models thrive on reading lots of data, and we can generate additional feature data from the original spreadsheet. You have four years of daily observations, which is very nice. At inference time the model is told of today's observations, only.
In a more realistic setting we might know what the last few days of weather have been, and we're asked to predict whether today is sunny. Essentially, "do the observations predict a change from yesterday's sunniness?"
To accomplish this you can produce a 1-day lagged time series having features you've found to be informative, such as yesterday's temperature, and having a column for whether yesterday was sunny. No, this isn't data leakage.

You might also find that some modeling techniques do better if you digest the features slightly, like by adding a `delta_max_temp` column which shows how many degrees warmer it is today compared to yesterday.
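A sketch of the 1-day lag, assuming the rows are still in chronological order (true for this CSV); the derived column names are illustrative:

```python
# Build yesterday's observations as features via shift(1).
df_lag = df.copy()
df_lag['temp_max_yesterday'] = df_lag['temp_max'].shift(1)
df_lag['weather_yesterday'] = df_lag['weather'].shift(1)
df_lag['delta_max_temp'] = df_lag['temp_max'] - df_lag['temp_max_yesterday']
df_lag = df_lag.dropna()  # the first row has no "yesterday"
```

Note that `weather_yesterday` is a string column, so it needs label or one-hot encoding before a model can consume it.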
Adding in 2-day lagged features might also produce a little lift in model performance.
alternate modeling techniques
The great thing about sklearn is that it offers so many models that expose the same public API. Consider pitting your current model against a newly trained SVM, or perhaps a logistic regressor.
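For instance (a sketch; the scaling pipeline is my addition, since SVMs and logistic regression are sensitive to feature scale while GaussianNB is not):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Same split, same metric, three interchangeable estimators.
candidates = {
    'GaussianNB': GaussianNB(),
    'SVC': make_pipeline(StandardScaler(), SVC()),
    'LogisticRegression': make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(f"{name:>18}: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```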
globals
Jupyter notebooks can be very convenient -- they let you tinker without re-reading a giant dataset or having to spend time re-training. But the downside is you tend to wind up with tons of module-level variables, which produces undesirably high coupling.
Consider using `def` to define helper functions which produce e.g. `df_vis`. The nice thing about a `return df_vis` statement is that all of the function's local temp variables disappear once they're out of scope, so you don't have to spend brain cycles worrying about them any more.
While reading your notebook I found myself vertically scrolling quite a bit, asking what the provenance of this or that variable was, "where did it come from?" and "was aggregation applied to it?".
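For example (a sketch; `make_vis_frame` is just an illustrative name, wrapping the prep code you already have):

```python
import pandas as pd

def make_vis_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a datetime 'date' plus 'year'/'month' columns."""
    out = df.copy()
    out['date'] = pd.to_datetime(out['date'])
    out['year'] = out['date'].dt.year
    out['month'] = out['date'].dt.month
    return out

df_vis = make_vis_frame(df)
```

Everything except `df_vis` itself vanishes when the function returns, so the global namespace stays small.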
bin labels
Here is a display nit. The "Distribution of Minimum Temperature" chart is easy to read. Each bin width is clearly 1.25 °C.
The corresponding chart for `temp_max` is harder to read than it needs to be. Consider hard-coding the limits so each bin has a width of 1 °C or a similarly convenient figure. While you're at it, consider applying the same hard-coded limits to both charts, so they're directly comparable when viewing them next to each other.
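Something along these lines (a sketch; the −10 °C to 40 °C range is an assumption, so widen it if the data falls outside):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Shared 1 °C bin edges so both histograms are directly comparable.
temp_bins = np.arange(-10, 41, 1)
for col, label in [('temp_max', 'Maximum'), ('temp_min', 'Minimum')]:
    plt.figure(figsize=(12, 6))
    sns.histplot(data=df, x=col, bins=temp_bins, kde=True)
    plt.title(f'Distribution of {label} Temperature')
    plt.xlabel(f'{label} Temperature (°C)')
    plt.ylabel('Frequency')
    plt.show()
```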
BTW, kudos on carefully labeling everything displayed.
- Thank you. I will make the necessary adjustments that you have suggested and get back to you. – Steve Austin, May 24 at 19:05
- I have tried implementing your feedback in a separate Jupyter notebook file. Would you please review my effort? github.com/SteveAustin583/weather-prediction-ml/blob/main/… – Steve Austin, May 24 at 19:28
- I have created a separate question: codereview.stackexchange.com/questions/297195/… Let me know if I have managed to implement your feedback correctly. – Steve Austin, May 25 at 16:50