
I am working on a basic, personal ML project on predicting the weather. I am starting in a Jupyter notebook; afterwards I will wrap the model in a Flask app.

I have just completed the code in the Jupyter notebook. Would you please review my code on GitHub to see if I am doing everything right?

https://github.com/SteveAustin583/weather-prediction-ml/blob/main/weather-prediction-ml.ipynb

Here is my code:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib
# Display plots inline
%matplotlib inline
# Set some display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# %%
# --- Load the dataset ---
file_path = 'seattle-weather.csv'
df = pd.read_csv(file_path)
print("Dataset loaded successfully:")
df.head()
# ## 2. Initial Data Exploration & Visualization (Kaggle Style)
# %%
print("\nDataset Info:")
df.info()
# %%
print("\nStatistical Summary:")
print(df.describe())
# %%
print("\nMissing Values Check:")
print(df.isnull().sum()) # Should be 0 for this dataset
print(f"\nAny NA values present: {df.isna().sum().any()}")
# %%
print("\nDuplicate Rows Check:")
print(f"Number of duplicated rows: {df.duplicated().sum()}") # Should be 0 for this dataset
# %%
print("\nDay with Minimum temp_min:")
print(df[df['temp_min'] == df['temp_min'].min()])
# %%
print("\nDay with Maximum temp_max:")
print(df[df['temp_max'] == df['temp_max'].max()])
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_max', bins=20, kde=True)
plt.title('Distribution of Maximum Temperature')
plt.xlabel('Max Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_min', bins=20, kde=True)
plt.title('Distribution of Minimum Temperature')
plt.xlabel('Min Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
# ### FacetGrid Visualizations (Month vs. Weather Variables by Year)
# First, convert 'date' to datetime and extract 'year' and 'month'.
# %%
df_vis = df.copy() # Create a copy for visualization to keep original df clean for now
df_vis['date'] = pd.to_datetime(df_vis['date'])
df_vis['year'] = df_vis['date'].dt.year
df_vis['month'] = df_vis['date'].dt.month
# Max Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_max', errorbar=None) # errorbar=None to remove confidence intervals for clarity
g.set_axis_labels('Month', 'Max Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Max Temperature by Month for Each Year', y=1.03) # Add a main title
plt.tight_layout()
plt.show()
# Min Temperature vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_min', errorbar=None)
g.set_axis_labels('Month', 'Min Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Min Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# Precipitation vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'precipitation', errorbar=None) # Lineplot might be better than scatter for trends
g.set_axis_labels('Month', 'Precipitation (mm)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Precipitation by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# Wind Speed vs. Month by Year
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'wind', errorbar=None) # Lineplot for trends
g.set_axis_labels('Month', 'Wind Speed')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Wind Speed by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# ### Weather Category Distribution
# %%
print("\nWeather Category Counts:")
weather_counts = df['weather'].value_counts()
print(weather_counts)
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='weather', order=weather_counts.index, palette="viridis")
plt.title('Distribution of Weather Types')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
# %%
plt.figure(figsize=(10, 8))
plt.pie(weather_counts, labels=weather_counts.index, autopct='%1.1f%%', startangle=140,
        colors=sns.color_palette("viridis", len(weather_counts)))
plt.title('Distribution of Weather Types (Pie Chart)')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# Drop the 'date' column as it won't be used directly as a feature in this specific approach.
# Note: For more advanced time-series models, date components or the date itself could be crucial.
# The Kaggle example's feature set is ['temp_min', 'temp_max', 'precipitation', 'wind'].
if 'date' in df.columns:
    df = df.drop('date', axis=1)
print("\nDataFrame columns before modeling:", df.columns.tolist())
df.head()
# %%
# Label Encode the target variable 'weather'
le = LabelEncoder()
df['weather_encoded'] = le.fit_transform(df['weather'])
# Display the mapping
print("\nLabel Encoding Mapping for 'weather':")
for i, class_name in enumerate(le.classes_):
    print(f"{class_name} -> {i}")
# Save the label encoder for use in the Flask app (to decode predictions)
joblib.dump(le, 'weather_label_encoder.joblib')
print("\nSaved weather_label_encoder.joblib")
df.head()
# ## 4. Feature Selection and Train-Test Split
# Based on the Kaggle example, features are 'temp_min', 'temp_max', 'precipitation', 'wind'.
# Target is the encoded 'weather'.
# %%
X = df[['temp_min', 'temp_max', 'precipitation', 'wind']]
y = df['weather_encoded'] # Use the numerically encoded weather column
# Store the feature names model will be trained on (for Flask app input)
feature_names_for_model = X.columns.tolist()
joblib.dump(feature_names_for_model, 'classifier_feature_names.joblib')
print(f"Saved classifier_feature_names.joblib with features: {feature_names_for_model}")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # stratify=y is good for imbalanced classes
print(f"\nX_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
# ## 5. Model Training (Gaussian Naive Bayes)
# %%
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
print("Gaussian Naive Bayes model trained.")
# Save the trained model; the Flask app will load this artifact to make predictions
joblib.dump(nb_model, 'weather_predictor_model.joblib')
print("\nSaved weather_predictor_model.joblib")
# ## 6. Model Evaluation
# %%
y_pred = nb_model.predict(X_test)
y_pred_proba = nb_model.predict_proba(X_test) # Get probabilities for each class
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Use le.classes_ to get original string labels for the classification report
classification_rep = classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0)
print(f"\nAccuracy: {accuracy:.4f}") # Increased precision for accuracy
print("\nConfusion Matrix:")
# For better visualization of confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print("\nClassification Report:")
print(classification_rep)

You can find the dataset for my ML project here on Kaggle: https://www.kaggle.com/datasets/ananthr1/weather-prediction

asked May 24 at 16:52
  • Ultra minor quibble: Ask Google the difference between "predicting" and "forecasting". (Commented May 24 at 22:07)
  • @Fe2O3 A forecast is like a scientific projection with a solid basis and a recognition of uncertainty. A prediction, on the other hand, can be anything from an educated guess to a highly informed statement, but it doesn't always carry the same expectation of rigorous methodology or probabilistic outcome as a forecast. (Commented May 24 at 23:00)
  • I'm "way out of my depth" here. It just seemed to this layman that the purpose aligned more with forecasting than with "consulting the chicken gizzards" to make a prediction... Cheers! :-) (Commented May 24 at 23:09)
  • Here you may find other SE communities where your interests could reach more users whose pursuits are closely aligned with yours: Cross Validated and/or Data Science? Welcome to the machine! (Commented May 25 at 0:06)

1 Answer

Looks good. I didn't notice any data leakage.

future warning

Passing palette without assigning hue is deprecated

That should be an easy one to tidy up; just follow what the diagnostic advises you to do.
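
For reference, a minimal sketch of the fix the warning suggests: assign the same column to hue and suppress the now-redundant legend (this assumes seaborn 0.13 or later).

# Assign the x column to hue as well, as the FutureWarning advises,
# and suppress the redundant legend.
sns.countplot(data=df, x='weather', hue='weather',
              order=weather_counts.index, palette="viridis", legend=False)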

evaluation

Consider putting together a naïve model. Yes, yes, I know you have a GaussianNB model. I'm talking about something even simpler, a Farmer's Almanac type of thing, which can be printed years in advance.

Your model accepts today's wind and some other observations, and outputs whether today is sunny. What if the model was "blind" to such observations, and could only make a prediction based on today's date? Essentially predicting climate rather than weather.

That would give you a better basis for identifying what the GaussianNB model had learned from the weather station observations.
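
As a sketch of such an almanac baseline (reusing the df_vis frame with its month column; for simplicity this scores on the whole dataset, though a fair comparison would fit on the training years only):

# Date-only baseline: predict each month's most frequent weather label,
# ignoring every weather-station observation.
monthly_mode = df_vis.groupby('month')['weather'].agg(lambda s: s.mode().iloc[0])
almanac_pred = df_vis['month'].map(monthly_mode)
print(f"Almanac baseline accuracy: {accuracy_score(df_vis['weather'], almanac_pred):.4f}")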

ablation study

More generally, you can strip individual features, or groups of features such as temperature, from the model. This lets you identify how adding e.g. wind or temp_max improves model performance.

Imagine you were going to set up weather station(s) near Vancouver. Would it be worth spending money on an anemometer to gather the wind feature? Or is it mostly uninformative and you're better off spending the money just on temperature sensors?
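
A sketch of a leave-one-feature-out loop, reusing the train/test split from the notebook:

# Drop one feature at a time and compare test accuracy
# against the full four-feature model.
all_feats = ['temp_min', 'temp_max', 'precipitation', 'wind']
for dropped in all_feats:
    kept = [f for f in all_feats if f != dropped]
    model = GaussianNB().fit(X_train[kept], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[kept]))
    print(f"without {dropped:>13}: accuracy {acc:.4f}")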

lagged time series

Models thrive on reading lots of data, and we can generate additional feature data from the original spreadsheet. You have four years of daily observations, which is very nice. At inference time the model is told only today's observations.

In a more realistic setting we might know what the last few days of weather have been, and we're asked to predict whether today is sunny. Essentially, "do the observations predict a change from yesterday's sunniness?"

To accomplish this you can produce a 1-day lagged time series having features you've found to be informative, such as yesterday's temperature, plus a column for whether yesterday was sunny. No, this isn't data leakage. You might also find that some modeling techniques do better if you digest the features slightly, like adding a delta_max_temp column which shows how many degrees warmer it is today compared to yesterday.

Adding in 2-day lagged features might also produce a little lift in model performance.
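
A sketch of building the 1-day lag with pandas shift (assuming the rows are sorted by date; the column names are illustrative, matching the delta_max_temp idea above):

# Shift by one row so each day carries "yesterday's" values.
# Rows must be in date order for shift(1) to mean the previous day.
df_lag = df_vis.sort_values('date').copy()
df_lag['temp_max_yesterday'] = df_lag['temp_max'].shift(1)
df_lag['weather_yesterday'] = df_lag['weather'].shift(1)
df_lag['delta_max_temp'] = df_lag['temp_max'] - df_lag['temp_max_yesterday']
df_lag = df_lag.dropna()  # the first day has no yesterday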

alternate modeling techniques

The great thing about sklearn is that it offers so many models that expose the same public API. Consider pitting your current model against a newly trained SVM, or perhaps a logistic regressor.
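
Since the estimators share fit/predict, the comparison is only a few lines (a sketch with default hyperparameters; note that SVC and logistic regression usually benefit from feature scaling, e.g. a StandardScaler in a Pipeline, before you draw conclusions):

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Identical fit/predict interface, so models are drop-in replacements.
for model in (GaussianNB(), SVC(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__:>20}: accuracy {acc:.4f}")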

globals

Jupyter notebooks can be very convenient -- they let you tinker without re-reading a giant dataset or having to spend time re-training. But the downside is you tend to wind up with tons of module-level variables, which produces undesirably high coupling.

Consider using def to define helper functions which produce e.g. df_vis. The nice thing about a return df_vis statement is that all of the function's local temp variables disappear once they're out of scope. So you don't have to spend brain cycles worrying about them any more.

While reading your notebook I found myself vertically scrolling quite a bit, asking what the provenance of this or that variable was, "where did it come from?" and "was aggregation applied to it?".
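
For instance, the visualization prep from the notebook could become (a sketch; the function name is illustrative):

def make_vis_frame(df):
    """Return a copy of df with parsed dates plus year and month columns."""
    df_vis = df.copy()
    df_vis['date'] = pd.to_datetime(df_vis['date'])
    df_vis['year'] = df_vis['date'].dt.year
    df_vis['month'] = df_vis['date'].dt.month
    return df_vis  # any other locals vanish once we return

df_vis = make_vis_frame(df)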

bin labels

Here is a display nit. The "Distribution of Minimum Temperature" chart is easy to read. Each bin width is clearly 1.25 °C.

The corresponding chart for temp_max is harder to read than it needs to be. Consider hard coding the limits so each bin will have width of 1 °C or a similarly convenient figure.

While you're at it, consider applying the same hard-coded limits to both charts, so they're directly comparable when viewed side by side.
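
A sketch using explicit, shared bin edges (the exact limits here are illustrative; tune them to the observed temperature range):

import numpy as np

# Hard-coded 1 °C bins shared by both histograms so they're comparable.
temp_bins = np.arange(-10, 41, 1)
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='temp_max', bins=temp_bins, kde=True)
plt.show()
# ...then reuse the same temp_bins for the temp_min chart.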

BTW, kudos on carefully labeling everything displayed.

answered May 24 at 18:14
  • Thank you. I will make the necessary adjustments that you have suggested and get back to you. (Commented May 24 at 19:05)
  • I have tried implementing your feedback in a separate Jupyter notebook file. Would you please review my effort? github.com/SteveAustin583/weather-prediction-ml/blob/main/… (Commented May 24 at 19:28)
  • I have created a separate question: codereview.stackexchange.com/questions/297195/… Let me know if I have managed to implement your feedback correctly. (Commented May 25 at 16:50)
