I have made a basic HDBSCAN model (picture output below), but I need to figure out names for the individual clusters. Is there a way I can get something like a decision tree, or the defining parameters for each cluster, so that I can classify these clusters properly? For example: cluster -1 has to have COMP_2_Pass = 1 and REBOUND_3 < -2300, or something that shows filtering logic like this. Thanks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umap.umap_ as umap
import hdbscan
from sklearn.preprocessing import StandardScaler
#Drop label column
X = df_cleaned.drop(columns=['Failure_area'])
#Scale features
X_scaled = StandardScaler().fit_transform(X)
#UMAP for 2D projection
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X_scaled)
# HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=40, min_samples=55)
cluster_labels = clusterer.fit_predict(X_scaled)
# Add results to DataFrame
df_clustered = df_cleaned.copy()
df_clustered['HDBSCAN_cluster'] = cluster_labels
from matplotlib.colors import ListedColormap
# Get unique cluster labels
unique_labels = np.unique(cluster_labels)
n_clusters = len(unique_labels)
# Create a distinct color map with one color per cluster;
# the first color (which maps to the noise cluster, -1) is fixed to blue
colors = ['#0096FF'] + [plt.cm.tab20(i) for i in range(1, n_clusters)]
cmap = ListedColormap(colors)
#cmap = plt.cm.get_cmap('tab20', n_clusters) # Use tab20, tab10, or other qualitative colormaps
# Plot
plt.figure(figsize=(10, 7))
scatter = plt.scatter(
    X_umap[:, 0], X_umap[:, 1],
    c=cluster_labels,
    cmap=cmap,
    s=10
)
plt.title("UMAP + HDBSCAN Clustering of Entire Dataset")
plt.xlabel("UMAP1")
plt.ylabel("UMAP2")
plt.grid(True)
# Add colorbar with correct ticks
cbar = plt.colorbar(scatter, ticks=unique_labels)
cbar.set_label('Cluster ID')
plt.show()
print(df_clustered['HDBSCAN_cluster'].value_counts().sort_values(ascending=False))
Comments:

"Since this is a 2D plot, why not just name each one by the centroid of the cluster? For example, cluster 0 in the sample image could be (-7, 1), cluster 2 could be (9, 0) and so on." – BitsAreNumbersToo, Jul 8, 2025

"It is not about just naming them. I am trying to classify items by how they perform in a dataset, and this may have done that for me. I need to see how it is classifying these to verify that." – Garrett, Jul 8, 2025
1 Answer
You could aggregate the samples by cluster label, and use that aggregation to report the mean of each feature per cluster. Also consider how variable each feature is within a cluster (i.e. report the IQR or coefficient of variation alongside the mean/median).
For a toy dataset, I produce this visualisation coloured by cluster label:
I add the cluster labels to the original dataframe, and then visualise the median feature value per cluster. I am using the same axis scaling across clusters in order to easily compare between them.
This suggests that cluster 0 is mainly younger males (low age, sex < 0) who have below-average clinical measures. Cluster 1 is average-age males who have typical clinical measures apart from decreased s3 and elevated s4. Clusters 2 and 3 pick out subgroups of females.
I do not report averaged statistics for the 'outlier' points (cluster label = -1), since they don't belong to any particular cluster.
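As a minimal sketch of such a spread-aware summary, the snippet below reports the mean, median, and IQR per cluster. It uses randomly generated stand-in data shaped like the question's `df_clustered` (the feature names `feat_a`/`feat_b` are illustrative, not from the original data):

```python
import pandas as pd
import numpy as np

# Stand-in for df_clustered from the question (column names are illustrative)
rng = np.random.default_rng(0)
df_clustered = pd.DataFrame({
    "feat_a": rng.normal(size=100),
    "feat_b": rng.normal(size=100),
    "HDBSCAN_cluster": rng.integers(-1, 3, size=100),
})

def iqr(s):
    """Interquartile range: a spread measure robust to outliers."""
    return s.quantile(0.75) - s.quantile(0.25)

summary = (
    df_clustered
    .query("HDBSCAN_cluster != -1")      # exclude noise points
    .groupby("HDBSCAN_cluster")
    .agg(["mean", "median", iqr])        # per-cluster centre and spread
)
print(summary.round(3))
```

Features with a small IQR within a cluster are the ones that characterise it most sharply, which helps when turning the summary into a human-readable cluster name.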
Reproducible example
Imports and load a standardised dataset:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from umap import UMAP
from sklearn.datasets import load_diabetes
from sklearn.cluster import HDBSCAN
from matplotlib.colors import ListedColormap
import glasbey
# ---- Dataset for testing ----
features_df, _ = load_diabetes(return_X_y=True, as_frame=True)
#features_df is already scaled
X = features_df.to_numpy()
Fit a clusterer (HDBSCAN) and prepare for visualisation:
# ---- Down-project and fit clusterer ----
n_neighbors = 3 #Take a very local view. NB somewhat contrived for this toy example.
proj_5d = UMAP(n_components=5, n_neighbors=n_neighbors, min_dist=0, random_state=0).fit_transform(X)
labels = HDBSCAN(min_cluster_size=15).fit_predict(proj_5d)
unique_labels = np.unique(labels)
# ---- Down-project for visualisation ----
proj_2d = UMAP(n_neighbors=n_neighbors, random_state=15).fit_transform(X)
Plot samples in 2D, coloured by predicted cluster:
# --- View projection and clusterings ----
#Create custom-length discrete colormap having visually distinctive colors
cmap = ListedColormap(
glasbey.create_palette(palette_size=len(unique_labels))
)
f, ax = plt.subplots(figsize=(7, 5))
#Scatter data using 2D UMAP, coloured by cluster label
scatt = ax.scatter(*proj_2d.T, marker='.', c=labels, alpha=0.9, cmap=cmap)
#Colorbar
cbar = f.colorbar(mappable=scatt, label='HDBSCAN cluster label')
cbar.ax.set_yticks(unique_labels)
#Formatting
ax.spines[:].set_visible(False)
ax.grid(linestyle=':', color='lightgray')
ax.tick_params(axis='both', left=False, bottom=False)
ax.set(xlabel='$UMAP^{2D}$ 1', ylabel='$UMAP^{2D}$ 2')
Aggregate samples based on cluster, and visualise averaged feature characteristics:
# ---- Compare the averaged characteristics of clusters ----
(
features_df
.assign(cluster_label=labels)
.query('cluster_label != -1')
.groupby('cluster_label').median()
.T
.style
.format(precision=3)
.bar(color='olive', axis=1, align='zero')
)