Pedagogy: add 15_optimization/038 — optimization on a histogram surrogate (classification example) #425

Open

Labels

pedagogy

@kwlee2025cpp

Description

@kwlee2025cpp

kwlee2025cpp

opened

on May 12, 2026

Summary

Build 15_optimization/038_optimization_on_histogram_surrogate.ipynb as a companion to the existing 030_Classification_Optimization.ipynb. The new notebook teaches a general numerical-methods pattern: when iterative parameter optimization is expensive over $N$ raw data points, replace the per-point sum in the cost function with a per-cell sum over a multi-dimensional histogram of the feature space, optimize on the surrogate, then verify the final result on the full dataset.

The pattern was used in Ch. 6 of:

Lee, K. (2004). Longitudinal Driver Model and Collision Warning and Avoidance Algorithms Based on Human Driving Databases. Ph.D. dissertation, Mechanical Engineering, University of Michigan.

There, two 5D histograms reduced 3.4 M data points (~400 MB) to ~1.9 MB while approximating the confusion matrix used by fmincon with 1000 random initial conditions. The same pattern transplants cleanly to a 2D classification example for teaching.

Pedagogical arc

Motivate the problem: iterative optimization (multi-start, hyperparameter sweeps) makes the per-point cost-function scan the bottleneck
Build per-class histograms from a 2D training set (two overlapping Gaussians, ~10ドル^6$ points) — visualize side-by-side heatmaps analogous to the dissertation's Fig. 6.2 "Threatening" / "Safe" panels
Surrogate confusion matrix: $\widehat{TP}, \widehat{FP}, \widehat{TN}, \widehat{FN}$ as weighted sums over cell centroids (the algebra of Eqs. 38–39 in Ch. 6, restated for classification)
Train a logistic regression model two ways — on raw data vs on the surrogate — and overlay the decision boundaries
Bias-vs-speed tradeoff via an ipywidgets slider on grid resolution $n \times n$, with a live Pareto plot
Verify-on-truth step: re-evaluate the surrogate-trained model point-by-point on the full dataset; tabulate the confusion matrix both ways
Train / evaluation split for overfitting check (mirrors Ch. 6 §6.1's split-half analysis)

Scope decisions

Classifier: logistic regression (clean weighted log-loss on the surrogate) — likely. SVM is more visual but heavier.
Grid: uniform for the first notebook. Log-spaced (per dimension) and quantile-spaced are mentioned as alternatives but not implemented in 038.
Cost function: choose one of (a) the geometric mean of TP and precision, mirroring Ch. 6 Eq. 40, or (b) sklearn-style weighted log-loss. Signpost the other in the closing.

Curricular hooks

Bridges 20_probability/ histograms (per-class joint densities) and 15_optimization/ (optimization on a surrogate)
Direct follow-on to 030_Classification_Optimization.ipynb
Train-on-surrogate, verify-on-truth pattern is a general engineering-numerical-methods lesson reusable elsewhere

Out of scope (parked as future threads)

Adaptive $C$-groups (k-means coreset) — extends the rectangular grid to data-adaptive cells, the move that escapes the curse of dimensionality. Would live as a sibling notebook (~039) once 038 lands.
Soft-assignment / Gaussian-mixture surrogate
Boundary-aware adaptive refinement (cluster misclassified points after a first pass)
Sufficient-statistics-per-cell extension (Bradley, Fayyad, Reina 1998), the high-D successor to histograms

Metadata

Assignees

No one assigned

Labels

pedagogy

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pedagogy: add 15_optimization/038 — optimization on a histogram surrogate (classification example) #425

Description

Summary

Pedagogical arc

Scope decisions

Curricular hooks

Out of scope (parked as future threads)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions