Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Pedagogy: add 15_optimization/038 — optimization on a histogram surrogate (classification example) #425

Open
Labels
pedagogyNotebook content / teaching changes

Description

Summary

Build 15_optimization/038_optimization_on_histogram_surrogate.ipynb as a companion to the existing 030_Classification_Optimization.ipynb. The new notebook teaches a general numerical-methods pattern: when iterative parameter optimization is expensive over $N$ raw data points, replace the per-point sum in the cost function with a per-cell sum over a multi-dimensional histogram of the feature space, optimize on the surrogate, then verify the final result on the full dataset.

The pattern was used in Ch. 6 of:

Lee, K. (2004). Longitudinal Driver Model and Collision Warning and Avoidance Algorithms Based on Human Driving Databases. Ph.D. dissertation, Mechanical Engineering, University of Michigan.

There, two 5D histograms reduced 3.4 M data points (~400 MB) to ~1.9 MB while approximating the confusion matrix used by fmincon with 1000 random initial conditions. The same pattern transplants cleanly to a 2D classification example for teaching.

Pedagogical arc

  • Motivate the problem: iterative optimization (multi-start, hyperparameter sweeps) makes the per-point cost-function scan the bottleneck
  • Build per-class histograms from a 2D training set (two overlapping Gaussians, ~10ドル^6$ points) — visualize side-by-side heatmaps analogous to the dissertation's Fig. 6.2 "Threatening" / "Safe" panels
  • Surrogate confusion matrix: $\widehat{TP}, \widehat{FP}, \widehat{TN}, \widehat{FN}$ as weighted sums over cell centroids (the algebra of Eqs. 38–39 in Ch. 6, restated for classification)
  • Train a logistic regression model two ways — on raw data vs on the surrogate — and overlay the decision boundaries
  • Bias-vs-speed tradeoff via an ipywidgets slider on grid resolution $n \times n$, with a live Pareto plot
  • Verify-on-truth step: re-evaluate the surrogate-trained model point-by-point on the full dataset; tabulate the confusion matrix both ways
  • Train / evaluation split for overfitting check (mirrors Ch. 6 §6.1's split-half analysis)

Scope decisions

  • Classifier: logistic regression (clean weighted log-loss on the surrogate) — likely. SVM is more visual but heavier.
  • Grid: uniform for the first notebook. Log-spaced (per dimension) and quantile-spaced are mentioned as alternatives but not implemented in 038.
  • Cost function: choose one of (a) the geometric mean of TP and precision, mirroring Ch. 6 Eq. 40, or (b) sklearn-style weighted log-loss. Signpost the other in the closing.

Curricular hooks

  • Bridges 20_probability/ histograms (per-class joint densities) and 15_optimization/ (optimization on a surrogate)
  • Direct follow-on to 030_Classification_Optimization.ipynb
  • Train-on-surrogate, verify-on-truth pattern is a general engineering-numerical-methods lesson reusable elsewhere

Out of scope (parked as future threads)

  • Adaptive $C$-groups (k-means coreset) — extends the rectangular grid to data-adaptive cells, the move that escapes the curse of dimensionality. Would live as a sibling notebook (~039) once 038 lands.
  • Soft-assignment / Gaussian-mixture surrogate
  • Boundary-aware adaptive refinement (cluster misclassified points after a first pass)
  • Sufficient-statistics-per-cell extension (Bradley, Fayyad, Reina 1998), the high-D successor to histograms

Metadata

Metadata

Assignees

No one assigned

    Labels

    pedagogyNotebook content / teaching changes

    Projects

    No projects

    Milestone

    No milestone

      Relationships

      None yet

      Development

      No branches or pull requests

      Issue actions

        AltStyle によって変換されたページ (->オリジナル) /