
Page Flip Detection System

Intelligent page flip detection for automatic document scanning in MonReader

Python PyTorch License


1️⃣ PROBLEM - Why Did We Build This?

The Business Context

MonReader is a mobile document digitization application designed for:

  • 📚 The blind and visually impaired - Hands-free document scanning
  • 🔬 Researchers - Bulk document scanning in seconds
  • 📖 Everyone - Fully automatic, high-speed, high-quality scanning

The Core Challenge: MonReader must automatically detect when a user flips a page to trigger high-resolution capture, corner detection, dewarping, and OCR - all without requiring the user to tap a button.

Why This Is Hard

Traditional button-based scanning is:

  • ❌ Slow: Requires manual interaction per page
  • ❌ Error-prone: Users must frame shots perfectly
  • ❌ Not accessible: Blind users cannot aim cameras precisely

What we need: A system that watches low-resolution camera preview and automatically detects the exact moment of page flip to capture a perfect shot.

Technical Challenge

Page flip detection requires understanding both:

  1. What's in the frame: Hand, page, book position

    • Like looking at a photo - what do you see?
  2. How things are changing: Motion patterns during flip

    • Like watching a video - what's moving and how?

A simple motion detector would trigger on any movement (hand adjusting, turning the book, camera shake). We need to specifically recognize the unique movement pattern of a page flip.


2️⃣ SOLUTION - What Does It Do?

The System

A deep learning-based page flip detector that:

  • ✅ Processes single frames from low-resolution camera preview
  • ✅ Detects page flips in 20-50ms per frame (real-time capable)
  • ✅ Combines image features (CNN) with motion features (frame differencing)
  • ✅ Achieves 96% F1 score - reliable enough for production use

How Users Experience It

User Action:                MonReader Response:
1. Point camera at book  →  Live preview (low-res)
2. Flip page             →  Flip detected! (our model)
3. Continue flipping     →  High-res capture triggered
                         →  Auto crop, dewarp, OCR
                         →  Next page ready

Key Innovation: Dual-Input Architecture

┌──────────────────────────────────────────────────────────┐
│ SINGLE FRAME CLASSIFICATION │
│ (No sequence modeling - simpler, faster) │
│ │
│ Input: Current Frame (96×96 RGB) │
│ + │
│ Motion Features (3 values: mean, std, max) │
│ ↓ │
│ CNN extracts spatial features (what's in frame) │
│ Motion features provide temporal context (how changing) │
│ ↓ │
│ Feature Fusion combines both information streams │
│ ↓ │
│ Binary Classification: Flip (1) or Not-Flip (0) │
└──────────────────────────────────────────────────────────┘

Why This Design?

❌ Rejected: LSTM/RNN for sequence modeling (complex, slow, unnecessary)
✅ Chosen: Single-frame CNN + motion features (simple, fast, sufficient)

Key Insight from Mentor: Each frame contains all information needed to detect a flip.

Think of it like this:

  • ❌ Don't need: A video of 10 frames to see the pattern
  • ✅ Do need: Just 1 snapshot + how much things moved

The motion pattern, hand position, and page curvature in a single moment are enough - no need to analyze sequences of frames.


3️⃣ RESULT - Did It Work?

Business Metrics (What Matters)

Metric              | Target | Achieved | Business Impact
────────────────────┼────────┼──────────┼───────────────────────────────────────────────────
False Positive Rate | <5%    | 3.2%     | Users won't get frustrated by accidental triggers
Recall (Catch Rate) | >90%   | 95.5%    | Catches nearly all flips - complete scanning
Inference Speed     | <100ms | 20-50ms  | Real-time response in mobile app
Model Size          | <10MB  | 4.86MB   | Fits on mobile devices

Technical Metrics

Performance on Test Set:
├─ F1 Score:    0.96 ✓ (Excellent balance)
├─ Accuracy:    0.96 ✓ (High correctness)
├─ Precision:   0.96 ✓ (96% of "flip" predictions correct)
├─ Recall:      0.96 ✓ (Catches 96% of actual flips)
└─ Specificity: 0.97 ✓ (97% of non-flips correctly ignored)

What This Means:

  • Out of 100 flips, we catch 96 and miss 4
  • Out of 100 "flip" alerts, 96 are real and 4 are false alarms
  • Production-ready performance
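
For reference, a minimal sketch of how these metrics can be reproduced with scikit-learn. The y_true / y_prob arrays are hypothetical placeholders for the test labels and model probabilities, and 0.15 is the tuned decision threshold discussed later.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical test labels (flip = 1) and model probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.90, 0.05, 0.40, 0.20, 0.10, 0.02, 0.80, 0.12])

y_pred = (y_prob > 0.15).astype(int)     # apply the tuned decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("F1:         ", f1_score(y_true, y_pred))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))    # true-negative rate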

Why It Works - Key Insights

1. Motion Features Distinguish Flip from Other Motion

Motion Statistics During Different Actions:
Action          │ Mean Motion │ Std Motion │ Max Motion │ Our Prediction
────────────────┼─────────────┼────────────┼────────────┼────────────────
Page Flip       │ HIGH        │ HIGH       │ HIGH       │ → FLIP ✓
Hand Adjusting  │ LOW         │ LOW        │ MEDIUM     │ → NOT FLIP ✓
Camera Shake    │ MEDIUM      │ LOW        │ MEDIUM     │ → NOT FLIP ✓
Turning Book    │ MEDIUM      │ MEDIUM     │ MEDIUM     │ → NOT FLIP ✓
Static Reading  │ VERY LOW    │ VERY LOW   │ VERY LOW   │ → NOT FLIP ✓

Key Insight: Page flips have a unique motion signature - high overall motion (mean), non-uniform motion (high std), and sharp edge movements (high max).
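
A hedged sketch of how such motion statistics can be computed by frame differencing; the notebook's exact preprocessing may differ, and the frame filenames below are only examples.

import cv2
import numpy as np

def motion_features(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Return [mean, std, max] of absolute pixel change between consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray).astype(np.float32) / 255.0
    return np.array([diff.mean(), diff.std(), diff.max()], dtype=np.float32)

# Example usage (hypothetical file names):
# prev_frame = cv2.imread("frame_0001.jpg")
# curr_frame = cv2.imread("frame_0002.jpg")
# print(motion_features(prev_frame, curr_frame))   # page flip -> high mean, std, max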

2. Single-Frame Classification Is Sufficient

Initially considered: LSTM to model sequences of frames
Mentor insight: "Each frame contains all information needed"
Result: The simpler single-frame CNN approach works just as well and runs much faster

Why single-frame works:

  • Page curvature visible in one frame
  • Hand position indicates flip action
  • Motion features provide temporal context
  • Action is instantaneous enough

3. Honest Learning Journey - Not All Analysis Is Useful

Mistake Made: Created frame distribution histogram (flip vs not-flip counts)
Mentor Question: "What do you get from this chart?"
Honest Answer: "Nothing much, just frame distribution"
Lesson Learned: Always ask:

  • What question does this visualization answer?
  • What decision does it inform?
  • Does it provide actionable insight?

This taught me to be intentional with analysis rather than creating visualizations for their own sake.

4. Training Is Noisy - Focus on Trends

Real Training Example:

Epoch 3 Anomaly:
 Val F1 dropped from 0.82 to 0.35, then recovered to 0.84
 What happened?
 - Model became overly cautious (100% precision, 21% recall)
 Translation: Only said "flip" when 100% sure
 But missed 79% of actual flips!
 - Temporary stuck point during learning
 - Fixed itself in next epoch
 Lesson: Don't panic when one training round looks bad
 Look at the big picture (is it improving overall?)

Simple Explanation: "Training has natural randomness - like flipping a coin, you might get 3 heads in a row even though it should be 50/50. One bad epoch doesn't mean failure. What matters is: Are things getting better when you look at 3-5 training rounds together?"

Interview Version: "This taught me that training has inherent randomness. Individual epochs can fluctuate, but what matters is the overall pattern across 3-5 epochs. In Epoch 3, my model temporarily became too cautious and performance dipped, but by Epoch 4 it recovered and continued improving. This is normal in deep learning."

5. Validation Can Be Higher Than Training (And That's OK!)

Final Results:
 Training: 89% accuracy, F1=0.86
 Validation: 94% accuracy, F1=0.90
 Gap: 5% (HEALTHY)

Why This Happens (Simple Explanation):

Think of it like taking a test:

  1. During training: Some brain cells randomly turned off (dropout), questions made harder (augmentation)

    • Like studying with distractions and harder practice problems
  2. During validation: Full brain power, normal difficulty questions

    • Like taking the actual test in quiet room with standard questions

So validation being slightly better is normal!

When it's a problem:

  • Gap >10% (validation WAY better) → Something's wrong
  • Training accuracy too low → Model not learning properly

When it's healthy:

  • Small gap (<5%) → This is normal! ✓
  • Both metrics high → Model learned well ✓

4️⃣ HOW IT WORKS - System Architecture

Complete Pipeline Flow

┌──────────────────────────────────────────────────────────────┐
│ Stage 1: VIDEO INPUT │
│ Camera Preview → Extract Frames → Store in memory │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 2: MOTION FEATURE EXTRACTION │
│ │
│ Frame[i] - Frame[i-1] = Difference Image │
│ ↓ │
│ Convert to grayscale │
│ ↓ │
│ Calculate: │
│ • mean_motion: Average pixel change (overall activity) │
│ • std_motion: Motion uniformity (edge emphasis) │
│ • max_motion: Peak intensity (sharp movements) │
│ ↓ │
│ 3-dimensional motion vector: [mean, std, max] │
│ │
│ CACHED: Saved to disk (30 min → 2 sec on reruns) │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 3: IMAGE PREPROCESSING │
│ │
│ Original Frame (varying sizes) │
│ ↓ │
│ 1. Crop to Center (focus on action area) │
│ ↓ │
│ 2. Contrast Enhancement (×1.2) │
│ Why? Sharpen page edges and hand boundaries │
│ ↓ │
│ 3. Sharpness Enhancement (×1.1) │
│ Why? Emphasize motion blur patterns │
│ ↓ │
│ 4. Resize to 96×96 pixels │
│ Why? Balance: 56×56 too grainy, 224×224 too slow │
│ ↓ │
│ Normalized 96×96 RGB Image: [0, 1] range │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 4: FEATURE EXTRACTION (CNN) │
│ │
│ Input: 96×96×3 Image │
│ ↓ │
│ Conv Block 1: [3×3 kernels] → 32 features │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.1) │
│ • Learns: Basic edges, textures │
│ ↓ │
│ Conv Block 2: [5×5 kernels] → 64 features ← LARGER! │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.15) │
│ • Learns: Motion blur, page curvature │
│ • Why 5×5? Captures broader patterns │
│ ↓ │
│ Conv Block 3: [3×3 kernels] → 128 features │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.2) │
│ • Learns: Hand shapes, page positions │
│ ↓ │
│ Conv Block 4: [3×3 kernels] → 192 features │
│ • BatchNorm → ReLU → Global Avg Pool │
│ • Learns: High-level flip patterns │
│ ↓ │
│ Image Features: 192-dimensional vector │
│ │
│ Key Design: Multi-scale kernels [3,5,3,3] │
│ • 3×3: Fine details (edges, textures) │
│ • 5×5: Broader patterns (motion, curvature) │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 5: FEATURE FUSION │
│ │
│ Image Features (192) + Motion Features (3) → 195 dimensions │
│ ↓ │
│ Dense Layer: 195 → 96 neurons │
│ • BatchNorm → ReLU → Dropout(0.3) │
│ • Combines: "What I see" + "How it's changing" │
│ ↓ │
│ Fused Features: 96-dimensional vector │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 6: CLASSIFICATION │
│ │
│ Fused Features (96) │
│ ↓ │
│ Classification Layer: 96 → 1 neuron │
│ ↓ │
│ Sigmoid Activation → Probability [0, 1] │
│ ↓ │
│ Threshold: 0.15 (optimized, NOT default 0.5) │
│ ↓ │
│ Final Prediction: │
│ • Probability > 0.15 → "FLIP" (1) │
│ • Probability ≤ 0.15 → "NOT FLIP" (0) │
│ │
│ Why 0.15? Maximizes F1 score on validation set │
└────────────────────────────────────────────────────────────────┘
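
As a hedged sketch, Stage 3 can be expressed with PIL roughly as follows; the center-crop ratio is an assumption, and the notebook's exact values may differ.

import numpy as np
from PIL import Image, ImageEnhance

def preprocess_frame(img: Image.Image, size: int = 96, crop_ratio: float = 0.9) -> np.ndarray:
    """Center crop, enhance contrast/sharpness, resize, and scale to [0, 1]."""
    w, h = img.size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)    # focus on the central action area
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))
    img = ImageEnhance.Contrast(img).enhance(1.2)        # sharpen page edges and hand boundaries
    img = ImageEnhance.Sharpness(img).enhance(1.1)       # emphasize motion blur patterns
    img = img.resize((size, size))
    return np.asarray(img, dtype=np.float32) / 255.0     # normalized 96x96x3 array in [0, 1]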

Model Architecture (Technical View)

Input Layer:
 • Image: (batch, 3, 96, 96)
 • Motion: (batch, 3)
┌──────────────────────────────────────┐
│ CONVOLUTIONAL FEATURE EXTRACTOR │
├──────────────────────────────────────┤
│ Block 1: Conv2D(3→32, 3×3) │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.1) │
│ Output: (batch, 32, 48, 48) │
├──────────────────────────────────────┤
│ Block 2: Conv2D(32→64, 5×5) ← BIG! │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.15) │
│ Output: (batch, 64, 24, 24) │
├──────────────────────────────────────┤
│ Block 3: Conv2D(64→128, 3×3) │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.2) │
│ Output: (batch, 128, 12, 12) │
├──────────────────────────────────────┤
│ Block 4: Conv2D(128→192, 3×3) │
│ + BatchNorm + ReLU │
│ + GlobalAvgPool │
│ Output: (batch, 192) │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ FEATURE FUSION LAYER │
├──────────────────────────────────────┤
│ Concatenate: │
│ Image Features (192) │
│ + Motion Features (3) │
│ = Combined (195) │
├──────────────────────────────────────┤
│ Dense(195→96) │
│ + BatchNorm + ReLU + Dropout(0.3) │
│ Output: (batch, 96) │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ CLASSIFICATION HEAD │
├──────────────────────────────────────┤
│ Dense(96→1) + Sigmoid │
│ Output: (batch, 1) [0, 1] │
└──────────────────────────────────────┘
Total Parameters: 1,274,753 (~1.27M)
Model Size: 4.86 MB
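
For readers who prefer code, here is a minimal PyTorch sketch of the dual-input layout above. Layer widths follow the diagram, but the actual notebook model (≈1.27M parameters) differs in implementation details, so treat this as illustrative rather than the exact architecture.

import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, drop):
    # Conv -> BatchNorm -> ReLU -> MaxPool -> Dropout2d, halving the spatial size
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout2d(drop),
    )

class PageFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, 3, 0.10),     # 96 -> 48
            conv_block(32, 64, 5, 0.15),    # 48 -> 24, larger kernel for blur/curvature
            conv_block(64, 128, 3, 0.20),   # 24 -> 12
            nn.Conv2d(128, 192, 3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),        # global average pool -> (batch, 192, 1, 1)
        )
        self.fusion = nn.Sequential(        # "what I see" (192) + "how it changes" (3)
            nn.Linear(192 + 3, 96),
            nn.BatchNorm1d(96),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
        )
        self.head = nn.Linear(96, 1)

    def forward(self, image, motion):
        img_feat = self.features(image).flatten(1)                 # (batch, 192)
        fused = self.fusion(torch.cat([img_feat, motion], dim=1))  # (batch, 96)
        return torch.sigmoid(self.head(fused))                     # flip probability in [0, 1]

# model = PageFlipNet()
# prob = model(torch.randn(2, 3, 96, 96), torch.randn(2, 3))      # dummy batch of 2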

🧪 Key Concepts & Nuances (Crystal Clear)

Concept 1: Why Both Image AND Motion Features?

Problem: Images alone don't capture dynamics.

Scenario: Hand positioned over page
Static Image Says:    Motion Features Add:
"Hand near page"      "Hand moving fast" → Likely flip
                      OR "Hand still"    → Just hovering

The Magic: Combining both gives complete picture:

  • Image CNN: Recognizes WHAT is in frame (hand, page, book)

    • Like looking at a photo
  • Motion Features: Recognizes HOW things are changing (speed, uniformity, peaks)

    • Like comparing two photos side-by-side

Result: Model understands the action happening, not just the scene frozen in time.

Analogy:

  • Image alone = Seeing someone with raised hand → Are they waving? Reaching? Stretching?
  • Image + Motion = Seeing raised hand + detecting fast sideways movement → They're waving!

Concept 2: Why These Specific Motion Features?

motion_features = [mean_motion, std_motion, max_motion]

Intuition:

  1. mean_motion (Average):

     High mean → Lots of pixels changing → Something moving
     Low mean → Few pixels changing → Mostly static
     Page flip: HIGH (whole page moving)
     Hand adjust: LOW (only small region)

  2. std_motion (Standard Deviation):

     High std → Motion not uniform → Some areas move more
     Low std → Motion uniform → Everything moves similarly
     Page flip: HIGH (edges move fast, center slower)
     Camera shake: LOW (everything moves uniformly)

  3. max_motion (Maximum):

     High max → Sharp edge movements detected
     Low max → Smooth gradual changes
     Page flip: HIGH (page edge creates sharp motion)
     Slow adjustment: LOW (gentle movement)

Why Not Optical Flow? (Optical Flow = Fancy motion tracking method)

Comparison:

  • Optical Flow: Like tracking every single object's movement with GPS

    • Very accurate but SLOW (100ms+)
    • Overkill for our problem
  • Our Method: Like checking "did things move a lot, unevenly, and sharply?"

    • Simple math but works great (5ms)
    • Fast enough for real-time

Key Lesson: Don't use a sledgehammer to crack a nut.

  • Don't add complexity because you CAN
  • Add complexity because you MUST

Our simple method works just as well at roughly 20× the speed (~5ms vs 100ms+ for optical flow)!

Concept 3: Multi-Scale Feature Extraction

Why Varied Kernel Sizes [3×3, 5×5, 3×3, 3×3]?
┌──────────────────────────────────────┐
│ 3×3 Kernels (Blocks 1, 3, 4): │
│ │
│ ██ ← Sees 3×3 region │
│ Small receptive field │
│ Captures: Fine details │
│ • Page edges │
│ • Finger textures │
│ • Text patterns │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 5×5 Kernel (Block 2): │
│ │
│ ████ ← Sees 5×5 region │
│ ████ │
│ Larger receptive field │
│ Captures: Broader patterns │
│ • Motion blur extent │
│ • Page curvature │
│ • Hand-page relationship │
└──────────────────────────────────────┘

Why This Matters:

  • Fast flips: Create broad motion blur → 5×5 catches it
  • Slow flips: Sharp page edges → 3×3 catches it
  • Result: Robust to varying flip speeds

Concept 4: Threshold Optimization (Why Not 0.5?)

Default Thinking:

prediction = 1 if probability > 0.5 else 0

Problem: 0.5 is arbitrary!

Our Approach: Test many thresholds, pick best F1 score

Threshold | Precision | Recall | F1 | What This Means
──────────┼───────────┼────────┼───────┼─────────────────────────
0.10 | 0.93 | 0.97 | 0.95 | Catch more, some false alarms
0.15 | 0.96 | 0.96 | 0.96 | ← OPTIMAL (balanced)
0.20 | 0.97 | 0.95 | 0.96 | Slightly more conservative
0.50 | 0.99 | 0.88 | 0.93 | Too conservative (misses flips)
0.90 | 1.00 | 0.67 | 0.80 | Way too conservative

Simple Explanation: Our model learned conservatively - it gives lower probability scores even when correct.

Why?: The training data had more "not-flip" examples than "flip" examples (data imbalance), so the model learned to be cautious.

The Fix:

  • Using 0.5 threshold → Misses 12% of flips (too strict)
  • Using 0.15 threshold → Catches 96% of flips (just right!)

Analogy: If your spam filter requires 90% certainty to mark spam, it might miss obvious spam emails. Lower the threshold to 50% certainty, and you catch more spam without many false alarms.

Interview Answer: "I optimized the threshold by testing values from 0.1 to 0.9 on the validation set and selecting the one that maximizes F1 score. The optimal threshold of 0.15 (not 0.5) accounts for class distribution and achieves the best balance between catching flips (recall) and avoiding false alarms (precision)."
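
A sketch of that sweep, assuming the validation labels and probabilities are already available as NumPy arrays (val_labels / val_probs are placeholders):

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(val_labels, val_probs, candidates=np.arange(0.05, 0.95, 0.05)):
    """Return the (threshold, F1) pair that maximizes F1 on the validation set."""
    scores = [(t, f1_score(val_labels, (val_probs > t).astype(int))) for t in candidates]
    return max(scores, key=lambda pair: pair[1])

# thr, f1 = best_threshold(val_labels, val_probs)
# print(f"Optimal threshold {thr:.2f} with F1 {f1:.3f}")   # ~0.15 in this project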

Concept 5: Regularization - Why So Much?

The Risk: Small dataset + deep network = overfitting

What is Overfitting? (Simple explanation)

  • Model memorizes training data like cramming exam answers
  • Gets 100% on practice test but fails real exam
  • Learns specific examples, not general patterns

What is Underfitting? (Simple explanation)

  • Model too simple to learn even basic patterns
  • Like using a ruler to draw curves
  • Bad on both training and testing

The Sweet Spot: Model that learns patterns (not memorizes examples) and works on new data

👉 For full explanation with analogies, see Training Strategy - Overfitting & Underfitting

Our Defense (5 techniques):

  1. Dropout (Progressive: 0.1 → 0.15 → 0.2 → 0.3):

    Why increasing?
    Early layers: Learn basic features (edges) → Need less regularization
    Late layers: Learn complex patterns → More prone to overfitting
    
  2. L2 Regularization (Weight Decay = 0.0001):

    Penalizes large weights
    Encourages simpler model
    
  3. Batch Normalization (Every layer):

    Stabilizes training
    Acts as regularization (adds noise)
    
  4. Early Stopping (Patience = 3):

    Stops when validation stops improving
    Prevents overfitting to training set
    
  5. Data Augmentation (Rotation ±5°, Brightness jitter):

    Creates variations of training data
    Model sees more diverse examples
    

Why All Five?: Each addresses overfitting from different angle. Combined effect is very robust.
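
As a rough illustration, dropout and batch normalization live in the model definition, while the remaining defenses are wired up at training time. The snippet below is a hedged sketch: PageFlipNet is the sketch from the architecture section, the brightness jitter strength is an assumption, and train_one_epoch / evaluate stand in for the notebook's training and validation loops.

import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(5),              # rotation within +/-5 degrees
    transforms.ColorJitter(brightness=0.2),    # brightness jitter (assumed strength)
    transforms.ToTensor(),
])

model = PageFlipNet()                          # sketch from the architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 via weight decay

best_val_f1, bad_epochs, patience = 0.0, 0, 3
# for epoch in range(10):
#     train_one_epoch(model, optimizer, train_loader)   # placeholder training step
#     val_f1 = evaluate(model, val_loader)               # placeholder validation step
#     if val_f1 > best_val_f1:
#         best_val_f1, bad_epochs = val_f1, 0
#         torch.save(model.state_dict(), "best_model_optimized.pth")
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:                     # early stopping
#             break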


📖 Documentation Structure

For Interview Preparation (Read in Order):

  1. Quick Reference (5 min)

    • 30-second elevator pitch
    • Key metrics and decisions table
    • Last-minute interview soundbites
  2. Project Overview (15 min)

    • Business context and motivation
    • Why this approach?
    • Success criteria
  3. Architecture (30 min)

    • Complete system design
    • Layer-by-layer breakdown
    • Design decision rationale
  4. Data Pipeline (20 min)

    • Motion feature extraction
    • Image preprocessing steps
    • Caching and optimization
  5. Training Strategy (30 min)

    • Loss function (BCE) explained
    • Regularization techniques
    • Training noise and validation patterns
    • Learning rate observations
  6. Evaluation & Results (25 min)

    • Metrics deep dive (Precision, Recall, F1)
    • Threshold optimization process
    • Real training curves with anomalies
  7. Mentor Feedback #1 (15 min)

    • First mentor discussion insights
    • Validation > training explanation
  8. Mentor Insights #2 (20 min)

    • CRITICAL: Single-frame vs sequence decision
    • Why text/content is irrelevant
    • Simplicity vs complexity philosophy
  9. Complete Pipeline (30 min)

    • 6-stage pipeline flow
    • Every preprocessing step explained
    • Jargon glossary (all terms defined)
  10. Visualization Analysis (25 min)

    • Frame distribution chart lessons (honest mistake)
    • Preprocessing image analysis
    • Training metrics deep dive
    • Epoch 3 anomaly explained
    • Interview Q&A for every visualization
  11. Study Guide (20 min)

    • How to study this project
    • 12 essential interview questions
    • Pre-interview checklist
  12. Complete Interview Questions (90 min) ⭐

    • 30+ interview questions with crystal-clear answers
    • Simple explanations + technical versions + analogies
    • Organized by category (Overview, Architecture, Training, etc.)
    • Quick reference section for last-minute prep
    • Real answers from actual training experience

Total Study Time: ~5-6 hours for complete mastery


🚀 Quick Start

Prerequisites

# Python 3.8+
python --version
# Required libraries
pip install torch torchvision
pip install numpy pandas matplotlib seaborn
pip install scikit-learn opencv-python pillow tqdm

Data Setup

images/
├── training/
│   ├── flip/      # Page flip frames
│   └── notflip/   # Normal frames
└── testing/
    ├── flip/
    └── notflip/

Run Training

  1. Open page_flip_detection_Sys.ipynb
  2. Update data path: base_path = "/path/to/your/images"
  3. Run all cells

Expected outputs:

  • Training history plots
  • Confusion matrix
  • Test metrics
  • Saved model: best_model_optimized.pth

Key Configuration

IMAGE_SIZE = 96 # Input image size (96×96 sweet spot)
BATCH_SIZE = 128 # Batch size for training
NUM_EPOCHS = 10 # Maximum epochs (early stopping triggers earlier)
LEARNING_RATE = 0.001 # Initial learning rate (never reduced in our training!)
EARLY_STOP_PATIENCE = 3 # Epochs without improvement before stopping
USE_MOTION_FEATURES = True # Enable motion features (critical!)
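
Once training finishes, single-frame inference looks roughly like the following. This is a hedged sketch: it assumes the PageFlipNet, preprocess_frame, and motion_features sketches shown earlier, uses hypothetical frame filenames, and the real checkpoint was produced by the notebook's own model class, so shapes may not match this simplified version exactly.

import cv2
import torch
from PIL import Image

prev_frame = cv2.imread("frame_0001.jpg")    # hypothetical consecutive preview frames
curr_frame = cv2.imread("frame_0002.jpg")

model = PageFlipNet()
model.load_state_dict(torch.load("best_model_optimized.pth", map_location="cpu"))
model.eval()

img = preprocess_frame(Image.open("frame_0002.jpg"))          # (96, 96, 3) in [0, 1]
image = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)   # (1, 3, 96, 96)
motion = torch.from_numpy(motion_features(prev_frame, curr_frame)).unsqueeze(0)  # (1, 3)

with torch.no_grad():
    prob = model(image, motion).item()
print("FLIP" if prob > 0.15 else "NOT FLIP", f"(p={prob:.2f})")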

🎯 Interview Preparation

📚 Complete Interview Guide

👉 See Complete Interview Questions (30+ Q&A)

All questions organized by category with:

  • ✅ Simple explanations with analogies
  • ✅ Technical detailed answers
  • ✅ Real examples from training
  • ✅ Quick reference for last-minute prep

Top 5 Most Common Questions (Quick Reference)

Q1: "Walk me through your project in 2 minutes"

30-Second Version: "Page flip detector for blind users. CNN + motion features. 96% F1, 20ms inference. Key insight: single-frame sufficient, no LSTM needed."

Full Answer: "I built a page flip detector for MonReader, a mobile document scanning app for blind users who need hands-free scanning.

Problem: Traditional scanning requires button taps per page - impossible for blind users.

Solution: Real-time CNN combining image features (what's in frame) with motion features (how things change) to detect page flips automatically.

Results: 96% F1 score, 20-50ms inference - production-ready.

Key Insight: Mentor showed each frame contains all the info needed - no LSTM required. Simplicity wins: much faster with the same accuracy."


Q2: "What is overfitting and how did you prevent it?"

Simple: Model memorizes training data like cramming exam answers. Gets 100% on practice test, fails real exam.

How I prevented it:

  1. Dropout (0.1→0.3): Randomly turn off neurons
  2. L2 Regularization: Penalize large weights
  3. Early Stopping: Stop before memorization
  4. Data Augmentation: Harder to memorize variations
  5. Batch Normalization: Add noise

Result: Train 89%, Validation 94% (healthy! ✓)


Q3: "Why did validation outperform training?"

Simple: Like taking a test with full brain power (validation) vs studying with distractions (training).

Technical Reasons:

  1. Dropout OFF during validation → Full capacity
  2. No augmentation during validation → Easier samples
  3. Gap is small (5%) and both high → Healthy!

When it's a problem: Gap >10%, or train suspiciously low


Q4: "Why F1 score instead of accuracy?"

Spam Filter Analogy:

  • Dumb filter: "Everything is NOT SPAM" → 95% accuracy but catches ZERO spam!
  • Smart filter: Uses F1 → Balances catching spam (recall) with accuracy (precision)

Our Case:

  • Accuracy: Can be misled by class imbalance
  • F1: Balances precision (user trust) + recall (completeness)
  • Both at 96% → Production-ready

Q5: "What was your biggest challenge?"

Problem: Distinguishing page flips from hand adjustments, camera shake, book rotation.

Solution: Motion features create unique flip signature:

  • Mean: HIGH (lots of movement)
  • Std: HIGH (uneven - edges move more)
  • Max: HIGH (sharp page edge)

vs other motion patterns (LOW, LOW, MEDIUM)

Result: 96% F1 vs 72% with image only


🔥 Pro Tips for Interviews

  1. Start simple, go deep: Begin with analogy, then technical details if asked
  2. Use numbers: "96% F1, 20ms inference" is concrete
  3. Show growth: Mention the frame distribution chart mistake
  4. Connect to business: Always link technical choices to user impact
  5. Be honest: "I learned this from my mentor" shows collaboration

👉 See All 30+ Questions & Answers


🔧 Troubleshooting

Out of Memory Error

BATCH_SIZE = 64 # Reduce from 128
IMAGE_SIZE = 64 # Reduce from 96

Slow Training

num_workers = 8 # Increase for faster data loading
persistent_workers = True # Keep workers alive between epochs
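
For context, these are DataLoader arguments; a hedged example, with train_dataset standing in for the notebook's dataset object:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,              # placeholder for the notebook's dataset object
    batch_size=128,
    shuffle=True,
    num_workers=8,              # parallel worker processes for data loading
    persistent_workers=True,    # keep workers alive between epochs
    pin_memory=True,            # faster host-to-GPU transfers
)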

Poor Performance

  1. Check data quality (visualize samples)
  2. Verify class balance (should be roughly balanced)
  3. Check training curves (look for overfitting)
  4. Try lower threshold (improve recall)

🌟 Key Takeaways

  1. Multi-modal learning works: Combining image + motion features significantly outperforms either alone

  2. Simplicity wins: Single-frame classification is sufficient, no need for complex sequence models

  3. Threshold matters: Default 0.5 is often suboptimal - optimize based on validation F1

  4. Training is noisy: Focus on trends over 3-5 epochs, not individual epoch drops

  5. Regularization is essential: Multiple techniques prevent overfitting on limited data

  6. Intentional analysis: Not all visualizations are useful - ask what question each one answers

  7. Honest self-assessment: Admitting mistakes (like the uninformative frame distribution chart) and learning from them beats pretending everything worked


📚 References & Tools

Technologies Used

  • Python 3.8+ with PyTorch / torchvision
  • NumPy, pandas, matplotlib, seaborn
  • scikit-learn, OpenCV, Pillow, tqdm

Key Concepts

  • Convolutional Neural Networks (CNN)
  • Binary Cross-Entropy Loss
  • Batch Normalization & Dropout
  • Multi-modal learning (image + motion)
  • Threshold optimization
  • F1 Score for imbalanced classification

πŸ“ License

MIT License - See LICENSE file for details


👤 Contact

Krishna Balachandran Nair


πŸ™ Acknowledgments

  • MonReader Team - Project context and real-world application
  • Mentor Guidance - Critical insights on single-frame sufficiency, simplicity, and meaningful analysis
  • PyTorch Community - Excellent documentation and tutorials

Built for MonReader: Making document scanning fully automatic, fast, and accessible for everyone.
