
Page Flip Detection System

Intelligent page flip detection for automatic document scanning in MonReader

Python PyTorch License


1️⃣ PROBLEM - Why Did We Build This?

The Business Context

MonReader is a mobile document digitization application designed for:

  • 📚 The blind and visually impaired - Hands-free document scanning
  • 🔬 Researchers - Bulk document scanning in seconds
  • 📖 Everyone - Fully automatic, high-speed, high-quality scanning

The Core Challenge: MonReader must automatically detect when a user flips a page to trigger high-resolution capture, corner detection, dewarping, and OCR - all without requiring the user to tap a button.

Why This Is Hard

Traditional button-based scanning is:

  • ❌ Slow: Requires manual interaction per page
  • ❌ Error-prone: Users must frame shots perfectly
  • ❌ Not accessible: Blind users cannot aim cameras precisely

What we need: A system that watches low-resolution camera preview and automatically detects the exact moment of page flip to capture a perfect shot.

Technical Challenge

Page flip detection requires understanding both:

  1. What's in the frame: Hand, page, book position

    • Like looking at a photo - what do you see?
  2. How things are changing: Motion patterns during flip

    • Like watching a video - what's moving and how?

A simple motion detector would trigger on any movement (hand adjusting, turning the book, camera shake). We need to specifically recognize the unique movement pattern of a page flip.


2️⃣ SOLUTION - What Does It Do?

The System

A deep learning-based page flip detector that:

  • ✅ Processes single frames from low-resolution camera preview
  • ✅ Detects page flips in 20-50ms per frame (real-time capable)
  • ✅ Combines image features (CNN) with motion features (frame differencing)
  • ✅ Achieves 96% F1 score - reliable enough for production use

How Users Experience It

User Action:                MonReader Response:
1. Point camera at book  →  Live preview (low-res)
2. Flip page             →  Flip detected! (our model)
3. Continue flipping     →  High-res capture triggered
                         →  Auto crop, dewarp, OCR
                         →  Next page ready

Key Innovation: Dual-Input Architecture

┌──────────────────────────────────────────────────────────┐
│ SINGLE FRAME CLASSIFICATION │
│ (No sequence modeling - simpler, faster) │
│ │
│ Input: Current Frame (96×96 RGB) │
│ + │
│ Motion Features (3 values: mean, std, max) │
│ ↓ │
│ CNN extracts spatial features (what's in frame) │
│ Motion features provide temporal context (how changing) │
│ ↓ │
│ Feature Fusion combines both information streams │
│ ↓ │
│ Binary Classification: Flip (1) or Not-Flip (0) │
└──────────────────────────────────────────────────────────┘

Why This Design?

❌ Rejected: LSTM/RNN for sequence modeling (complex, slow, unnecessary)
✅ Chosen: Single-frame CNN + motion features (simple, fast, sufficient)

Key Insight from Mentor: Each frame contains all information needed to detect a flip.

Think of it like this:

  • ❌ Don't need: A video of 10 frames to see the pattern
  • ✅ Do need: Just 1 snapshot + how much things moved

The motion pattern, hand position, and page curvature in a single moment are enough - no need to analyze sequences of frames.


3️⃣ RESULT - Did It Work?

Business Metrics (What Matters)

Metric              | Target | Achieved | Business Impact
────────────────────┼────────┼──────────┼───────────────────────────────────────────────────
False Positive Rate | <5%    | 3.2%     | Users won't get frustrated by accidental triggers
Recall (Catch Rate) | >90%   | 95.5%    | Catches nearly all flips - complete scanning
Inference Speed     | <100ms | 20-50ms  | Real-time response in mobile app
Model Size          | <10MB  | 4.86MB   | Fits on mobile devices

Technical Metrics

Performance on Test Set:
├─ F1 Score:    0.96 ✓ (Excellent balance)
├─ Accuracy:    0.96 ✓ (High correctness)
├─ Precision:   0.96 ✓ (96% of "flip" predictions correct)
├─ Recall:      0.96 ✓ (Catches 96% of actual flips)
└─ Specificity: 0.97 ✓ (97% of non-flips correctly ignored)

What This Means:

  • Out of 100 flips, we catch 96 and miss 4
  • Out of 100 "flip" alerts, 96 are real and 4 are false alarms
  • Production-ready performance
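
For reference, a minimal sketch of how these metrics can be reproduced with scikit-learn. The y_true / y_prob arrays are hypothetical placeholders for the test labels and model probabilities, and 0.15 is the tuned decision threshold discussed later.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical test labels (flip = 1) and model probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.90, 0.05, 0.40, 0.20, 0.10, 0.02, 0.80, 0.12])

y_pred = (y_prob > 0.15).astype(int)     # apply the tuned decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("F1:         ", f1_score(y_true, y_pred))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))    # true-negative rate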

Why It Works - Key Insights

1. Motion Features Distinguish Flip from Other Motion

Motion Statistics During Different Actions:
Action          │ Mean Motion │ Std Motion │ Max Motion │ Our Prediction
────────────────┼─────────────┼────────────┼────────────┼────────────────
Page Flip       │ HIGH        │ HIGH       │ HIGH       │ → FLIP ✓
Hand Adjusting  │ LOW         │ LOW        │ MEDIUM     │ → NOT FLIP ✓
Camera Shake    │ MEDIUM      │ LOW        │ MEDIUM     │ → NOT FLIP ✓
Turning Book    │ MEDIUM      │ MEDIUM     │ MEDIUM     │ → NOT FLIP ✓
Static Reading  │ VERY LOW    │ VERY LOW   │ VERY LOW   │ → NOT FLIP ✓

Key Insight: Page flips have a unique motion signature - high overall motion (mean), non-uniform motion (high std), and sharp edge movements (high max).
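
A hedged sketch of how such motion statistics can be computed by frame differencing; the notebook's exact preprocessing may differ, and the frame filenames below are only examples.

import cv2
import numpy as np

def motion_features(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Return [mean, std, max] of absolute pixel change between consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray).astype(np.float32) / 255.0
    return np.array([diff.mean(), diff.std(), diff.max()], dtype=np.float32)

# Example usage (hypothetical file names):
# prev_frame = cv2.imread("frame_0001.jpg")
# curr_frame = cv2.imread("frame_0002.jpg")
# print(motion_features(prev_frame, curr_frame))   # page flip -> high mean, std, max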

2. Single-Frame Classification Is Sufficient

Initially considered: LSTM to model sequences of frames
Mentor insight: "Each frame contains all information needed"
Result: The simpler single-frame CNN approach works just as well and runs much faster

Why single-frame works:

  • Page curvature visible in one frame
  • Hand position indicates flip action
  • Motion features provide temporal context
  • Action is instantaneous enough

3. Honest Learning Journey - Not All Analysis Is Useful

Mistake Made: Created frame distribution histogram (flip vs not-flip counts)
Mentor Question: "What do you get from this chart?"
Honest Answer: "Nothing much, just frame distribution"
Lesson Learned: Always ask:

  • What question does this visualization answer?
  • What decision does it inform?
  • Does it provide actionable insight?

This taught me to be intentional with analysis rather than creating visualizations for their own sake.

4. Training Is Noisy - Focus on Trends

Real Training Example:

Epoch 3 Anomaly:
 Val F1 dropped from 0.82 to 0.35, then recovered to 0.84
 What happened?
 - Model became overly cautious (100% precision, 21% recall)
 Translation: Only said "flip" when 100% sure
 But missed 79% of actual flips!
 - Temporary stuck point during learning
 - Fixed itself in next epoch
 Lesson: Don't panic when one training round looks bad
 Look at the big picture (is it improving overall?)

Simple Explanation: "Training has natural randomness - like flipping a coin, you might get 3 heads in a row even though it should be 50/50. One bad epoch doesn't mean failure. What matters is: Are things getting better when you look at 3-5 training rounds together?"

Interview Version: "This taught me that training has inherent randomness. Individual epochs can fluctuate, but what matters is the overall pattern across 3-5 epochs. In Epoch 3, my model temporarily became too cautious and performance dipped, but by Epoch 4 it recovered and continued improving. This is normal in deep learning."

5. Validation Can Be Higher Than Training (And That's OK!)

Final Results:
 Training: 89% accuracy, F1=0.86
 Validation: 94% accuracy, F1=0.90
 Gap: 5% (HEALTHY)

Why This Happens (Simple Explanation):

Think of it like taking a test:

  1. During training: Some brain cells randomly turned off (dropout), questions made harder (augmentation)

    • Like studying with distractions and harder practice problems
  2. During validation: Full brain power, normal difficulty questions

    • Like taking the actual test in quiet room with standard questions

So validation being slightly better is normal!

When it's a problem:

  • Gap >10% (validation WAY better) → Something's wrong
  • Training accuracy too low → Model not learning properly

When it's healthy:

  • Small gap (<5%) → This is normal! ✓
  • Both metrics high → Model learned well ✓

4️⃣ HOW IT WORKS - System Architecture

Complete Pipeline Flow

┌──────────────────────────────────────────────────────────────┐
│ Stage 1: VIDEO INPUT │
│ Camera Preview → Extract Frames → Store in memory │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 2: MOTION FEATURE EXTRACTION │
│ │
│ Frame[i] - Frame[i-1] = Difference Image │
│ ↓ │
│ Convert to grayscale │
│ ↓ │
│ Calculate: │
│ • mean_motion: Average pixel change (overall activity) │
│ • std_motion: Motion uniformity (edge emphasis) │
│ • max_motion: Peak intensity (sharp movements) │
│ ↓ │
│ 3-dimensional motion vector: [mean, std, max] │
│ │
│ CACHED: Saved to disk (30 min → 2 sec on reruns) │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 3: IMAGE PREPROCESSING │
│ │
│ Original Frame (varying sizes) │
│ ↓ │
│ 1. Crop to Center (focus on action area) │
│ ↓ │
│ 2. Contrast Enhancement (×1.2) │
│ Why? Sharpen page edges and hand boundaries │
│ ↓ │
│ 3. Sharpness Enhancement (×1.1) │
│ Why? Emphasize motion blur patterns │
│ ↓ │
│ 4. Resize to 96×96 pixels │
│ Why? Balance: 56×56 too grainy, 224×224 too slow │
│ ↓ │
│ Normalized 96×96 RGB Image: [0, 1] range │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 4: FEATURE EXTRACTION (CNN) │
│ │
│ Input: 96×96×3 Image │
│ ↓ │
│ Conv Block 1: [3×3 kernels] → 32 features │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.1) │
│ • Learns: Basic edges, textures │
│ ↓ │
│ Conv Block 2: [5×5 kernels] → 64 features ← LARGER! │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.15) │
│ • Learns: Motion blur, page curvature │
│ • Why 5×5? Captures broader patterns │
│ ↓ │
│ Conv Block 3: [3×3 kernels] → 128 features │
│ • BatchNorm → ReLU → MaxPool → Dropout(0.2) │
│ • Learns: Hand shapes, page positions │
│ ↓ │
│ Conv Block 4: [3×3 kernels] → 192 features │
│ • BatchNorm → ReLU → Global Avg Pool │
│ • Learns: High-level flip patterns │
│ ↓ │
│ Image Features: 192-dimensional vector │
│ │
│ Key Design: Multi-scale kernels [3,5,3,3] │
│ • 3×3: Fine details (edges, textures) │
│ • 5×5: Broader patterns (motion, curvature) │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 5: FEATURE FUSION │
│ │
│ Image Features (192) + Motion Features (3) → 195 dimensions │
│ ↓ │
│ Dense Layer: 195 → 96 neurons │
│ • BatchNorm → ReLU → Dropout(0.3) │
│ • Combines: "What I see" + "How it's changing" │
│ ↓ │
│ Fused Features: 96-dimensional vector │
└──────────────┬────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 6: CLASSIFICATION │
│ │
│ Fused Features (96) │
│ ↓ │
│ Classification Layer: 96 → 1 neuron │
│ ↓ │
│ Sigmoid Activation → Probability [0, 1] │
│ ↓ │
│ Threshold: 0.15 (optimized, NOT default 0.5) │
│ ↓ │
│ Final Prediction: │
│ • Probability > 0.15 → "FLIP" (1) │
│ • Probability ≤ 0.15 → "NOT FLIP" (0) │
│ │
│ Why 0.15? Maximizes F1 score on validation set │
└────────────────────────────────────────────────────────────────┘
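
As a hedged sketch, Stage 3 can be expressed with PIL roughly as follows; the center-crop ratio is an assumption, and the notebook's exact values may differ.

import numpy as np
from PIL import Image, ImageEnhance

def preprocess_frame(img: Image.Image, size: int = 96, crop_ratio: float = 0.9) -> np.ndarray:
    """Center crop, enhance contrast/sharpness, resize, and scale to [0, 1]."""
    w, h = img.size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)    # focus on the central action area
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))
    img = ImageEnhance.Contrast(img).enhance(1.2)        # sharpen page edges and hand boundaries
    img = ImageEnhance.Sharpness(img).enhance(1.1)       # emphasize motion blur patterns
    img = img.resize((size, size))
    return np.asarray(img, dtype=np.float32) / 255.0     # normalized 96x96x3 array in [0, 1]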

Model Architecture (Technical View)

Input Layer:
 • Image: (batch, 3, 96, 96)
 • Motion: (batch, 3)
┌──────────────────────────────────────┐
│ CONVOLUTIONAL FEATURE EXTRACTOR │
├──────────────────────────────────────┤
│ Block 1: Conv2D(3→32, 3×3) │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.1) │
│ Output: (batch, 32, 48, 48) │
├──────────────────────────────────────┤
│ Block 2: Conv2D(32→64, 5×5) ← BIG! │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.15) │
│ Output: (batch, 64, 24, 24) │
├──────────────────────────────────────┤
│ Block 3: Conv2D(64→128, 3×3) │
│ + BatchNorm + ReLU │
│ + MaxPool(2×2) │
│ + Dropout2D(0.2) │
│ Output: (batch, 128, 12, 12) │
├──────────────────────────────────────┤
│ Block 4: Conv2D(128→192, 3×3) │
│ + BatchNorm + ReLU │
│ + GlobalAvgPool │
│ Output: (batch, 192) │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ FEATURE FUSION LAYER │
├──────────────────────────────────────┤
│ Concatenate: │
│ Image Features (192) │
│ + Motion Features (3) │
│ = Combined (195) │
├──────────────────────────────────────┤
│ Dense(195→96) │
│ + BatchNorm + ReLU + Dropout(0.3) │
│ Output: (batch, 96) │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│ CLASSIFICATION HEAD │
├──────────────────────────────────────┤
│ Dense(96→1) + Sigmoid │
│ Output: (batch, 1) [0, 1] │
└──────────────────────────────────────┘
Total Parameters: 1,274,753 (~1.27M)
Model Size: 4.86 MB
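
For readers who prefer code, here is a minimal PyTorch sketch of the dual-input layout above. Layer widths follow the diagram, but the actual notebook model (≈1.27M parameters) differs in implementation details, so treat this as illustrative rather than the exact architecture.

import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, drop):
    # Conv -> BatchNorm -> ReLU -> MaxPool -> Dropout2d, halving the spatial size
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout2d(drop),
    )

class PageFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, 3, 0.10),     # 96 -> 48
            conv_block(32, 64, 5, 0.15),    # 48 -> 24, larger kernel for blur/curvature
            conv_block(64, 128, 3, 0.20),   # 24 -> 12
            nn.Conv2d(128, 192, 3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),        # global average pool -> (batch, 192, 1, 1)
        )
        self.fusion = nn.Sequential(        # "what I see" (192) + "how it changes" (3)
            nn.Linear(192 + 3, 96),
            nn.BatchNorm1d(96),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
        )
        self.head = nn.Linear(96, 1)

    def forward(self, image, motion):
        img_feat = self.features(image).flatten(1)                 # (batch, 192)
        fused = self.fusion(torch.cat([img_feat, motion], dim=1))  # (batch, 96)
        return torch.sigmoid(self.head(fused))                     # flip probability in [0, 1]

# model = PageFlipNet()
# prob = model(torch.randn(2, 3, 96, 96), torch.randn(2, 3))      # dummy batch of 2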

🧪 Key Concepts & Nuances (Crystal Clear)

Concept 1: Why Both Image AND Motion Features?

Problem: Images alone don't capture dynamics.

Scenario: Hand positioned over page
Static Image Says:    Motion Features Add:
"Hand near page"      "Hand moving fast" → Likely flip
                      OR "Hand still"    → Just hovering

The Magic: Combining both gives complete picture:

  • Image CNN: Recognizes WHAT is in frame (hand, page, book)

    • Like looking at a photo
  • Motion Features: Recognizes HOW things are changing (speed, uniformity, peaks)

    • Like comparing two photos side-by-side

Result: Model understands the action happening, not just the scene frozen in time.

Analogy:

  • Image alone = Seeing someone with raised hand → Are they waving? Reaching? Stretching?
  • Image + Motion = Seeing raised hand + detecting fast sideways movement → They're waving!

Concept 2: Why These Specific Motion Features?

motion_features = [mean_motion, std_motion, max_motion]

Intuition:

  1. mean_motion (Average):

     High mean → Lots of pixels changing → Something moving
     Low mean → Few pixels changing → Mostly static
     Page flip: HIGH (whole page moving)
     Hand adjust: LOW (only small region)

  2. std_motion (Standard Deviation):

     High std → Motion not uniform → Some areas move more
     Low std → Motion uniform → Everything moves similarly
     Page flip: HIGH (edges move fast, center slower)
     Camera shake: LOW (everything moves uniformly)

  3. max_motion (Maximum):

     High max → Sharp edge movements detected
     Low max → Smooth gradual changes
     Page flip: HIGH (page edge creates sharp motion)
     Slow adjustment: LOW (gentle movement)

Why Not Optical Flow? (Optical Flow = Fancy motion tracking method)

Comparison:

  • Optical Flow: Like tracking every single object's movement with GPS

    • Very accurate but SLOW (100ms+)
    • Overkill for our problem
  • Our Method: Like checking "did things move a lot, unevenly, and sharply?"

    • Simple math but works great (5ms)
    • Fast enough for real-time

Key Lesson: Don't use a sledgehammer to crack a nut.

  • Don't add complexity because you CAN
  • Add complexity because you MUST

Our simple method works just as well at roughly 20× the speed (~5ms vs 100ms+ for optical flow)!

Concept 3: Multi-Scale Feature Extraction

Why Varied Kernel Sizes [3×3, 5×5, 3×3, 3×3]?
┌──────────────────────────────────────┐
│ 3×3 Kernels (Blocks 1, 3, 4): │
│ │
│ ██ ← Sees 3×3 region │
│ Small receptive field │
│ Captures: Fine details │
│ • Page edges │
│ • Finger textures │
│ • Text patterns │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 5×5 Kernel (Block 2): │
│ │
│ ████ ← Sees 5×5 region │
│ ████ │
│ Larger receptive field │
│ Captures: Broader patterns │
│ • Motion blur extent │
│ • Page curvature │
│ • Hand-page relationship │
└──────────────────────────────────────┘

Why This Matters:

  • Fast flips: Create broad motion blur → 5×5 catches it
  • Slow flips: Sharp page edges → 3×3 catches it
  • Result: Robust to varying flip speeds

Concept 4: Threshold Optimization (Why Not 0.5?)

Default Thinking:

prediction = 1 if probability > 0.5 else 0

Problem: 0.5 is arbitrary!

Our Approach: Test many thresholds, pick best F1 score

Threshold | Precision | Recall | F1 | What This Means
──────────┼───────────┼────────┼───────┼─────────────────────────
0.10 | 0.93 | 0.97 | 0.95 | Catch more, some false alarms
0.15 | 0.96 | 0.96 | 0.96 | ← OPTIMAL (balanced)
0.20 | 0.97 | 0.95 | 0.96 | Slightly more conservative
0.50 | 0.99 | 0.88 | 0.93 | Too conservative (misses flips)
0.90 | 1.00 | 0.67 | 0.80 | Way too conservative

Simple Explanation: Our model learned conservatively - it gives lower probability scores even when correct.

Why?: The training data had more "not-flip" examples than "flip" examples (data imbalance), so the model learned to be cautious.

The Fix:

  • Using 0.5 threshold → Misses 12% of flips (too strict)
  • Using 0.15 threshold → Catches 96% of flips (just right!)

Analogy: If your spam filter requires 90% certainty to mark spam, it might miss obvious spam emails. Lower the threshold to 50% certainty, and you catch more spam without many false alarms.

Interview Answer: "I optimized the threshold by testing values from 0.1 to 0.9 on the validation set and selecting the one that maximizes F1 score. The optimal threshold of 0.15 (not 0.5) accounts for class distribution and achieves the best balance between catching flips (recall) and avoiding false alarms (precision)."
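
A sketch of that sweep, assuming the validation labels and probabilities are already available as NumPy arrays (val_labels / val_probs are placeholders):

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(val_labels, val_probs, candidates=np.arange(0.05, 0.95, 0.05)):
    """Return the (threshold, F1) pair that maximizes F1 on the validation set."""
    scores = [(t, f1_score(val_labels, (val_probs > t).astype(int))) for t in candidates]
    return max(scores, key=lambda pair: pair[1])

# thr, f1 = best_threshold(val_labels, val_probs)
# print(f"Optimal threshold {thr:.2f} with F1 {f1:.3f}")   # ~0.15 in this project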

Concept 5: Regularization - Why So Much?

The Risk: Small dataset + deep network = overfitting

What is Overfitting? (Simple explanation)

  • Model memorizes training data like cramming exam answers
  • Gets 100% on practice test but fails real exam
  • Learns specific examples, not general patterns

What is Underfitting? (Simple explanation)

  • Model too simple to learn even basic patterns
  • Like using a ruler to draw curves
  • Bad on both training and testing

The Sweet Spot: Model that learns patterns (not memorizes examples) and works on new data

👉 For full explanation with analogies, see Training Strategy - Overfitting & Underfitting

Our Defense (5 techniques):

  1. Dropout (Progressive: 0.1 → 0.15 → 0.2 → 0.3):

    Why increasing?
    Early layers: Learn basic features (edges) → Need less regularization
    Late layers: Learn complex patterns → More prone to overfitting
    
  2. L2 Regularization (Weight Decay = 0.0001):

    Penalizes large weights
    Encourages simpler model
    
  3. Batch Normalization (Every layer):

    Stabilizes training
    Acts as regularization (adds noise)
    
  4. Early Stopping (Patience = 3):

    Stops when validation stops improving
    Prevents overfitting to training set
    
  5. Data Augmentation (Rotation ±5°, Brightness jitter):

    Creates variations of training data
    Model sees more diverse examples
    

Why All Five?: Each addresses overfitting from different angle. Combined effect is very robust.
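
As a rough illustration, dropout and batch normalization live in the model definition, while the remaining defenses are wired up at training time. The snippet below is a hedged sketch: PageFlipNet is the sketch from the architecture section, the brightness jitter strength is an assumption, and train_one_epoch / evaluate stand in for the notebook's training and validation loops.

import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(5),              # rotation within +/-5 degrees
    transforms.ColorJitter(brightness=0.2),    # brightness jitter (assumed strength)
    transforms.ToTensor(),
])

model = PageFlipNet()                          # sketch from the architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 via weight decay

best_val_f1, bad_epochs, patience = 0.0, 0, 3
# for epoch in range(10):
#     train_one_epoch(model, optimizer, train_loader)   # placeholder training step
#     val_f1 = evaluate(model, val_loader)               # placeholder validation step
#     if val_f1 > best_val_f1:
#         best_val_f1, bad_epochs = val_f1, 0
#         torch.save(model.state_dict(), "best_model_optimized.pth")
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:                     # early stopping
#             break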


📖 Documentation Structure

For Interview Preparation (Read in Order):

  1. Quick Reference (5 min)

    • 30-second elevator pitch
    • Key metrics and decisions table
    • Last-minute interview soundbites
  2. Project Overview (15 min)

    • Business context and motivation
    • Why this approach?
    • Success criteria
  3. Architecture (30 min)

    • Complete system design
    • Layer-by-layer breakdown
    • Design decision rationale
  4. Data Pipeline (20 min)

    • Motion feature extraction
    • Image preprocessing steps
    • Caching and optimization
  5. Training Strategy (30 min)

    • Loss function (BCE) explained
    • Regularization techniques
    • Training noise and validation patterns
    • Learning rate observations
  6. Evaluation & Results (25 min)

    • Metrics deep dive (Precision, Recall, F1)
    • Threshold optimization process
    • Real training curves with anomalies
  7. Mentor Feedback #1 (15 min)

    • First mentor discussion insights
    • Validation > training explanation
  8. Mentor Insights #2 (20 min)

    • CRITICAL: Single-frame vs sequence decision
    • Why text/content is irrelevant
    • Simplicity vs complexity philosophy
  9. Complete Pipeline (30 min)

    • 6-stage pipeline flow
    • Every preprocessing step explained
    • Jargon glossary (all terms defined)
  10. Visualization Analysis (25 min)

    • Frame distribution chart lessons (honest mistake)
    • Preprocessing image analysis
    • Training metrics deep dive
    • Epoch 3 anomaly explained
    • Interview Q&A for every visualization
  11. Study Guide (20 min)

    • How to study this project
    • 12 essential interview questions
    • Pre-interview checklist
  12. Complete Interview Questions (90 min) ⭐

    • 30+ interview questions with crystal-clear answers
    • Simple explanations + technical versions + analogies
    • Organized by category (Overview, Architecture, Training, etc.)
    • Quick reference section for last-minute prep
    • Real answers from actual training experience

Total Study Time: ~5-6 hours for complete mastery


🚀 Quick Start

Prerequisites

# Python 3.8+
python --version
# Required libraries
pip install torch torchvision
pip install numpy pandas matplotlib seaborn
pip install scikit-learn opencv-python pillow tqdm

Data Setup

images/
├── training/
│   ├── flip/      # Page flip frames
│   └── notflip/   # Normal frames
└── testing/
    ├── flip/
    └── notflip/

Run Training

  1. Open page_flip_detection_Sys.ipynb
  2. Update data path: base_path = "/path/to/your/images"
  3. Run all cells

Expected outputs:

  • Training history plots
  • Confusion matrix
  • Test metrics
  • Saved model: best_model_optimized.pth

Key Configuration

IMAGE_SIZE = 96 # Input image size (96×96 sweet spot)
BATCH_SIZE = 128 # Batch size for training
NUM_EPOCHS = 10 # Maximum epochs (early stopping triggers earlier)
LEARNING_RATE = 0.001 # Initial learning rate (never reduced in our training!)
EARLY_STOP_PATIENCE = 3 # Epochs without improvement before stopping
USE_MOTION_FEATURES = True # Enable motion features (critical!)
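
Once training finishes, single-frame inference looks roughly like the following. This is a hedged sketch: it assumes the PageFlipNet, preprocess_frame, and motion_features sketches shown earlier, uses hypothetical frame filenames, and the real checkpoint was produced by the notebook's own model class, so shapes may not match this simplified version exactly.

import cv2
import torch
from PIL import Image

prev_frame = cv2.imread("frame_0001.jpg")    # hypothetical consecutive preview frames
curr_frame = cv2.imread("frame_0002.jpg")

model = PageFlipNet()
model.load_state_dict(torch.load("best_model_optimized.pth", map_location="cpu"))
model.eval()

img = preprocess_frame(Image.open("frame_0002.jpg"))          # (96, 96, 3) in [0, 1]
image = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)   # (1, 3, 96, 96)
motion = torch.from_numpy(motion_features(prev_frame, curr_frame)).unsqueeze(0)  # (1, 3)

with torch.no_grad():
    prob = model(image, motion).item()
print("FLIP" if prob > 0.15 else "NOT FLIP", f"(p={prob:.2f})")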

🎯 Interview Preparation

📚 Complete Interview Guide

👉 See Complete Interview Questions (30+ Q&A)

All questions organized by category with:

  • ✅ Simple explanations with analogies
  • ✅ Technical detailed answers
  • ✅ Real examples from training
  • ✅ Quick reference for last-minute prep

Top 5 Most Common Questions (Quick Reference)

Q1: "Walk me through your project in 2 minutes"

30-Second Version: "Page flip detector for blind users. CNN + motion features. 96% F1, 20ms inference. Key insight: single-frame sufficient, no LSTM needed."

Full Answer: "I built a page flip detector for MonReader, a mobile document scanning app for blind users who need hands-free scanning.

Problem: Traditional scanning requires button taps per page - impossible for blind users.

Solution: Real-time CNN combining image features (what's in frame) with motion features (how things change) to detect page flips automatically.

Results: 96% F1 score, 20-50ms inference - production-ready.

Key Insight: Mentor showed each frame contains all the info needed - no LSTM required. Simplicity wins: much faster with the same accuracy."


Q2: "What is overfitting and how did you prevent it?"

Simple: Model memorizes training data like cramming exam answers. Gets 100% on practice test, fails real exam.

How I prevented it:

  1. Dropout (0.1→0.3): Randomly turn off neurons
  2. L2 Regularization: Penalize large weights
  3. Early Stopping: Stop before memorization
  4. Data Augmentation: Harder to memorize variations
  5. Batch Normalization: Add noise

Result: Train 89%, Validation 94% (healthy! ✓)


Q3: "Why did validation outperform training?"

Simple: Like taking a test with full brain power (validation) vs studying with distractions (training).

Technical Reasons:

  1. Dropout OFF during validation → Full capacity
  2. No augmentation during validation → Easier samples
  3. Gap is small (5%) and both high → Healthy!

When it's a problem: Gap >10%, or train suspiciously low


Q4: "Why F1 score instead of accuracy?"

Spam Filter Analogy:

  • Dumb filter: "Everything is NOT SPAM" → 95% accuracy but catches ZERO spam!
  • Smart filter: Uses F1 → Balances catching spam (recall) with accuracy (precision)

Our Case:

  • Accuracy: Can be misled by class imbalance
  • F1: Balances precision (user trust) + recall (completeness)
  • Both at 96% → Production-ready

Q5: "What was your biggest challenge?"

Problem: Distinguishing page flips from hand adjustments, camera shake, book rotation.

Solution: Motion features create unique flip signature:

  • Mean: HIGH (lots of movement)
  • Std: HIGH (uneven - edges move more)
  • Max: HIGH (sharp page edge)

vs other motion patterns (LOW, LOW, MEDIUM)

Result: 96% F1 vs 72% with image only


🔥 Pro Tips for Interviews

  1. Start simple, go deep: Begin with analogy, then technical details if asked
  2. Use numbers: "96% F1, 20ms inference" is concrete
  3. Show growth: Mention the frame distribution chart mistake
  4. Connect to business: Always link technical choices to user impact
  5. Be honest: "I learned this from my mentor" shows collaboration

👉 See All 30+ Questions & Answers


🔧 Troubleshooting

Out of Memory Error

BATCH_SIZE = 64 # Reduce from 128
IMAGE_SIZE = 64 # Reduce from 96

Slow Training

num_workers = 8 # Increase for faster data loading
persistent_workers = True # Keep workers alive between epochs
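
For context, these are DataLoader arguments; a hedged example, with train_dataset standing in for the notebook's dataset object:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,              # placeholder for the notebook's dataset object
    batch_size=128,
    shuffle=True,
    num_workers=8,              # parallel worker processes for data loading
    persistent_workers=True,    # keep workers alive between epochs
    pin_memory=True,            # faster host-to-GPU transfers
)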

Poor Performance

  1. Check data quality (visualize samples)
  2. Verify class balance (should be roughly balanced)
  3. Check training curves (look for overfitting)
  4. Try lower threshold (improve recall)

🌟 Key Takeaways

  1. Multi-modal learning works: Combining image + motion features significantly outperforms either alone

  2. Simplicity wins: Single-frame classification is sufficient, no need for complex sequence models

  3. Threshold matters: Default 0.5 is often suboptimal - optimize based on validation F1

  4. Training is noisy: Focus on trends over 3-5 epochs, not individual epoch drops

  5. Regularization is essential: Multiple techniques prevent overfitting on limited data

  6. Intentional analysis: Not all visualizations are useful - ask what question each one answers

  7. Honest self-assessment: Admitting mistakes (like the uninformative frame distribution chart) and learning from them beats pretending everything worked


📚 References & Tools

Technologies Used

  • Python 3.8+ with PyTorch / torchvision
  • NumPy, pandas, matplotlib, seaborn
  • scikit-learn, OpenCV, Pillow, tqdm

Key Concepts

  • Convolutional Neural Networks (CNN)
  • Binary Cross-Entropy Loss
  • Batch Normalization & Dropout
  • Multi-modal learning (image + motion)
  • Threshold optimization
  • F1 Score for imbalanced classification

πŸ“ License

MIT License - See LICENSE file for details


👤 Contact

Krishna Balachandran Nair


πŸ™ Acknowledgments

  • MonReader Team - Project context and real-world application
  • Mentor Guidance - Critical insights on single-frame sufficiency, simplicity, and meaningful analysis
  • PyTorch Community - Excellent documentation and tutorials

Built for MonReader: Making document scanning fully automatic, fast, and accessible for everyone.
