Intelligent page flip detection for automatic document scanning in MonReader
MonReader is a mobile document digitization application designed for:
- The blind and visually impaired - hands-free document scanning
- Researchers - bulk document scanning in seconds
- Everyone - fully automatic, high-speed, high-quality scanning
The Core Challenge: MonReader must automatically detect when a user flips a page to trigger high-resolution capture, corner detection, dewarping, and OCR - all without requiring the user to tap a button.
Traditional button-based scanning is:
- Slow: requires manual interaction for every page
- Error-prone: users must frame each shot perfectly
- Not accessible: blind users cannot aim cameras precisely
What we need: A system that watches low-resolution camera preview and automatically detects the exact moment of page flip to capture a perfect shot.
Page flip detection requires understanding both:
- What's in the frame: hand, page, and book position
  - Like looking at a photo - what do you see?
- How things are changing: motion patterns during the flip
  - Like watching a video - what's moving and how?
A simple motion detector would trigger on any movement (hand adjusting, turning the book, camera shake). We need to specifically recognize the unique movement pattern of a page flip.
A deep learning-based page flip detector that:
- Processes single frames from the low-resolution camera preview
- Detects page flips in 20-50 ms per frame (real-time capable)
- Combines image features (CNN) with motion features (frame differencing)
- Achieves a 96% F1 score - reliable enough for production use
| User Action | MonReader Response |
|---|---|
| 1. Point camera at book | Live preview (low-res) |
| 2. Flip page | Flip detected (our model) |
| 3. Continue flipping | High-res capture triggered → auto crop, dewarp, OCR → next page ready |
```
SINGLE FRAME CLASSIFICATION
(No sequence modeling - simpler, faster)

Input: Current Frame (96×96 RGB)
       + Motion Features (3 values: mean, std, max)
            ↓
CNN extracts spatial features (what's in the frame)
Motion features provide temporal context (how it's changing)
            ↓
Feature Fusion combines both information streams
            ↓
Binary Classification: Flip (1) or Not-Flip (0)
```
Why This Design?
- Rejected: LSTM/RNN sequence modeling (complex, slow, unnecessary)
- Chosen: single-frame CNN + motion features (simple, fast, sufficient)
Key Insight from Mentor: Each frame contains all information needed to detect a flip.
Think of it like this:
- Don't need: a video of 10 frames to see the pattern
- Do need: just one snapshot plus a measure of how much things moved
The motion pattern, hand position, and page curvature in a single moment are enough - no need to analyze sequences of frames.
| Metric | Target | Achieved | Business Impact |
|---|---|---|---|
| False Positive Rate | <5% | 3.2% | Users won't get frustrated by accidental triggers |
| Recall (Catch Rate) | >90% | 95.5% | Catches nearly all flips - complete scanning |
| Inference Speed | <100ms | 20-50ms | Real-time response in mobile app |
| Model Size | <10MB | 4.86MB | Fits on mobile devices |
Performance on Test Set:
- F1 Score: 0.96 (excellent balance)
- Accuracy: 0.96 (high correctness)
- Precision: 0.96 (96% of "flip" predictions are correct)
- Recall: 0.96 (catches 96% of actual flips)
- Specificity: 0.97 (97% of non-flips correctly ignored)
What This Means:
- Out of 100 flips, we catch 96 and miss 4
- Out of 100 "flip" alerts, 96 are real and 4 are false alarms
- Production-ready performance
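As a rough illustration of how these numbers are derived (not the project's actual evaluation code), the metrics can be computed from predicted and true labels with scikit-learn; the sample arrays below are placeholders:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Placeholder labels (1 = flip, 0 = not-flip)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("F1:         ", f1_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))  # not a built-in sklearn scorer
```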
Motion Statistics During Different Actions:
| Action | Mean Motion | Std Motion | Max Motion | Our Prediction |
|---|---|---|---|---|
| Page Flip | HIGH | HIGH | HIGH | FLIP |
| Hand Adjusting | LOW | LOW | MEDIUM | NOT FLIP |
| Camera Shake | MEDIUM | LOW | MEDIUM | NOT FLIP |
| Turning Book | MEDIUM | MEDIUM | MEDIUM | NOT FLIP |
| Static Reading | VERY LOW | VERY LOW | VERY LOW | NOT FLIP |
Key Insight: Page flips have a unique motion signature - high overall motion (mean), non-uniform motion (high std), and sharp edge movements (high max).
- Initially considered: an LSTM to model sequences of frames
- Mentor insight: "Each frame contains all information needed"
- Result: the simpler single-frame CNN approach works just as well and runs significantly faster
Why single-frame works:
- Page curvature visible in one frame
- Hand position indicates flip action
- Motion features provide temporal context
- Action is instantaneous enough
Mistake Made: Created a frame distribution histogram (flip vs. not-flip counts).
Mentor Question: "What do you get from this chart?"
Honest Answer: "Nothing much, just the frame distribution."
Lesson Learned: Always ask:
- What question does this visualization answer?
- What decision does it inform?
- Does it provide actionable insight?
This taught me to be intentional with analysis rather than creating visualizations for their own sake.
Real Training Example:
Epoch 3 Anomaly:
Val F1 dropped from 0.82 → 0.35, then recovered to 0.84
What happened?
- The model became overly cautious (100% precision, 21% recall): it only said "flip" when it was completely sure, but missed 79% of actual flips
- A temporary stuck point during learning
- It fixed itself in the next epoch
Lesson: Don't panic when one training round looks bad - look at the big picture (is it improving overall?)
Simple Explanation: "Training has natural randomness - like flipping a coin, you might get 3 heads in a row even though it should be 50/50. One bad epoch doesn't mean failure. What matters is: Are things getting better when you look at 3-5 training rounds together?"
Interview Version: "This taught me that training has inherent randomness. Individual epochs can fluctuate, but what matters is the overall pattern across 3-5 epochs. In Epoch 3, my model temporarily became too cautious and performance dipped, but by Epoch 4 it recovered and continued improving. This is normal in deep learning."
Final Results:
Training: 89% accuracy, F1=0.86
Validation: 94% accuracy, F1=0.90
Gap: 5% (HEALTHY)
Why This Happens (Simple Explanation):
Think of it like taking a test:
- During training: some neurons are randomly turned off (dropout) and the inputs are made harder (augmentation)
  - Like studying with distractions and harder practice problems
- During validation: full capacity, standard inputs
  - Like taking the actual test in a quiet room with standard questions

So validation being slightly better is normal!
When it's a problem:
- Gap >10% (validation much better than training) → something's wrong
- Training accuracy too low → the model isn't learning properly
When it's healthy:
- Small gap (<5%) → this is normal
- Both metrics high → the model learned well
```
Stage 1: VIDEO INPUT
  Camera Preview → Extract Frames → Store in memory
        ↓
Stage 2: MOTION FEATURE EXTRACTION
  Frame[i] - Frame[i-1] = Difference Image
        ↓
  Convert to grayscale
        ↓
  Calculate:
    • mean_motion: average pixel change (overall activity)
    • std_motion:  motion uniformity (edge emphasis)
    • max_motion:  peak intensity (sharp movements)
        ↓
  3-dimensional motion vector: [mean, std, max]
  CACHED: saved to disk (30 min → 2 sec on reruns)
        ↓
Stage 3: IMAGE PREPROCESSING
  Original frame (varying sizes)
        ↓
  1. Crop to center (focus on the action area)
        ↓
  2. Contrast enhancement (×1.2)
     Why? Sharpen page edges and hand boundaries
        ↓
  3. Sharpness enhancement (×1.1)
     Why? Emphasize motion blur patterns
        ↓
  4. Resize to 96×96 pixels
     Why? Balance: 56×56 is too grainy, 224×224 too slow
        ↓
  Normalized 96×96 RGB image: [0, 1] range
        ↓
Stage 4: FEATURE EXTRACTION (CNN)
  Input: 96×96×3 image
        ↓
  Conv Block 1: [3×3 kernels] → 32 features
    • BatchNorm → ReLU → MaxPool → Dropout(0.1)
    • Learns: basic edges, textures
        ↓
  Conv Block 2: [5×5 kernels] → 64 features   ← LARGER!
    • BatchNorm → ReLU → MaxPool → Dropout(0.15)
    • Learns: motion blur, page curvature
    • Why 5×5? Captures broader patterns
        ↓
  Conv Block 3: [3×3 kernels] → 128 features
    • BatchNorm → ReLU → MaxPool → Dropout(0.2)
    • Learns: hand shapes, page positions
        ↓
  Conv Block 4: [3×3 kernels] → 192 features
    • BatchNorm → ReLU → Global Avg Pool
    • Learns: high-level flip patterns
        ↓
  Image features: 192-dimensional vector
  Key design: multi-scale kernels [3, 5, 3, 3]
    • 3×3: fine details (edges, textures)
    • 5×5: broader patterns (motion, curvature)
        ↓
Stage 5: FEATURE FUSION
  Image features (192) + motion features (3) → 195 dimensions
        ↓
  Dense layer: 195 → 96 neurons
    • BatchNorm → ReLU → Dropout(0.3)
    • Combines: "what I see" + "how it's changing"
        ↓
  Fused features: 96-dimensional vector
        ↓
Stage 6: CLASSIFICATION
  Fused features (96)
        ↓
  Classification layer: 96 → 1 neuron
        ↓
  Sigmoid activation → probability [0, 1]
        ↓
  Threshold: 0.15 (optimized, NOT the default 0.5)
        ↓
  Final prediction:
    • probability > 0.15 → "FLIP" (1)
    • probability ≤ 0.15 → "NOT FLIP" (0)
  Why 0.15? It maximizes F1 score on the validation set
```
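Stage 3 can be sketched roughly as follows with Pillow and NumPy; the crop ratio and the function name are illustrative assumptions rather than the notebook's exact code:

```python
import numpy as np
from PIL import Image, ImageEnhance

def preprocess_frame(path, size=96, crop_ratio=0.9):
    """Center-crop, enhance, resize, and normalize one preview frame (sketch)."""
    img = Image.open(path).convert("RGB")

    # 1. Crop to the center region where the flip action happens
    #    (crop_ratio is an assumed parameter, not from the original code)
    w, h = img.size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch))

    # 2. Contrast enhancement (×1.2) - sharpen page edges and hand boundaries
    img = ImageEnhance.Contrast(img).enhance(1.2)

    # 3. Sharpness enhancement (×1.1) - emphasize motion blur patterns
    img = ImageEnhance.Sharpness(img).enhance(1.1)

    # 4. Resize to 96×96 and scale pixel values to [0, 1]
    img = img.resize((size, size), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0
```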
```
Input layer:
  • Image:  (batch, 3, 96, 96)
  • Motion: (batch, 3)

CONVOLUTIONAL FEATURE EXTRACTOR
  Block 1: Conv2D(3→32, 3×3)
    + BatchNorm + ReLU + MaxPool(2×2) + Dropout2D(0.1)
    Output: (batch, 32, 48, 48)
  Block 2: Conv2D(32→64, 5×5)   ← BIG!
    + BatchNorm + ReLU + MaxPool(2×2) + Dropout2D(0.15)
    Output: (batch, 64, 24, 24)
  Block 3: Conv2D(64→128, 3×3)
    + BatchNorm + ReLU + MaxPool(2×2) + Dropout2D(0.2)
    Output: (batch, 128, 12, 12)
  Block 4: Conv2D(128→192, 3×3)
    + BatchNorm + ReLU + GlobalAvgPool
    Output: (batch, 192)
        ↓
FEATURE FUSION LAYER
  Concatenate: Image features (192) + Motion features (3) = Combined (195)
  Dense(195→96) + BatchNorm + ReLU + Dropout(0.3)
  Output: (batch, 96)
        ↓
CLASSIFICATION HEAD
  Dense(96→1) + Sigmoid
  Output: (batch, 1), probability in [0, 1]
```
Total Parameters: 1,274,753 (~1.27M)
Model Size: 4.86 MB
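A minimal PyTorch sketch of this architecture is shown below. Padding choices and layer names are assumptions made so the shapes in the diagram line up, so the parameter count will not match the notebook exactly:

```python
import torch
import torch.nn as nn

class FlipDetector(nn.Module):
    """Single-frame CNN + motion-feature fusion (sketch, not the exact notebook code)."""

    def __init__(self):
        super().__init__()

        def conv_block(c_in, c_out, k, drop):
            # Conv → BatchNorm → ReLU → MaxPool → Dropout2D, as in the diagram
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
                nn.Dropout2d(drop),
            )

        self.features = nn.Sequential(
            conv_block(3, 32, 3, 0.10),    # (batch, 32, 48, 48)
            conv_block(32, 64, 5, 0.15),   # (batch, 64, 24, 24), larger 5×5 kernel
            conv_block(64, 128, 3, 0.20),  # (batch, 128, 12, 12)
            nn.Conv2d(128, 192, 3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),       # global average pool → (batch, 192, 1, 1)
        )
        self.fusion = nn.Sequential(
            nn.Linear(192 + 3, 96),        # image features + 3 motion features
            nn.BatchNorm1d(96),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
        )
        self.classifier = nn.Linear(96, 1)

    def forward(self, image, motion):
        x = self.features(image).flatten(1)       # (batch, 192)
        x = torch.cat([x, motion], dim=1)         # (batch, 195)
        x = self.fusion(x)
        return torch.sigmoid(self.classifier(x))  # flip probability in [0, 1]

# Quick shape check
model = FlipDetector()
prob = model(torch.randn(2, 3, 96, 96), torch.randn(2, 3))
print(prob.shape)  # torch.Size([2, 1])
```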
Problem: Images alone don't capture dynamics.
Scenario: Hand positioned over page
- Static image says: "Hand near page"
- Motion features add: "Hand moving fast" → likely a flip; or "Hand still" → just hovering

The Magic: Combining both gives the complete picture:
- Image CNN: recognizes WHAT is in the frame (hand, page, book)
  - Like looking at a photo
- Motion features: recognize HOW things are changing (speed, uniformity, peaks)
  - Like comparing two photos side by side
Result: Model understands the action happening, not just the scene frozen in time.
Analogy:
- Image alone = seeing someone with a raised hand → are they waving? Reaching? Stretching?
- Image + motion = seeing the raised hand + detecting fast sideways movement → they're waving!
`motion_features = [mean_motion, std_motion, max_motion]`
Intuition (see the frame-differencing sketch after this list):
- mean_motion (average):
  - High mean → many pixels changing → something is moving
  - Low mean → few pixels changing → mostly static
  - Page flip: HIGH (the whole page moves); hand adjust: LOW (only a small region)
- std_motion (standard deviation):
  - High std → motion is not uniform → some areas move more than others
  - Low std → motion is uniform → everything moves similarly
  - Page flip: HIGH (edges move fast, the center slower); camera shake: LOW (everything moves uniformly)
- max_motion (maximum):
  - High max → sharp edge movements detected
  - Low max → smooth, gradual changes
  - Page flip: HIGH (the page edge creates sharp motion); slow adjustment: LOW (gentle movement)
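A minimal frame-differencing sketch with OpenCV and NumPy; the function name and the exact order of grayscale conversion vs. differencing are assumptions:

```python
import cv2
import numpy as np

def motion_features(prev_frame, curr_frame):
    """Return [mean, std, max] of the absolute grayscale frame difference."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    diff = cv2.absdiff(curr_gray, prev_gray).astype(np.float32)

    return np.array([
        diff.mean(),  # mean_motion: overall activity
        diff.std(),   # std_motion:  how uneven the motion is
        diff.max(),   # max_motion:  sharpest single change (e.g. a page edge)
    ], dtype=np.float32)
```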
Why Not Optical Flow? (Optical Flow = Fancy motion tracking method)
Comparison:
- Optical flow: like tracking every single object's movement with GPS
  - Very accurate but slow (100 ms+)
  - Overkill for our problem
- Our method: like checking "did things move a lot, unevenly, and sharply?"
  - Simple math that works great (~5 ms)
  - Fast enough for real time
Key Lesson: Don't use a sledgehammer to crack a nut.
- Don't add complexity because you CAN
- Add complexity because you MUST
Our simple method works just as well at a fraction of the cost (roughly 5 ms vs. 100+ ms).
Why varied kernel sizes [3×3, 5×5, 3×3, 3×3]?
- 3×3 kernels (Blocks 1, 3, 4): small receptive field, captures fine details
  - Page edges
  - Finger textures
  - Text patterns
- 5×5 kernel (Block 2): larger receptive field, captures broader patterns
  - Motion blur extent
  - Page curvature
  - Hand-page relationship
Why This Matters:
- Fast flips: create broad motion blur → the 5×5 kernels catch it
- Slow flips: sharp page edges → the 3×3 kernels catch it
- Result: Robust to varying flip speeds
Default Thinking:
`prediction = 1 if probability > 0.5 else 0`
Problem: 0.5 is arbitrary!
Our Approach: Test many thresholds, pick best F1 score
| Threshold | Precision | Recall | F1 | What This Means |
|---|---|---|---|---|
| 0.10 | 0.93 | 0.97 | 0.95 | Catches more, some false alarms |
| 0.15 | 0.96 | 0.96 | 0.96 | OPTIMAL (balanced) |
| 0.20 | 0.97 | 0.95 | 0.96 | Slightly more conservative |
| 0.50 | 0.99 | 0.88 | 0.93 | Too conservative (misses flips) |
| 0.90 | 1.00 | 0.67 | 0.80 | Way too conservative |
Simple Explanation: Our model learned conservatively - it gives lower probability scores even when correct.
Why?: The training data had more "not-flip" examples than "flip" examples (data imbalance), so the model learned to be cautious.
The Fix:
- Using the default 0.5 threshold → misses 12% of flips (too strict)
- Using the 0.15 threshold → catches 96% of flips (just right)
Analogy: If your spam filter requires 90% certainty to mark spam, it might miss obvious spam emails. Lower the threshold to 50% certainty, and you catch more spam without many false alarms.
Interview Answer: "I optimized the threshold by testing values from 0.1 to 0.9 on the validation set and selecting the one that maximizes F1 score. The optimal threshold of 0.15 (not 0.5) accounts for class distribution and achieves the best balance between catching flips (recall) and avoiding false alarms (precision)."
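A compact sketch of that sweep, assuming `val_probs` holds the model's sigmoid outputs and `val_labels` the ground-truth labels for the validation set (names are placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(val_probs, val_labels, candidates=np.arange(0.05, 0.95, 0.05)):
    """Pick the decision threshold that maximizes F1 on the validation set."""
    scores = [(t, f1_score(val_labels, (val_probs >= t).astype(int)))
              for t in candidates]
    return max(scores, key=lambda ts: ts[1])  # (threshold, best F1)

# Example usage with placeholder arrays:
# threshold, f1 = best_threshold(val_probs, val_labels)
# print(f"Optimal threshold: {threshold:.2f} (F1 = {f1:.3f})")
```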
The Risk: Small dataset + deep network = overfitting
What is Overfitting? (Simple explanation)
- Model memorizes training data like cramming exam answers
- Gets 100% on practice test but fails real exam
- Learns specific examples, not general patterns
What is Underfitting? (Simple explanation)
- Model too simple to learn even basic patterns
- Like using a ruler to draw curves
- Bad on both training and testing
The Sweet Spot: Model that learns patterns (not memorizes examples) and works on new data
For the full explanation with analogies, see Training Strategy - Overfitting & Underfitting.
Our Defense (5 techniques; a training sketch follows below):
1. Dropout (progressive: 0.1 → 0.15 → 0.2 → 0.3)
   - Why increasing? Early layers learn basic features (edges) and need less regularization; late layers learn complex patterns and are more prone to overfitting.
2. L2 regularization (weight decay = 0.0001)
   - Penalizes large weights and encourages a simpler model.
3. Batch normalization (every layer)
   - Stabilizes training and acts as mild regularization (adds noise).
4. Early stopping (patience = 3)
   - Stops when validation performance stops improving; prevents overfitting to the training set.
5. Data augmentation (rotation ±5°, brightness jitter)
   - Creates variations of the training data so the model sees more diverse examples.

Why all five? Each addresses overfitting from a different angle; the combined effect is very robust.
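A rough PyTorch sketch of how techniques 2, 4, and 5 fit together. The brightness range and the `train_one_epoch`/`evaluate` helpers are hypothetical placeholders, and `model` refers to the network from the architecture sketch above; dropout and batch normalization live inside the model itself:

```python
import torch
from torchvision import transforms

# 5. Data augmentation: small rotations and brightness jitter
#    (the brightness range here is an assumption, not the project's value)
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# 2. L2 regularization via weight decay on the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# 4. Early stopping with patience = 3
#    (train_one_epoch / evaluate are hypothetical helper functions)
best_val_f1, patience, stale_epochs = 0.0, 3, 0
for epoch in range(10):
    train_one_epoch(model, optimizer)
    val_f1 = evaluate(model)
    if val_f1 > best_val_f1:
        best_val_f1, stale_epochs = val_f1, 0
        torch.save(model.state_dict(), "best_model_optimized.pth")
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break
```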
- Quick Reference (5 min)
  - 30-second elevator pitch
  - Key metrics and decisions table
  - Last-minute interview soundbites
- Project Overview (15 min)
  - Business context and motivation
  - Why this approach?
  - Success criteria
- Architecture (30 min)
  - Complete system design
  - Layer-by-layer breakdown
  - Design decision rationale
- Data Pipeline (20 min)
  - Motion feature extraction
  - Image preprocessing steps
  - Caching and optimization
- Training Strategy (30 min)
  - Loss function (BCE) explained
  - Regularization techniques
  - Training noise and validation patterns
  - Learning rate observations
- Evaluation & Results (25 min)
  - Metrics deep dive (precision, recall, F1)
  - Threshold optimization process
  - Real training curves with anomalies
- Mentor Feedback #1 (15 min)
  - First mentor discussion insights
  - Validation > training explanation
- Mentor Insights #2 (20 min)
  - CRITICAL: single-frame vs. sequence decision
  - Why text/content is irrelevant
  - Simplicity vs. complexity philosophy
- Complete Pipeline (30 min)
  - 6-stage pipeline flow
  - Every preprocessing step explained
  - Jargon glossary (all terms defined)
- Visualization Analysis (25 min)
  - Frame distribution chart lessons (honest mistake)
  - Preprocessing image analysis
  - Training metrics deep dive
  - Epoch 3 anomaly explained
  - Interview Q&A for every visualization
- Study Guide (20 min)
  - How to study this project
  - 12 essential interview questions
  - Pre-interview checklist
- Complete Interview Questions (90 min)
  - 30+ interview questions with crystal-clear answers
  - Simple explanations + technical versions + analogies
  - Organized by category (Overview, Architecture, Training, etc.)
  - Quick reference section for last-minute prep
  - Real answers from actual training experience
Total Study Time: ~5-6 hours for complete mastery
```bash
# Python 3.8+
python --version

# Required libraries
pip install torch torchvision
pip install numpy pandas matplotlib seaborn
pip install scikit-learn opencv-python pillow tqdm
```
```
images/
├── training/
│   ├── flip/       # Page flip frames
│   └── notflip/    # Normal frames
└── testing/
    ├── flip/
    └── notflip/
```
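With this layout, `torchvision.datasets.ImageFolder` can load the frames directly (a sketch only; pairing each frame with its motion features still needs a custom dataset):

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((96, 96)),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
])

# Class subfolders ("flip", "notflip") become labels automatically
train_ds = datasets.ImageFolder("images/training", transform=transform)
test_ds  = datasets.ImageFolder("images/testing",  transform=transform)

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
print(train_ds.classes)  # ['flip', 'notflip']
```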
1. Open `page_flip_detection_Sys.ipynb`
2. Update the data path: `base_path = "/path/to/your/images"`
3. Run all cells
Expected outputs:
- Training history plots
- Confusion matrix
- Test metrics
- Saved model: `best_model_optimized.pth`
```python
IMAGE_SIZE = 96             # Input image size (96×96 sweet spot)
BATCH_SIZE = 128            # Batch size for training
NUM_EPOCHS = 10             # Maximum epochs (early stopping triggers earlier)
LEARNING_RATE = 0.001       # Initial learning rate (never reduced in our training!)
EARLY_STOP_PATIENCE = 3     # Epochs without improvement before stopping
USE_MOTION_FEATURES = True  # Enable motion features (critical!)
```
See Complete Interview Questions (30+ Q&A)
All questions organized by category with:
- Simple explanations with analogies
- Technical detailed answers
- Real examples from training
- Quick reference for last-minute prep
30-Second Version: "Page flip detector for blind users. CNN + motion features. 96% F1, 20ms inference. Key insight: single-frame sufficient, no LSTM needed."
Full Answer: "I built a page flip detector for MonReader, a mobile document scanning app for blind users who need hands-free scanning.
Problem: Traditional scanning requires button taps per page - impossible for blind users.
Solution: Real-time CNN combining image features (what's in frame) with motion features (how things change) to detect page flips automatically.
Results: 96% F1 score, 20-50ms inference - production-ready.
Key Insight: My mentor showed that each frame contains all the information needed - no LSTM required. Simplicity wins: significantly faster with the same accuracy."
Simple: Model memorizes training data like cramming exam answers. Gets 100% on practice test, fails real exam.
How I prevented it:
- Dropout (0.1→0.3): randomly turns off neurons
- L2 regularization: penalizes large weights
- Early stopping: stops before memorization sets in
- Data augmentation: variations are harder to memorize
- Batch normalization: adds noise
Result: Training 89%, Validation 94% (healthy!)
Simple: Like taking a test with full brain power (validation) vs studying with distractions (training).
Technical Reasons:
- Dropout is OFF during validation → full capacity
- No augmentation during validation → easier samples
- The gap is small (5%) and both numbers are high → healthy!
When it's a problem: Gap >10%, or train suspiciously low
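The "full capacity during validation" point is just PyTorch's `train()`/`eval()` switch; a tiny sketch:

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.3)
x = torch.ones(1, 10)

layer.train()    # training mode: ~30% of activations zeroed, the rest rescaled
print(layer(x))

layer.eval()     # evaluation mode: dropout is a no-op, full capacity
print(layer(x))  # identical to the input
```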
Spam Filter Analogy:
- Dumb filter: "everything is NOT SPAM" → 95% accuracy but catches ZERO spam!
- Smart filter: uses F1 → balances catching spam (recall) with being right when it flags spam (precision)
Our Case:
- Accuracy: can be misled by class imbalance
- F1: balances precision (user trust) and recall (completeness)
- Both at 96% → production-ready
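A quick numeric illustration of the spam-filter point (synthetic numbers, not from this project's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95 "not spam" (0) and 5 "spam" (1); a dumb model predicts all zeros
y_true = np.array([0] * 95 + [1] * 5)
y_dumb = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_dumb))             # 0.95 - looks great
print(f1_score(y_true, y_dumb, zero_division=0))  # 0.0  - catches nothing
```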
Problem: Distinguishing page flips from hand adjustments, camera shake, book rotation.
Solution: Motion features create unique flip signature:
- Mean: HIGH (lots of movement)
- Std: HIGH (uneven - edges move more)
- Max: HIGH (sharp page edge)
vs other motion patterns (LOW, LOW, MEDIUM)
Result: 96% F1 vs 72% with image only
- Start simple, go deep: Begin with analogy, then technical details if asked
- Use numbers: "96% F1, 20ms inference" is concrete
- Show growth: Mention the frame distribution chart mistake
- Connect to business: Always link technical choices to user impact
- Be honest: "I learned this from my mentor" shows collaboration
See All 30+ Questions & Answers
```python
BATCH_SIZE = 64   # Reduce from 128
IMAGE_SIZE = 64   # Reduce from 96
```
```python
num_workers = 8             # Increase for faster data loading
persistent_workers = True   # Keep workers alive between epochs
```
- Check data quality (visualize samples)
- Verify class balance (should be roughly balanced)
- Check training curves (look for overfitting)
- Try lower threshold (improve recall)
- Multi-modal learning works: combining image and motion features significantly outperforms either alone
- Simplicity wins: single-frame classification is sufficient; no need for complex sequence models
- Threshold matters: the default 0.5 is often suboptimal - optimize based on validation F1
- Training is noisy: focus on trends over 3-5 epochs, not individual epoch drops
- Regularization is essential: multiple techniques prevent overfitting on limited data
- Intentional analysis: not all visualizations are useful - ask what question each one answers
- Honest self-assessment: admitting mistakes (the frame distribution chart) shows growth
- PyTorch - Deep learning framework
- OpenCV - Image processing
- scikit-learn - Metrics and evaluation
- Convolutional Neural Networks (CNN)
- Binary Cross-Entropy Loss
- Batch Normalization & Dropout
- Multi-modal learning (image + motion)
- Threshold optimization
- F1 Score for imbalanced classification
MIT License - See LICENSE file for details
Krishna Balachandran Nair
- GitHub: @yourusername
- LinkedIn: Your Profile
- Email: your.email@example.com
- MonReader Team - Project context and real-world application
- Mentor Guidance - Critical insights on single-frame sufficiency, simplicity, and meaningful analysis
- PyTorch Community - Excellent documentation and tutorials
Built for MonReader: Making document scanning fully automatic, fast, and accessible for everyone.