Systematic benchmark for deploying quantized Vision-Language Models on SWaP-constrained edge hardware.
EdgeVLM-Bench evaluates the three-way trade-off between accuracy, OOD robustness, and resource efficiency (memory, latency, model size) for post-training quantized VLMs. It is designed for engineers and researchers choosing a deployment configuration for a specific edge tier — from embedded platforms (Jetson Nano, ~4 GB) to edge servers (Jetson AGX Orin, ~16 GB).
Post-training quantization (PTQ) compresses VLMs to fit edge memory budgets, but introduces two underexplored failure modes:
- The robustness gap: 4-bit quantization amplifies accuracy degradation under distribution shift (sensor noise, weather, JPEG artifacts) beyond what clean-data accuracy suggests.
- Calibration sensitivity: the composition of calibration data — which modality, which domain — critically affects PTQ quality, especially in multimodal models.
This benchmark provides the tools to measure and compare both effects across a quantization grid, tied to concrete edge deployment budgets.
┌─────────────────────────────────────────────────────────────────┐
│                          EdgeVLM-Bench                          │
│                                                                 │
│  VLM (CLIP/BLIP)                                                │
│     │                                                           │
│     ├──► Calibration ──► PTQ Engine ──► Quantized Model         │
│     │    (balanced /     (minmax /             │                │
│     │     dark / gray     AWQ /                │                │
│     │     noise)          SmoothQuant)         │                │
│     │                                          │                │
│     └────────────────────┬─────────────────────┘                │
│                          │                                      │
│              ┌───────────┴───────────┐                          │
│              │                       │                          │
│         Clean Eval               OOD Eval (×10 corruptions,     │
│         (CIFAR-10)                         ×5 severities each)  │
│       ─ zero-shot acc            ─ accuracy under shift         │
│       ─ feature drift            ─ robustness score (RS)        │
│                                  ─ mean corruption error (mCE)  │
│              │                       │                          │
│              └───────────┬───────────┘                          │
│                          │                                      │
│                      Profiling                                  │
│            ─ inference latency (CUDA events)                    │
│            ─ peak GPU memory                                    │
│            ─ estimated packed model size                        │
│                          │                                      │
│                   Pareto Analysis                               │
│            ─ feasible configs per edge tier                     │
│            ─ Pareto-frontier visualization                      │
└─────────────────────────────────────────────────────────────────┘
| Config | Method | Bits | Inspired by |
|---|---|---|---|
| FP16 | — | 16 | Baseline |
| W8A8-minmax | Per-channel symmetric | W8A8 | Classical PTQ |
| W8A8-SQ | Activation/weight migration | W8A8 | SmoothQuant, ICML 2023 |
| W4A8-AWQ | Activation-aware channel scaling | W4A8 | AWQ, MLSys 2024 |
| W4A8-minmax | Per-channel symmetric | W4A8 | Ablation |
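The W8A8-SQ row relies on migrating activation outliers into the weights. The following is a minimal sketch of that idea, assuming the per-channel smoothing factor described in the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1−α). The function names and toy data are illustrative assumptions, not this repo's API.

```python
def smoothing_scales(acts, weights, alpha=0.5, eps=1e-8):
    """Per-input-channel SmoothQuant-style scales.

    acts:    activation rows, shape [tokens][in_features]
    weights: weight rows,     shape [out_features][in_features]
    """
    n_in = len(weights[0])
    scales = []
    for j in range(n_in):
        a_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(row[j]) for row in weights)
        scales.append((max(a_max, eps) ** alpha) / (max(w_max, eps) ** (1.0 - alpha)))
    return scales


def migrate(acts, weights, scales):
    """Divide activations by s and multiply the matching weight columns by s."""
    acts_s = [[x / s for x, s in zip(row, scales)] for row in acts]
    weights_s = [[w * s for w, s in zip(row, scales)] for row in weights]
    return acts_s, weights_s


def linear(acts, weights):
    """y = x @ W^T for plain nested lists."""
    return [[sum(x * w for x, w in zip(a, w_row)) for w_row in weights] for a in acts]


# Toy data: input channel 1 carries activation outliers.
acts = [[0.1, 40.0, -0.3], [0.2, -55.0, 0.4]]
weights = [[0.5, 0.01, -0.7], [-0.3, 0.02, 0.6]]

scales = smoothing_scales(acts, weights, alpha=0.5)
acts_s, weights_s = migrate(acts, weights, scales)

y_ref = linear(acts, weights)
y_smooth = linear(acts_s, weights_s)
act_range_before = max(abs(x) for row in acts for x in row)
act_range_after = max(abs(x) for row in acts_s for x in row)
```

The layer output is unchanged in full precision, while the activation range shrinks, which is what makes the activations easier to quantize to 8 bits.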
| Category | Corruptions |
|---|---|
| Noise | Gaussian noise, shot noise, impulse noise |
| Blur | Defocus blur, motion blur |
| Weather | Fog, brightness |
| Digital | Contrast, JPEG compression, pixelate |
Each corruption is evaluated at severities 1–5 (10 corruptions × 5 severities = 50 conditions per config).
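As a rough illustration of a severity-indexed corruption, the sketch below adds Gaussian noise to an image represented as a flat list of pixels in [0, 1]. The sigma values and the `gaussian_noise` signature are hypothetical placeholders, not the constants or API used in `corruptions.py`.

```python
import random

# Hypothetical noise standard deviations for severities 1..5 (pixels in [0, 1]).
GAUSSIAN_SIGMA = {1: 0.04, 2: 0.06, 3: 0.08, 4: 0.09, 5: 0.10}


def gaussian_noise(image, severity, seed=0):
    """Add zero-mean Gaussian noise to a flat list of pixels, then clip to [0, 1]."""
    if severity not in GAUSSIAN_SIGMA:
        raise ValueError(f"severity must be in 1..5, got {severity}")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    sigma = GAUSSIAN_SIGMA[severity]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]


image = [0.5] * 1024  # a flat gray 32x32 single-channel image
corrupted = {s: gaussian_noise(image, s) for s in range(1, 6)}
mean_abs_dev = {s: sum(abs(c - 0.5) for c in corrupted[s]) / len(image) for s in corrupted}
```

Higher severities perturb the image more, so `mean_abs_dev` grows with the severity index.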
| Tier | Platform | Memory Budget | Latency Budget |
|---|---|---|---|
| Embedded | Jetson Nano / Xavier NX | 4 GB | 150 ms |
| Mobile | Snapdragon 8 Gen 3 | 8 GB | 80 ms |
| Edge Server | Jetson AGX Orin | 16 GB | 30 ms |
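A configuration is feasible for a tier when it fits both budgets. The sketch below encodes the table as data and checks a measurement against it. The `Budget` class and helper names are assumptions made for illustration (the repo's own `EdgeBudget` may differ), and the measurement is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    name: str
    memory_gb: float
    latency_ms: float


# Budgets copied from the tier table.
TIERS = [
    Budget("embedded", 4.0, 150.0),
    Budget("mobile", 8.0, 80.0),
    Budget("edge_server", 16.0, 30.0),
]


def is_feasible(peak_memory_gb, latency_ms, budget):
    """A config fits a tier only if it meets the memory AND latency budget."""
    return peak_memory_gb <= budget.memory_gb and latency_ms <= budget.latency_ms


def feasible_tiers(peak_memory_gb, latency_ms, tiers=TIERS):
    return [t.name for t in tiers if is_feasible(peak_memory_gb, latency_ms, t)]


# Made-up measurement: 1.2 GB peak memory, 95 ms per inference.
print(feasible_tiers(1.2, 95.0))  # ['embedded']
```

Note that the tiers are not nested: the larger tiers have more memory but tighter latency budgets, so a slow configuration can fit the embedded tier and still fail the edge-server tier.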
(Reproduced on NVIDIA RTX 6000 Ada, CLIP ViT-B/32, CIFAR-10 zero-shot, 1 000 eval samples.)
| Config | Clean Acc | RS (↑) | mCE (↓) | Size (MB) | Compression |
|---|---|---|---|---|---|
| FP16 | 0.901 | 0.516 | 0.436 | 577.1 | 1.0× |
| W8A8-minmax | 0.903 | 0.515 | 0.438 | 224.7 | 2.6× |
| W8A8-SQ | 0.896 | 0.523 | 0.428 | 224.7 | 2.6× |
| W4A8-minmax | 0.677 | 0.419 | 0.393 | 165.9 | 3.5× |
| W4A8-AWQ* | 0.417 | 0.460 | 0.225 | 165.9 | 3.5× |
Key findings
W8 quantization preserves accuracy and robustness nearly exactly. W8A8-minmax matches FP16 within 0.2 pp on both clean accuracy and robustness score at about 2.6× compression (577.1 MB → 224.7 MB). This makes W8 the safe default for edge tiers where memory, not compute, is the binding constraint.
W4 quantization reveals the robustness gap. W4A8-minmax drops 22.4 pp of clean accuracy (0.901 → 0.677) and also loses 18.8% of its robustness score (0.516 → 0.419). Because RS is normalized by clean accuracy, the second drop means the model's relative sensitivity to corruptions increases under 4-bit compression, a floor effect that clean-accuracy benchmarks miss entirely.
AWQ alpha sensitivity on vision encoders. W4A8-AWQ with a fixed α=0.5 underperforms plain minmax at 4 bits (0.417 vs 0.677 clean accuracy). AWQ's per-channel activation scaling was designed for LLMs where activation outliers concentrate in specific channels; CLIP's vision encoder has a different activation distribution, and α needs architecture-specific tuning. This is a concrete instance of the calibration-sensitivity problem studied in *On the Sensitivity of Data-Driven Quantization in Vision–Language Models*.
*W4A8-AWQ uses α=0.5 (fixed); tuning α per-layer is expected to recover significant accuracy.
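To make the role of α concrete, here is a toy sketch of what tuning it might look like, assuming the simplified AWQ-style rule s_j = mean|x_j|^α applied per input channel before 4-bit min-max quantization. The helper names and data are hypothetical, this is not the repo's `PTQLinear` code, and real AWQ includes details (such as group-wise quantization) that are omitted here.

```python
def quantize_row(row, bits):
    """Symmetric min-max fake quantization of one weight row (one output channel)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in row)
    if amax == 0.0:
        return list(row)
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def awq_style_quantize(weights, acts, alpha, bits=4):
    """Scale input channels by mean|x|**alpha, quantize, then fold the scale back."""
    n_in = len(weights[0])
    mean_abs = [sum(abs(a[j]) for a in acts) / len(acts) for j in range(n_in)]
    s = [max(m, 1e-8) ** alpha for m in mean_abs]
    scaled = [[w * sj for w, sj in zip(row, s)] for row in weights]
    quant = [quantize_row(row, bits) for row in scaled]
    # Folding 1/s back into the weights is equivalent to dividing activations by s.
    return [[q / sj for q, sj in zip(row, s)] for row in quant]


def output_mse(weights, weights_q, acts):
    """Mean squared error of the layer output on the calibration activations."""
    err, n = 0.0, 0
    for a in acts:
        for w_row, q_row in zip(weights, weights_q):
            y = sum(x * w for x, w in zip(a, w_row))
            y_q = sum(x * w for x, w in zip(a, q_row))
            err += (y - y_q) ** 2
            n += 1
    return err / n


# Toy layer and calibration activations (input channel 1 is salient).
weights = [[0.8, 0.05, -0.6, 0.1], [-0.4, 0.07, 0.9, -0.2]]
acts = [[0.1, 12.0, -0.2, 0.3], [0.2, -9.0, 0.1, -0.4]]

# Grid search over alpha for this one layer; alpha = 0 reduces to plain min-max.
errors = {alpha: output_mse(weights, awq_style_quantize(weights, acts, alpha), acts)
          for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
best_alpha = min(errors, key=errors.get)
```

Which α wins depends on the layer's weight and activation statistics, which is why a single fixed value can be a poor fit for some layers.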
This repo uses fake quantization — weights are stored as int16 but dequantized to FP32 at runtime. This correctly measures representational quantization error without requiring vendor kernels, but fake-quant is slower than FP16 on GPU because it adds a cast-and-scale step. The measured latencies reflect this:
| Config | Latency (ms) | Model Size (MB) | Notes |
|---|---|---|---|
| FP16 | 14.7 | 577 | native CUDA FP16 kernels |
| W8A8-SQ | 94.4 | 225 | fake-quant overhead |
| W4A8-AWQ | 90.9 | 166 | |
| W4A8-minmax | 152.8 | 166 | |
For production deployment, replace PTQLinear with bitsandbytes, TensorRT INT8, or ONNX Runtime quantized kernels to realize the actual speedup. The purpose of this benchmark is to measure accuracy and robustness trade-offs, not runtime throughput.
conda create -n edgevlm python=3.10 -y
conda activate edgevlm
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e .
Verify GPU:
python -c "import torch; print(torch.cuda.get_device_name(0))"
If running in a cached HuggingFace environment, pass --local-files-only to all scripts.
python scripts/run_benchmark.py \
    --model openai/clip-vit-base-patch32 \
    --data-root ./data \
    --eval-samples 1000 \
    --calib-samples 128 \
    --weight-bits 4 \
    --method awq \
    --corruptions gaussian_noise fog jpeg_compression \
    --severities 1 3 5 \
    --output results/w4_awq.json
python scripts/sweep_corruptions.py \
    --model openai/clip-vit-base-patch32 \
    --data-root ./data \
    --eval-samples 1000 \
    --output-dir results/corruption_sweep
Produces results/corruption_sweep/sweep_results.csv (50 rows per config) and summary.json.
python scripts/profile_edge.py \
    --model openai/clip-vit-base-patch32 \
    --data-root ./data \
    --output-dir results/edge_profile
Outputs latency, memory, and accuracy for all configs. Saves pareto_embedded.png, pareto_mobile.png, pareto_edge_server.png.
python scripts/visualize_results.py \
    --sweep-csv results/corruption_sweep/sweep_results.csv \
    --profile-json results/edge_profile/profile_results.json \
    --output-dir results/figures
EdgeVLM-Bench/
├── edgevlm/
│   ├── calibrate/
│   │   └── observers.py        # AbsPercentileObserver, calibration hooks
│   ├── quantize/
│   │   └── ptq.py              # PTQLinear, QuantConfig, replace_linear_layers
│   ├── robustness/
│   │   └── corruptions.py      # 10 ImageNet-C corruptions × 5 severities
│   ├── profile/
│   │   ├── memory.py           # Peak memory tracking, EdgeBudget
│   │   ├── latency.py          # CUDA event timing, LatencyProfile
│   │   └── pareto.py           # Pareto dominance, frontier computation
│   ├── evaluate/
│   │   └── metrics.py          # zero_shot_accuracy, feature_drift, robustness_score
│   └── visualize/
│       ├── pareto.py           # Publication-quality Pareto scatter plots
│       └── robustness.py       # OOD heatmaps, accuracy-drop bar charts
├── scripts/
│   ├── run_benchmark.py        # Single (model, config) end-to-end run
│   ├── sweep_corruptions.py    # Full grid × corruption sweep
│   ├── profile_edge.py         # Latency/memory profiling + Pareto analysis
│   └── visualize_results.py    # Generate all figures from saved results
├── configs/
│   ├── default.yaml            # Default experiment configuration
│   └── edge_profiles.yaml      # Edge tier budgets and quantization grid
├── tests/
│   ├── test_quantize.py
│   ├── test_corruptions.py
│   └── test_profile.py
└── results/                    # Saved JSON/CSV/PNG outputs (gitignored)
PTQLinear implements fake quantization: weights are stored as int16 but dequantized to FP32 at runtime. This measures representational quantization error accurately without requiring vendor INT4/INT8 kernels. AWQ and SmoothQuant scale factors are folded in at construction time, so they add no forward() overhead beyond the dequantization step itself.
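The fake-quantization idea can be sketched in a few lines: keep integer codes plus one scale per output channel, and dequantize before the matrix multiply. `FakeQuantLinear` below is a toy stand-in written for illustration under those assumptions. It is not the repo's `PTQLinear`, and it uses plain Python integers rather than int16 tensors.

```python
class FakeQuantLinear:
    """Toy fake-quantized linear layer with per-output-channel symmetric scales."""

    def __init__(self, weights, bits=8):
        self.qmax = 2 ** (bits - 1) - 1
        self.codes, self.scales = [], []
        for row in weights:  # one scale per output channel
            amax = max(abs(w) for w in row)
            scale = amax / self.qmax if amax > 0 else 1.0
            self.scales.append(scale)
            self.codes.append(
                [max(-self.qmax, min(self.qmax, round(w / scale))) for w in row]
            )

    def dequantized(self):
        """Reconstruct floating-point weights from integer codes and scales."""
        return [[c * s for c in row] for row, s in zip(self.codes, self.scales)]

    def forward(self, x):
        """Dequantize, then compute y = W x in floating point."""
        return [sum(xi * wi for xi, wi in zip(x, row)) for row in self.dequantized()]


def max_weight_error(weights, layer):
    """Largest absolute difference between original and dequantized weights."""
    return max(
        abs(w - d)
        for w_row, d_row in zip(weights, layer.dequantized())
        for w, d in zip(w_row, d_row)
    )


weights = [[0.31, -0.92, 0.05], [0.47, 0.11, -0.60]]
w8 = FakeQuantLinear(weights, bits=8)
w4 = FakeQuantLinear(weights, bits=4)
err8 = max_weight_error(weights, w8)
err4 = max_weight_error(weights, w4)
```

The reconstruction error is bounded by half a quantization step, so it grows as the bit width shrinks. This is the representational error the benchmark measures, independent of any kernel-level speedup.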
The Robustness Score (RS) is the mean accuracy across all corruption/severity pairs, normalized by clean accuracy. It equals 1.0 for a perfectly robust model and decreases in proportion to the accuracy lost under corruption. The complementary Mean Corruption Error (mCE) is the average absolute accuracy drop (clean accuracy minus mean corrupted accuracy). The name follows ImageNet-C, but unlike the ImageNet-C metric it is not normalized by a baseline model's error.
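The two metrics can be written out directly from the definitions above. The signatures below are assumptions for illustration and may not match `edgevlm/evaluate/metrics.py`; the accuracies are toy values, not measurements.

```python
def robustness_score(clean_acc, corrupted_accs):
    """Mean accuracy over all corruption/severity pairs, divided by clean accuracy."""
    mean_corrupted = sum(corrupted_accs) / len(corrupted_accs)
    return mean_corrupted / clean_acc


def mean_corruption_error(clean_acc, corrupted_accs):
    """Average absolute accuracy drop: clean accuracy minus mean corrupted accuracy."""
    mean_corrupted = sum(corrupted_accs) / len(corrupted_accs)
    return clean_acc - mean_corrupted


# Toy accuracies for three (corruption, severity) pairs.
clean = 0.90
corrupted = [0.60, 0.45, 0.30]

rs = robustness_score(clean, corrupted)         # 0.45 / 0.90, about 0.5
mce = mean_corruption_error(clean, corrupted)   # 0.90 - 0.45, about 0.45
```

Under these definitions mCE = clean accuracy × (1 − RS). The reported FP16 row is consistent with that identity: 0.901 × (1 − 0.516) ≈ 0.436. It also explains why a heavily degraded model can show a lower mCE: with less clean accuracy to lose, the absolute drop shrinks even as RS falls.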
Given a set of (model, config) results, compute_pareto_frontier identifies the non-dominated points in the three-objective space (↑ accuracy, ↓ latency, ↓ memory). filter_by_budget first restricts the set to edge-feasible configs; the frontier over that subset then gives the best achievable trade-offs within the constraint.
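A minimal sketch of budget filtering followed by Pareto dominance, assuming each result carries accuracy, latency, and memory. The `Result` type, the function names, and the data points are hypothetical and only mirror the behaviour described above, not the actual code in `edgevlm/profile/pareto.py`.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_mb: float   # lower is better


def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.latency_ms <= b.latency_ms
                and a.memory_mb <= b.memory_mb)
    better = (a.accuracy > b.accuracy
              or a.latency_ms < b.latency_ms
              or a.memory_mb < b.memory_mb)
    return no_worse and better


def pareto_frontier(results):
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


def within_budget(results, max_latency_ms, max_memory_mb):
    """Drop results that exceed either budget."""
    return [r for r in results
            if r.latency_ms <= max_latency_ms and r.memory_mb <= max_memory_mb]


# Made-up points for illustration only.
results = [
    Result("A", 0.90, 15.0, 600.0),
    Result("B", 0.89, 20.0, 250.0),
    Result("C", 0.70, 25.0, 250.0),  # dominated by B
    Result("D", 0.65, 40.0, 170.0),  # smallest, but too slow for a 30 ms budget
]

feasible = within_budget(results, max_latency_ms=30.0, max_memory_mb=1000.0)
frontier = pareto_frontier(feasible)
print([r.name for r in frontier])  # ['A', 'B']
```

Filtering before computing the frontier matters: D is non-dominated in the full set because it is the smallest model, but it is removed once the latency budget is applied.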
pytest tests/ -v --tb=short
All tests run on CPU; no GPU or HuggingFace downloads required.
- SmoothQuant (Xiao et al., ICML 2023): W8A8 PTQ by migrating activation outliers into weights.
- AWQ (Lin et al., MLSys 2024): Activation-aware weight-only quantization protecting salient channels.
- GPTQ (Frantar et al., ICLR 2023): One-shot second-order weight quantization.
- ImageNet-C (Hendrycks & Dietterich, ICLR 2019): Benchmarking robustness to common corruptions.