EdgeVLM-Bench

Systematic benchmark for deploying quantized Vision-Language Models on SWaP-constrained edge hardware.

EdgeVLM-Bench evaluates the three-way trade-off between accuracy, OOD robustness, and resource efficiency (memory, latency, model size) for post-training quantized VLMs. It is designed for engineers and researchers choosing a deployment configuration for a specific edge tier — from embedded platforms (Jetson Nano, ~4 GB) to edge servers (Jetson AGX Orin, ~16 GB).

Motivation

Post-training quantization (PTQ) compresses VLMs to fit edge memory budgets, but introduces two underexplored failure modes:

The robustness gap: 4-bit quantization amplifies accuracy degradation under distribution shift (sensor noise, weather, JPEG artifacts) beyond what clean-data accuracy suggests.
Calibration sensitivity: the composition of calibration data — which modality, which domain — critically affects PTQ quality, especially in multimodal models.

This benchmark provides the tools to measure and compare both effects across a quantization grid, tied to concrete edge deployment budgets.

Pipeline Overview

┌─────────────────────────────────────────────────────────────────┐
│                          EdgeVLM-Bench                          │
│                                                                 │
│  VLM (CLIP/BLIP)                                                │
│    │                                                            │
│    ├──► Calibration ──► PTQ Engine ──► Quantized Model          │
│    │    (balanced /     (minmax /             │                 │
│    │     dark / gray     AWQ /                │                 │
│    │     noise)          SmoothQuant)         │                 │
│    │                                          │                 │
│    └─────────────────────┬────────────────────┘                 │
│                          │                                      │
│              ┌───────────┴───────────┐                          │
│              │                       │                          │
│         Clean Eval           OOD Eval (10 corruptions           │
│         (CIFAR-10)            × 5 severities each)              │
│       ─ zero-shot acc        ─ accuracy under shift             │
│       ─ feature drift        ─ robustness score (RS)            │
│                              ─ mean corruption error (mCE)      │
│              │                       │                          │
│              └───────────┬───────────┘                          │
│                          │                                      │
│                      Profiling                                  │
│            ─ inference latency (CUDA events)                    │
│            ─ peak GPU memory                                    │
│            ─ estimated packed model size                        │
│                          │                                      │
│                   Pareto Analysis                               │
│            ─ feasible configs per edge tier                     │
│            ─ Pareto-frontier visualization                      │
└─────────────────────────────────────────────────────────────────┘

Benchmark Dimensions

Quantization Grid

Config	Method	Bits	Inspired by
FP16	—	16	Baseline
W8A8-minmax	Per-channel symmetric	W8A8	Classical PTQ
W8A8-SQ	Activation/weight migration	W8A8	SmoothQuant, ICML 2023
W4A8-AWQ	Activation-aware channel scaling	W4A8	AWQ, MLSys 2024
W4A8-minmax	Per-channel symmetric	W4A8	Ablation
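
To make the "per-channel symmetric" rows concrete, here is a minimal NumPy sketch of min-max fake quantization with one scale per output channel. It is an illustration under stated assumptions, not the repository's PTQLinear: the function name `fake_quantize_per_channel` is hypothetical and the weights are random.

```python
import numpy as np

def fake_quantize_per_channel(weight: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric min-max fake quantization, one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                           # 127 for 8 bits, 7 for 4 bits
    max_abs = np.abs(weight).max(axis=1, keepdims=True)  # per-row dynamic range
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)   # guard all-zero rows
    q = np.clip(np.round(weight / scale), -qmax, qmax)   # snap to the integer grid
    return q * scale                                     # dequantize back to float

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))                             # hypothetical (out, in) weights
err8 = np.abs(w - fake_quantize_per_channel(w, 8)).max()
err4 = np.abs(w - fake_quantize_per_channel(w, 4)).max()
print(f"max abs error: W8={err8:.4f}  W4={err4:.4f}")
```

The rounding error of each row is bounded by half its scale, which is why dropping from 8 to 4 bits (qmax 127 to 7) enlarges the error so sharply.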

OOD Corruption Suite (ImageNet-C style)

Category	Corruptions
Noise	Gaussian noise, shot noise, impulse noise
Blur	Defocus blur, motion blur
Weather	Fog, brightness
Digital	Contrast, JPEG compression, pixelate

Each corruption is evaluated at severities 1–5, for 10 × 5 = 50 corrupted conditions per config.
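
As a sketch of how a severity level can map to a corruption strength, the snippet below applies Gaussian noise at severities 1–5. The sigma values and the function name `gaussian_noise` are assumptions for illustration; the values used in edgevlm/robustness/corruptions.py may differ.

```python
import numpy as np

# Illustrative noise levels per severity; not taken from the repository.
GAUSSIAN_SIGMA = (0.04, 0.06, 0.08, 0.09, 0.10)

def gaussian_noise(image: np.ndarray, severity: int, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a float image with values in [0, 1]."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be in 1..5")
    sigma = GAUSSIAN_SIGMA[severity - 1]
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(scale=sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)                      # keep a valid pixel range

image = np.full((32, 32, 3), 0.5)                        # CIFAR-sized gray image
deviation = [float(np.abs(gaussian_noise(image, s) - image).mean())
             for s in range(1, 6)]
print([round(d, 4) for d in deviation])                  # grows with severity
```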

Edge Deployment Tiers

Tier	Platform	Memory Budget	Latency Budget
Embedded	Jetson Nano / Xavier NX	4 GB	150 ms
Mobile	Snapdragon 8 Gen 3	8 GB	80 ms
Edge Server	Jetson AGX Orin	16 GB	30 ms
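
The tier table can be read as a pair of hard constraints per platform. The following sketch checks which configs fit each tier. The `TierBudget` type, the `feasible` helper, and the measurements are hypothetical and do not reflect the repository's EdgeBudget API or its results.

```python
from typing import NamedTuple

class TierBudget(NamedTuple):
    name: str
    memory_gb: float
    latency_ms: float

# Budgets copied from the tier table above.
TIERS = [
    TierBudget("Embedded", 4.0, 150.0),
    TierBudget("Mobile", 8.0, 80.0),
    TierBudget("Edge Server", 16.0, 30.0),
]

def feasible(configs: dict, tier: TierBudget) -> list:
    """Names of configs whose peak memory and latency both fit the tier."""
    return [name for name, (mem_gb, lat_ms) in configs.items()
            if mem_gb <= tier.memory_gb and lat_ms <= tier.latency_ms]

# Hypothetical (peak memory in GB, latency in ms) pairs, for illustration only.
measured = {"config_a": (1.2, 25.0), "config_b": (0.6, 120.0), "config_c": (5.0, 60.0)}
for tier in TIERS:
    print(tier.name, feasible(measured, tier))
```

Note that the tiers are not nested: the larger memory budgets come with tighter latency budgets, so a config that fits the Embedded tier can still fail the Edge Server tier.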

Key Results

(Reproduced on NVIDIA RTX 6000 Ada, CLIP ViT-B/32, CIFAR-10 zero-shot, 1 000 eval samples.)

Accuracy vs. OOD Robustness Trade-off

Config         Clean Acc   RS (↑)   mCE (↓)   Size (MB)   Compression (vs. FP16 size)
──────────────────────────────────────────────────────────────────────────────────────
FP16           0.901       0.516    0.436     577.1       1.0×
W8A8-minmax    0.903       0.515    0.438     224.7       2.6×
W8A8-SQ        0.896       0.523    0.428     224.7       2.6×
W4A8-minmax    0.677       0.419    0.393     165.9       3.5×
W4A8-AWQ*      0.417       0.460    0.225     165.9       3.5×

Key findings

W8 quantization preserves accuracy and robustness nearly exactly: W8A8-minmax matches FP16 within 0.002 on both clean accuracy (0.903 vs. 0.901) and robustness score (0.515 vs. 0.516) at about 2.6× compression (577.1 MB → 224.7 MB). This makes W8 the safe default for edge tiers where memory, not compute, is the binding constraint.

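W4 quantization reveals the robustness gap. W4A8-minmax drops 22.4 pp in clean accuracy (0.901 → 0.677) and also loses 18.8% of its robustness score (0.516 → 0.419), so accuracy under corruption falls faster than clean accuracy does. The model's relative sensitivity to corruptions increases under 4-bit compression even though the absolute drop (mCE) shrinks from 0.436 to 0.393, because clean accuracy is lower to begin with. This is a floor effect that clean-accuracy benchmarks miss entirely.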

AWQ alpha sensitivity on vision encoders. W4A8-AWQ with a fixed α=0.5 underperforms plain minmax at 4 bits (0.417 vs 0.677 clean accuracy). AWQ's per-channel activation scaling was designed for LLMs where activation outliers concentrate in specific channels; CLIP's vision encoder has a different activation distribution, and α needs architecture-specific tuning. This is a concrete instance of the calibration-sensitivity problem studied in On the Sensitivity of Data-Driven Quantization in Vision–Language Models.

*W4A8-AWQ uses α=0.5 (fixed); tuning α per-layer is expected to recover significant accuracy.
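
To illustrate what tuning α involves, here is a minimal NumPy sketch of an AWQ-style search: each input channel is scaled by its mean activation magnitude raised to α, the scaled weights are quantized, and the α with the lowest output error is kept. The data is synthetic, the helper names `quantize_rows` and `alpha_sweep` are hypothetical, and nothing here reproduces the repository's implementation or its results.

```python
import numpy as np

def quantize_rows(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-output-channel fake quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale > 0, scale, 1.0)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def alpha_sweep(w, x, bits=4, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Output MSE after scaling input channels by mean|x|**alpha, per alpha."""
    act = np.maximum(np.abs(x).mean(axis=0), 1e-8)    # per-input-channel magnitude
    ref = x @ w.T                                     # full-precision output
    errors = {}
    for alpha in alphas:
        s = act ** alpha                              # alpha = 0 is plain min-max
        w_hat = quantize_rows(w * s, bits) / s        # scale, quantize, undo scale
        errors[alpha] = float(np.mean((x @ w_hat.T - ref) ** 2))
    return errors

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32)) * np.linspace(0.1, 10.0, 32)  # uneven channel ranges
w = rng.normal(size=(16, 32))                                # (out, in) weights
errors = alpha_sweep(w, x)
best_alpha = min(errors, key=errors.get)
print(errors, "best alpha:", best_alpha)
```

Because α = 0 is part of the grid, the selected α can never be worse than plain min-max on the calibration batch. A fixed α offers no such guarantee.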

Fake-Quant Latency Note

This repo uses fake quantization — weights are stored as int16 but dequantized to FP32 at runtime. This correctly measures representational quantization error without requiring vendor kernels, but fake-quant is slower than FP16 on GPU because it adds a cast-and-scale step. The measured latencies reflect this:

Config         Latency (ms)   Model Size (MB)
──────────────────────────────────────────────────────────────────────
FP16           14.7           577               ← native CUDA FP16 kernels
W8A8-SQ        94.4           225               ← fake-quant overhead
W4A8-AWQ       90.9           166
W4A8-minmax    152.8          166

For production deployment, replace PTQLinear with bitsandbytes, TensorRT INT8, or ONNX Runtime quantized kernels to realize the actual speedup. The purpose of this benchmark is to measure accuracy and robustness trade-offs, not runtime throughput.

Installation

conda create -n edgevlm python=3.10 -y
conda activate edgevlm
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e .

Verify GPU:

python -c "import torch; print(torch.cuda.get_device_name(0))"

If running in a cached HuggingFace environment, pass --local-files-only to all scripts.

Quick Start

1. Run a single benchmark (W4A8-AWQ, 3 corruptions)

python scripts/run_benchmark.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --eval-samples 1000 \
 --calib-samples 128 \
 --weight-bits 4 \
 --method awq \
 --corruptions gaussian_noise fog jpeg_compression \
 --severities 1 3 5 \
 --output results/w4_awq.json

2. Sweep all corruptions across the full quantization grid

python scripts/sweep_corruptions.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --eval-samples 1000 \
 --output-dir results/corruption_sweep

Produces results/corruption_sweep/sweep_results.csv (50 rows per config) and summary.json.

3. Profile edge deployment configs and generate Pareto plots

python scripts/profile_edge.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --output-dir results/edge_profile

Outputs latency, memory, and accuracy for all configs. Saves pareto_embedded.png, pareto_mobile.png, pareto_edge_server.png.

4. Generate all figures from saved results

python scripts/visualize_results.py \
 --sweep-csv results/corruption_sweep/sweep_results.csv \
 --profile-json results/edge_profile/profile_results.json \
 --output-dir results/figures

Repository Structure

EdgeVLM-Bench/
├── edgevlm/
│   ├── calibrate/
│   │   └── observers.py         # AbsPercentileObserver, calibration hooks
│   ├── quantize/
│   │   └── ptq.py               # PTQLinear, QuantConfig, replace_linear_layers
│   ├── robustness/
│   │   └── corruptions.py       # 10 ImageNet-C corruptions × 5 severities
│   ├── profile/
│   │   ├── memory.py            # Peak memory tracking, EdgeBudget
│   │   ├── latency.py           # CUDA event timing, LatencyProfile
│   │   └── pareto.py            # Pareto dominance, frontier computation
│   ├── evaluate/
│   │   └── metrics.py           # zero_shot_accuracy, feature_drift, robustness_score
│   └── visualize/
│       ├── pareto.py            # Publication-quality Pareto scatter plots
│       └── robustness.py        # OOD heatmaps, accuracy-drop bar charts
├── scripts/
│   ├── run_benchmark.py         # Single (model, config) end-to-end run
│   ├── sweep_corruptions.py     # Full grid × corruption sweep
│   ├── profile_edge.py          # Latency/memory profiling + Pareto analysis
│   └── visualize_results.py     # Generate all figures from saved results
├── configs/
│   ├── default.yaml             # Default experiment configuration
│   └── edge_profiles.yaml       # Edge tier budgets and quantization grid
├── tests/
│   ├── test_quantize.py
│   ├── test_corruptions.py
│   └── test_profile.py
└── results/                     # Saved JSON/CSV/PNG outputs (gitignored)

Design Notes

PTQ Implementation

PTQLinear implements fake quantization: weights are stored as int16 but dequantized to FP32 at runtime. This measures representational quantization error accurately without requiring vendor INT4/INT8 kernels. AWQ and SmoothQuant scale factors are folded in at construction time, so they add no per-call work to forward() beyond that dequantization step.
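
Folding scale factors at construction time works because a per-input-channel scale can be moved from the activations into the weights without changing the layer output. The sketch below checks this identity for a SmoothQuant-style scale with migration strength α. It uses random tensors and is not the repository's code. The division of the activations is written out explicitly here, whereas SmoothQuant folds it into the preceding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))             # activations: (batch, in_features)
w = rng.normal(size=(8, 16))             # weights: (out_features, in_features)

# SmoothQuant-style per-input-channel factor:
#   s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=0) ** (1 - alpha)

x_smooth = x / s                         # activation range is reduced per channel
w_folded = w * s                         # computed once, when the layer is built

y_ref = x @ w.T                          # original layer output
y_folded = x_smooth @ w_folded.T         # same output from the rescaled pair
print(np.allclose(y_ref, y_folded))
```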

Robustness Metric

The Robustness Score (RS) is the mean accuracy across all corruption/severity pairs, divided by clean accuracy. It equals 1.0 for a perfectly robust model and decreases as accuracy under corruption falls. The complementary Mean Corruption Error (mCE) is the average absolute accuracy drop, that is, clean accuracy minus mean corrupted accuracy. The name follows ImageNet-C, but unlike the ImageNet-C metric it is not normalized by a baseline model's error.
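
Both metrics can be computed directly from the definitions in this section. This is a minimal sketch with hypothetical accuracies; the function name `robustness_metrics` is an assumption and is not the repository's robustness_score.

```python
from statistics import mean

def robustness_metrics(clean_acc: float, corrupted_accs: list) -> tuple:
    """Return (RS, mCE) from clean accuracy and per-condition accuracies."""
    mean_corrupted = mean(corrupted_accs)
    rs = mean_corrupted / clean_acc          # 1.0 means no degradation at all
    mce = clean_acc - mean_corrupted         # average absolute accuracy drop
    return rs, mce

# Hypothetical accuracies for three corruption/severity pairs.
rs, mce = robustness_metrics(0.90, [0.60, 0.45, 0.30])
print(f"RS={rs:.3f}  mCE={mce:.3f}")
```

The two metrics are tied together by mCE = clean accuracy × (1 − RS), so a config with low clean accuracy can show a small mCE while still having a poor RS.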

Pareto Analysis

Given a set of (model, config) results, compute_pareto_frontier identifies the non-dominated points in the three-objective space (↑ accuracy, ↓ latency, ↓ memory). A point is non-dominated if no other config is at least as good on all three objectives and strictly better on at least one. filter_by_budget first restricts the set to edge-feasible configs; the frontier then identifies the best achievable trade-offs within that constraint.
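
The dominance test can be sketched in a few lines of plain Python. The helper names `dominates` and `pareto_frontier` and the data points are hypothetical; the signature of the repository's compute_pareto_frontier is not assumed here.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b everywhere and strictly better once.

    Points are (accuracy, latency_ms, memory_mb): higher accuracy is better,
    lower latency and lower memory are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better

def pareto_frontier(points: dict) -> list:
    """Names of points that no other point dominates."""
    return [name for name, p in points.items()
            if not any(dominates(q, p)
                       for other, q in points.items() if other != name)]

# Hypothetical (accuracy, latency_ms, memory_mb) triples, for illustration only.
points = {
    "a": (0.90, 15.0, 600.0),
    "b": (0.89, 90.0, 230.0),
    "c": (0.68, 150.0, 170.0),
    "d": (0.60, 160.0, 240.0),
}
frontier = pareto_frontier(points)
print(frontier)
```

Here "d" is dropped because "b" beats it on every objective, while "a", "b", and "c" each win on at least one axis and therefore all stay on the frontier.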

Tests

pytest tests/ -v --tb=short

All tests run on CPU; no GPU or HuggingFace downloads required.

References

SmoothQuant (Xiao et al., ICML 2023): W8A8 PTQ by migrating activation outliers into weights.
AWQ (Lin et al., MLSys 2024): Activation-aware weight-only quantization protecting salient channels.
GPTQ (Frantar et al., ICLR 2023): One-shot second-order weight quantization.
ImageNet-C (Hendrycks & Dietterich, ICLR 2019): Benchmarking robustness to common corruptions.
