parastoopil/EdgeVLM-Bench

EdgeVLM-Bench

Python 3.10+ License: MIT CI

Systematic benchmark for deploying quantized Vision-Language Models on SWaP-constrained edge hardware.

EdgeVLM-Bench evaluates the three-way trade-off between accuracy, OOD robustness, and resource efficiency (memory, latency, model size) for post-training quantized VLMs. It is designed for engineers and researchers choosing a deployment configuration for a specific edge tier — from embedded platforms (Jetson Nano, ~4 GB) to edge servers (Jetson AGX Orin, ~16 GB).


Motivation

Post-training quantization (PTQ) compresses VLMs to fit edge memory budgets, but introduces two underexplored failure modes:

  1. The robustness gap: 4-bit quantization amplifies accuracy degradation under distribution shift (sensor noise, weather, JPEG artifacts) beyond what clean-data accuracy suggests.
  2. Calibration sensitivity: the composition of calibration data — which modality, which domain — critically affects PTQ quality, especially in multimodal models.

This benchmark provides the tools to measure and compare both effects across a quantization grid, tied to concrete edge deployment budgets.


Pipeline Overview

┌─────────────────────────────────────────────────────────────────┐
│                          EdgeVLM-Bench                          │
│                                                                 │
│  VLM (CLIP/BLIP)                                                │
│    │                                                            │
│    ├──► Calibration ──► PTQ Engine ──► Quantized Model          │
│    │    (balanced /     (minmax /          │                    │
│    │     dark / gray     AWQ /             │                    │
│    │     noise)          SmoothQuant)      │                    │
│    │                                       │                    │
│    └──────────────────┬────────────────────┘                    │
│                       │                                         │
│           ┌───────────┴───────────┐                             │
│           │                       │                             │
│      Clean Eval               OOD Eval (10 corruptions ×        │
│      (CIFAR-10)                         5 severities each)      │
│      ─ zero-shot acc          ─ accuracy under shift            │
│      ─ feature drift          ─ robustness score (RS)           │
│                               ─ mean corruption error (mCE)     │
│           │                       │                             │
│           └───────────┬───────────┘                             │
│                       │                                         │
│                   Profiling                                     │
│        ─ inference latency (CUDA events)                        │
│        ─ peak GPU memory                                        │
│        ─ estimated packed model size                            │
│                       │                                         │
│                Pareto Analysis                                  │
│        ─ feasible configs per edge tier                         │
│        ─ Pareto-frontier visualization                          │
└─────────────────────────────────────────────────────────────────┘

Benchmark Dimensions

Quantization Grid

Config         Method                             Bits    Inspired by
──────────────────────────────────────────────────────────────────────────────
FP16           -                                  16      Baseline
W8A8-minmax    Per-channel symmetric              W8A8    Classical PTQ
W8A8-SQ        Activation/weight migration        W8A8    SmoothQuant, ICML 2023
W4A8-AWQ       Activation-aware channel scaling   W4A8    AWQ, MLSys 2024
W4A8-minmax    Per-channel symmetric              W4A8    Ablation
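
To make the "per-channel symmetric" rows concrete, here is a minimal pure-Python sketch of min-max fake quantization. It is not the repository's `PTQLinear` code: the function names and the toy weight matrix are assumptions made for illustration, and the sketch covers weights only (no activation quantization).

```python
def fake_quantize_per_channel(weight, bits):
    """Symmetric per-output-channel min-max quantization, then dequantization.

    `weight` is a list of rows, one row per output channel (illustrative layout).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    dequantized = []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the integer grid, clamp to the symmetric range, map back to floats.
        q_row = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        dequantized.append([q * scale for q in q_row])
    return dequantized


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy weights (hypothetical values, not taken from CLIP).
weight = [
    [0.50, -0.12, 0.03, 0.31],
    [-2.00, 0.40, 0.07, -0.95],
]
err8 = max_abs_error(weight, fake_quantize_per_channel(weight, bits=8))
err4 = max_abs_error(weight, fake_quantize_per_channel(weight, bits=4))
print(f"max |error|  W8: {err8:.4f}   W4: {err4:.4f}")
```

The rounding error per channel is bounded by half the step size (`max_abs / qmax / 2`), which is why moving from 8 to 4 bits coarsens the grid by roughly 18× (127 levels per side versus 7).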

OOD Corruption Suite (ImageNet-C style)

Category    Corruptions
──────────────────────────────────────────────────────────
Noise       Gaussian noise, shot noise, impulse noise
Blur        Defocus blur, motion blur
Weather     Fog, brightness
Digital     Contrast, JPEG compression, pixelate

Each corruption is evaluated at severities 1–5.
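
The evaluation grid implied by this suite can be enumerated as below. This is a sketch only: `gaussian_noise`, `fog`, and `jpeg_compression` appear as CLI values in the Quick Start, while the remaining snake_case identifiers are assumed by analogy and may not match the names used in `corruptions.py`.

```python
from itertools import product

# Category -> corruption identifiers (several names are assumptions, see above).
CORRUPTIONS = {
    "noise": ["gaussian_noise", "shot_noise", "impulse_noise"],
    "blur": ["defocus_blur", "motion_blur"],
    "weather": ["fog", "brightness"],
    "digital": ["contrast", "jpeg_compression", "pixelate"],
}
SEVERITIES = range(1, 6)  # severities 1 through 5

# One entry per (category, corruption, severity) condition.
conditions = [
    (category, name, severity)
    for category, names in CORRUPTIONS.items()
    for name, severity in product(names, SEVERITIES)
]
print(len(conditions))  # 10 corruptions x 5 severities
```

The resulting count matches the "50 rows per config" produced by the full corruption sweep.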

Edge Deployment Tiers

Tier           Platform                   Memory Budget   Latency Budget
──────────────────────────────────────────────────────────────────────────
Embedded       Jetson Nano / Xavier NX    4 GB            150 ms
Mobile         Snapdragon 8 Gen 3         8 GB            80 ms
Edge Server    Jetson AGX Orin            16 GB           30 ms

Key Results

(Reproduced on NVIDIA RTX 6000 Ada, CLIP ViT-B/32, CIFAR-10 zero-shot, 1 000 eval samples.)

Accuracy vs. OOD Robustness Trade-off

Config         Clean Acc   RS (↑)   mCE (↓)   Size (MB)   Compression
──────────────────────────────────────────────────────────────────────
FP16           0.901       0.516    0.436     577.1       1.0×
W8A8-minmax    0.903       0.515    0.438     224.7       2.6×
W8A8-SQ        0.896       0.523    0.428     224.7       2.6×
W4A8-minmax    0.677       0.419    0.393     165.9       3.5×
W4A8-AWQ*      0.417       0.460    0.225     165.9       3.5×

Key findings

  1. W8 quantization preserves accuracy and robustness almost exactly. W8A8-minmax stays within 0.2 pp of FP16 on clean accuracy (0.903 vs 0.901) and within 0.1 pp on robustness score (0.515 vs 0.516), at roughly 2.6× compression (577.1 MB → 224.7 MB). This makes W8 the safe default for edge tiers where memory, not compute, is the binding constraint.

  2. W4 quantization reveals the robustness gap. W4A8-minmax drops 22.4 pp of clean accuracy (0.901 → 0.677) and, on top of that, loses 18.8% of its robustness score (0.516 → 0.419). Because RS is already normalized by clean accuracy, this means the model's relative sensitivity to corruptions increases under 4-bit compression, a floor effect that clean-accuracy benchmarks miss entirely.

  3. AWQ alpha sensitivity on vision encoders. W4A8-AWQ with a fixed α=0.5 underperforms plain minmax at 4 bits (0.417 vs 0.677 clean accuracy). AWQ's per-channel activation scaling was designed for LLMs, where activation outliers concentrate in specific channels; CLIP's vision encoder has a different activation distribution, so α needs architecture-specific tuning. This is a concrete instance of the calibration-sensitivity problem studied in "On the Sensitivity of Data-Driven Quantization in Vision–Language Models".

*W4A8-AWQ uses α=0.5 (fixed); tuning α per-layer is expected to recover significant accuracy.
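
The figures quoted in the findings can be re-derived from the results table. The sketch below only does that arithmetic: the dictionary copies values from the table, and the key names (`acc_drop_pp`, `rs_loss_pct`, `compression`) are labels chosen here for illustration, not fields produced by the benchmark scripts.

```python
# config: (clean_acc, robustness_score, size_mb), copied from the results table
results = {
    "FP16": (0.901, 0.516, 577.1),
    "W8A8-minmax": (0.903, 0.515, 224.7),
    "W4A8-minmax": (0.677, 0.419, 165.9),
    "W4A8-AWQ": (0.417, 0.460, 165.9),
}
base_acc, base_rs, base_size = results["FP16"]

summary = {}
for name, (acc, rs, size) in results.items():
    summary[name] = {
        # Absolute clean-accuracy drop vs FP16, in percentage points.
        "acc_drop_pp": round((base_acc - acc) * 100, 1),
        # Relative loss of robustness score vs FP16, in percent.
        "rs_loss_pct": round((base_rs - rs) / base_rs * 100, 1),
        # Size ratio vs FP16.
        "compression": round(base_size / size, 1),
    }
    print(name, summary[name])
```

Note the two different units: the clean-accuracy drop is absolute (percentage points), while the robustness-score loss is relative to the FP16 score.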

Fake-Quant Latency Note

This repo uses fake quantization — weights are stored as int16 but dequantized to FP32 at runtime. This correctly measures representational quantization error without requiring vendor kernels, but fake-quant is slower than FP16 on GPU because it adds a cast-and-scale step. The measured latencies reflect this:

Config         Latency (ms)   Model Size (MB)
─────────────────────────────────────────────
FP16           14.7           577      ← native CUDA FP16 kernels
W8A8-SQ        94.4           225      ← fake-quant overhead
W4A8-AWQ       90.9           166
W4A8-minmax    152.8          166

For production deployment, replace PTQLinear with bitsandbytes, TensorRT INT8, or ONNX Runtime quantized kernels to realize the actual speedup. The purpose of this benchmark is to measure accuracy and robustness trade-offs, not runtime throughput.


Installation

conda create -n edgevlm python=3.10 -y
conda activate edgevlm
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e .

Verify GPU:

python -c "import torch; print(torch.cuda.get_device_name(0))"

If running in a cached HuggingFace environment, pass --local-files-only to all scripts.


Quick Start

1. Run a single benchmark (W4A8-AWQ, 3 corruptions)

python scripts/run_benchmark.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --eval-samples 1000 \
 --calib-samples 128 \
 --weight-bits 4 \
 --method awq \
 --corruptions gaussian_noise fog jpeg_compression \
 --severities 1 3 5 \
 --output results/w4_awq.json

2. Sweep all corruptions across the full quantization grid

python scripts/sweep_corruptions.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --eval-samples 1000 \
 --output-dir results/corruption_sweep

Produces results/corruption_sweep/sweep_results.csv (50 rows per config) and summary.json.

3. Profile edge deployment configs and generate Pareto plots

python scripts/profile_edge.py \
 --model openai/clip-vit-base-patch32 \
 --data-root ./data \
 --output-dir results/edge_profile

Outputs latency, memory, and accuracy for all configs. Saves pareto_embedded.png, pareto_mobile.png, pareto_edge_server.png.

4. Generate all figures from saved results

python scripts/visualize_results.py \
 --sweep-csv results/corruption_sweep/sweep_results.csv \
 --profile-json results/edge_profile/profile_results.json \
 --output-dir results/figures

Repository Structure

EdgeVLM-Bench/
├── edgevlm/
│   ├── calibrate/
│   │   └── observers.py        # AbsPercentileObserver, calibration hooks
│   ├── quantize/
│   │   └── ptq.py              # PTQLinear, QuantConfig, replace_linear_layers
│   ├── robustness/
│   │   └── corruptions.py      # 10 ImageNet-C corruptions × 5 severities
│   ├── profile/
│   │   ├── memory.py           # Peak memory tracking, EdgeBudget
│   │   ├── latency.py          # CUDA event timing, LatencyProfile
│   │   └── pareto.py           # Pareto dominance, frontier computation
│   ├── evaluate/
│   │   └── metrics.py          # zero_shot_accuracy, feature_drift, robustness_score
│   └── visualize/
│       ├── pareto.py           # Publication-quality Pareto scatter plots
│       └── robustness.py       # OOD heatmaps, accuracy-drop bar charts
├── scripts/
│   ├── run_benchmark.py        # Single (model, config) end-to-end run
│   ├── sweep_corruptions.py    # Full grid × corruption sweep
│   ├── profile_edge.py         # Latency/memory profiling + Pareto analysis
│   └── visualize_results.py    # Generate all figures from saved results
├── configs/
│   ├── default.yaml            # Default experiment configuration
│   └── edge_profiles.yaml      # Edge tier budgets and quantization grid
├── tests/
│   ├── test_quantize.py
│   ├── test_corruptions.py
│   └── test_profile.py
└── results/                    # Saved JSON/CSV/PNG outputs (gitignored)

Design Notes

PTQ Implementation

PTQLinear implements fake quantization: weights are stored as int16 and dequantized to FP32 at runtime. This measures representational quantization error accurately without requiring vendor INT4/INT8 kernels. AWQ and SmoothQuant scale factors are folded in at construction time, so they add no forward() cost beyond the dequantization step itself.
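
The idea behind folding a per-channel scale can be shown on a toy linear layer. This is a hedged sketch, not the `PTQLinear` implementation: the helper names, the toy weights, and the calibration statistics are hypothetical, and the scale rule `s_j = mean|x_j| ** alpha` is the AWQ-style heuristic in its simplest form. The README does not say where the inverse scale is applied, so the sketch applies it explicitly to the input.

```python
def awq_style_scales(act_mean_abs, alpha):
    """Per-input-channel scales s_j = (mean |x_j|) ** alpha (AWQ-style heuristic)."""
    return [m ** alpha for m in act_mean_abs]


def fold_scales(weight, scales):
    """Multiply weight column j by s_j; done once, before the weights are quantized."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


def linear(weight, x):
    """Plain matrix-vector product y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[0.2, -0.5, 0.1], [0.7, 0.05, -0.3]]  # toy weights (hypothetical)
x = [8.0, 0.5, 1.0]                              # channel 0 carries an activation outlier
act_mean_abs = [8.0, 0.5, 1.0]                   # hypothetical calibration statistics

scales = awq_style_scales(act_mean_abs, alpha=0.5)
folded_weight = fold_scales(weight, scales)
x_scaled = [xi / s for xi, s in zip(x, scales)]  # inverse scale on the input side

y_ref = linear(weight, x)
y_folded = linear(folded_weight, x_scaled)
print(y_ref, y_folded)
```

Before rounding, the two outputs are identical, so the transform changes only which values land on the quantization grid. With `alpha=0` every scale is 1 and the layer is untouched; larger values of `alpha` shift more of the activation range into the weights.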

Robustness Metric

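The Robustness Score (RS) is the mean accuracy across all corruption/severity pairs, normalized by clean accuracy. It equals 1.0 for a perfectly robust model and decreases in proportion to the model's sensitivity to corruption. The complementary Mean Corruption Error (mCE) is the average absolute accuracy drop relative to clean accuracy. The name follows ImageNet-C, but the original ImageNet-C mCE is defined differently: it normalizes each corruption's error by that of a baseline model (AlexNet).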
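
As a sketch of these two definitions (function names and the example accuracies are illustrative, not necessarily those in `metrics.py`):

```python
def robustness_score(clean_acc, corrupted_accs):
    """Mean accuracy over all corruption/severity pairs, divided by clean accuracy."""
    return (sum(corrupted_accs) / len(corrupted_accs)) / clean_acc


def mean_corruption_error(clean_acc, corrupted_accs):
    """Average absolute accuracy drop relative to clean accuracy."""
    return clean_acc - sum(corrupted_accs) / len(corrupted_accs)


# Hypothetical accuracies for five corrupted conditions.
clean = 0.90
corrupted = [0.80, 0.60, 0.40, 0.30, 0.20]
rs = robustness_score(clean, corrupted)
mce = mean_corruption_error(clean, corrupted)
print(f"RS = {rs:.3f}, mCE = {mce:.3f}")

# With these definitions, mCE = clean_acc * (1 - RS). The rows of the
# Key Results table (clean_acc, RS, mCE) can be checked against that identity.
table = {
    "FP16": (0.901, 0.516, 0.436),
    "W8A8-minmax": (0.903, 0.515, 0.438),
    "W8A8-SQ": (0.896, 0.523, 0.428),
    "W4A8-minmax": (0.677, 0.419, 0.393),
    "W4A8-AWQ": (0.417, 0.460, 0.225),
}
implied = {name: acc * (1 - rs_t) for name, (acc, rs_t, _) in table.items()}
for name, (_, _, mce_t) in table.items():
    print(f"{name:12s} implied mCE {implied[name]:.3f}  reported {mce_t:.3f}")
```

The reported values agree with the identity to within rounding of the three-decimal table entries. It also explains why a config with low clean accuracy can show a low mCE: the absolute drop is capped by how much accuracy there was to lose.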

Pareto Analysis

Given a set of (model, config) results, compute_pareto_frontier identifies the non-dominated points in the three-objective space (↑ accuracy, ↓ latency, ↓ memory). filter_by_budget first restricts the set to configs that fit an edge tier's memory and latency budgets; the frontier is then computed over the remaining configs, giving the best achievable trade-offs within those constraints.
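
A minimal sketch of budget filtering followed by Pareto dominance is shown below. The function names carry a `_sketch` suffix to make clear that they are illustrative and that the signatures of the repository's `compute_pareto_frontier` and `filter_by_budget` are not given in this README. Two further assumptions: packed model size stands in for memory, since peak GPU memory is not listed here, and the latencies are the fake-quant numbers from the latency note, which do not reflect real quantized kernels.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_mb: float   # lower is better (model size used as a stand-in)


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
                and a.memory_mb <= b.memory_mb)
    better = (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
              or a.memory_mb < b.memory_mb)
    return no_worse and better


def pareto_frontier_sketch(results):
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


def within_budget_sketch(results, max_memory_mb, max_latency_ms):
    """Keep results that satisfy both the memory and the latency budget."""
    return [r for r in results
            if r.memory_mb <= max_memory_mb and r.latency_ms <= max_latency_ms]


results = [
    Result("FP16", 0.901, 14.7, 577),
    Result("W8A8-SQ", 0.896, 94.4, 225),
    Result("W4A8-AWQ", 0.417, 90.9, 166),
    Result("W4A8-minmax", 0.677, 152.8, 166),
]

# Embedded tier: 4 GB memory budget, 150 ms latency budget.
embedded = within_budget_sketch(results, max_memory_mb=4 * 1024, max_latency_ms=150)
frontier = pareto_frontier_sketch(embedded)
print([r.name for r in frontier])
```

Filtering before computing the frontier matters: a config that is infeasible for a tier should not be allowed to dominate, and thereby hide, a feasible one.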


Tests

pytest tests/ -v --tb=short

All tests run on CPU; no GPU or HuggingFace downloads required.

References

  • SmoothQuant (Xiao et al., ICML 2023): W8A8 PTQ by migrating activation outliers into weights.
  • AWQ (Lin et al., MLSys 2024): Activation-aware weight-only quantization protecting salient channels.
  • GPTQ (Frantar et al., ICLR 2023): One-shot second-order weight quantization.
  • ImageNet-C (Hendrycks & Dietterich, ICLR 2019): Benchmarking robustness to common corruptions.
