thunlp/FR-Spec


FR-Spec: Frequency-Ranked Speculative Sampling


Introduction

This is the C/CUDA implementation of FR-Spec.

Surprisingly, EAGLE-2's bottleneck is LM-Head.

Leveraging the 'long-tail' property of token distribution, we achieve a 1.12x speedup over EAGLE-2.

Our method is simple to implement, preserves generation quality, and requires no retraining.
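
As a rough illustration of the idea (not the repository's C/CUDA kernels), the sketch below scores only a subset of frequent token ids with a toy LM head and then maps the winning index back to the full vocabulary. The names `lm_head_logits` and `draft_argmax`, and all toy weights, are assumptions made for this example.

```python
def lm_head_logits(hidden, weight_rows):
    """Dot product of the hidden state with each given LM-head row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight_rows]


def draft_argmax(hidden, lm_head, frequent_ids):
    """Pick the best draft token while scoring only the frequent-token rows.

    lm_head      : full [vocab_size][hidden_dim] weight matrix (toy list of lists)
    frequent_ids : token ids kept in the reduced vocabulary
    Returns a token id in the FULL vocabulary.
    """
    sub_rows = [lm_head[i] for i in frequent_ids]   # rows of the reduced LM head
    logits = lm_head_logits(hidden, sub_rows)       # cost scales with len(frequent_ids)
    best = max(range(len(logits)), key=logits.__getitem__)
    return frequent_ids[best]                       # map subset index -> full vocab id


# Toy data: 6-token vocabulary, hidden size 2 (hypothetical values).
LM_HEAD = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 2.0], [0.1, 0.1]]
FREQUENT_IDS = [0, 2, 4]   # pretend these are the most frequent tokens
HIDDEN = [0.2, 0.9]

token = draft_argmax(HIDDEN, LM_HEAD, FREQUENT_IDS)
```

In this toy case the reduced head scores 3 rows instead of 6. In speculative sampling the target model still verifies the drafted tokens, which is consistent with the claim above that generation quality is preserved.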

👉 Read our paper

Decoding Speed

Decoding speed (tokens/s) of FR-Spec and EAGLE-2 for Llama3-8B and Llama3.2-1B under different frameworks.

News

2025-07-15: Follow-up work: Sparse FFN (BlockFFN) + FR-Spec (paper, code).

2025-06-23: Added support for Qwen2.

2025-06-08: Follow-up work: Sparse ATTN (InfLLM v2) + FR-Spec (paper, code).

2025-05-29: Follow-up work: a systematic analysis of Speculative + Quantization (paper, code).

2025-05-15: Accepted to ACL 2025 (main conference).

2025-03-03: Feature merged into SGLang (link).

2025-03-01: Implementation framework released.

2025-02-26: Token-frequency statistics released.

Installation from source

conda create -n fr-spec python==3.11 && conda activate fr-spec
# install pytorch for your platform, see https://pytorch.org
git clone https://github.com/thunlp/FR-Spec.git --recursive && cd FR-Spec
vim setup.py # change arch="80" to other code for your platform, see https://developer.nvidia.com/cuda-gpus#compute
pip install .
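
If you are unsure which value to put in `arch`, the sketch below prints the compute capability of GPU 0 in the two-digit form used above (for example `80` for compute capability 8.0). It assumes PyTorch is already installed; `arch_string` and `detect_arch` are hypothetical helper names, not part of this repository.

```python
def arch_string(major, minor):
    """Format a CUDA compute capability such as (8, 0) as the string "80"."""
    return f"{major}{minor}"


def detect_arch():
    """Return the arch string for GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_string(major, minor)


if __name__ == "__main__":
    print(detect_arch())
```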

Evaluation

Model Weights

Download the corresponding model weights and save them in the models folder.

Prepare FR-Spec vocabulary subset

You can download our processed token-frequency statistics:

Alternatively, generate your own token-frequency statistics with our script:

cd fr
python fr.py --model_name <model_name> --model_path <model_path> --num_lines <num_lines> --vocab_size <vocab_size>
  • model_name: The name of the model (e.g. llama3-8b-instruct).
  • model_path: The path to the model (e.g. meta-llama/Meta-Llama-3-8B-Instruct).
  • num_lines: Number of lines to process from the SlimPajama dataset. Defaults to 1000000.
  • vocab_size: A list of vocabulary sizes to process. Each size represents a subset of the most frequent tokens to keep. Default values are [8192, 16384, 32768, 65536].

An example command for generating token frequency statistics from 1 million lines of the SlimPajama dataset for the Llama-3-8B-Instruct model:

python fr.py --model_name llama3-8b-instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct --num_lines 1000000 --vocab_size <vocab_size>

The script analyzes token frequency distribution across num_lines of the SlimPajama corpus and saves the most frequent tokens (as specified by vocab_size) to the corresponding directory in fr-index. Copy the generated token-frequency files to the corresponding FR-Spec model folder to enable their use in your experiments.
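
The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of fr.py: the real script uses the model's tokenizer and the SlimPajama dataset, whereas the toy tokenizer, the corpus, and the function names here are assumptions.

```python
from collections import Counter


def rank_tokens_by_frequency(lines, tokenize):
    """Count token ids over all lines and return ids sorted by descending count."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    # Sort by count (descending), then by id, so ties are deterministic.
    return [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def frequent_subsets(ranked_ids, vocab_sizes):
    """Keep the top-k ids for each requested vocabulary size k."""
    return {k: ranked_ids[:k] for k in vocab_sizes}


# Stand-in tokenizer: maps each whitespace-separated word to a made-up integer id.
TOY_VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}


def toy_tokenize(line):
    return [TOY_VOCAB[w] for w in line.split()]


corpus = ["the cat sat on the mat", "the cat sat", "the mat"]
ranked = rank_tokens_by_frequency(corpus, toy_tokenize)
subsets = frequent_subsets(ranked, [2, 4])
```

Each entry in `subsets` plays the role of one vocab_size setting: the same ranking is computed once and then truncated to every requested size.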

🌟 Welcome: We encourage you to upload your processed vocabulary for different models to HuggingFace (model name suffixed with FR-Spec).

Get Started

A simple example of using FR-Spec to generate text:

cd examples
python example_generate.py

Run Evaluation

All scripts for evaluation are located in the scripts folder. Here we use Llama-3-8B-Instruct as an example:

# 1. Run evaluations
bash scripts/<benchmark>/llama3-8b-instruct/run_baseline.sh
bash scripts/<benchmark>/llama3-8b-instruct/run_eagle.sh
bash scripts/<benchmark>/llama3-8b-instruct/run_eagle_fr_spec.sh
# 2. Evaluate speed
bash scripts/<benchmark>/llama3-8b-instruct/speed_up.sh
# 3. Check correctness (for human_eval and gsm8k only)
bash scripts/<benchmark>/llama3-8b-instruct/check_correctness.sh

Replace <benchmark> with one of: spec_bench, human_eval, or gsm8k.
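
For reference, a speedup figure is the ratio of two decoding speeds. The sketch below shows that arithmetic only; it is not what speed_up.sh runs, the function names are hypothetical, and the measurements are made-up placeholders rather than results from the paper.

```python
def tokens_per_second(num_tokens, seconds):
    """Decoding speed for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def speedup(method_tps, baseline_tps):
    """How many times faster the method decodes than the baseline."""
    return method_tps / baseline_tps


# Placeholder measurements (made up for illustration only).
baseline = tokens_per_second(1000, 20.0)
method = tokens_per_second(1000, 8.0)
ratio = speedup(method, baseline)
```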

Acknowledgment

Our experiments are based on https://github.com/SafeAILab/EAGLE and https://github.com/FasterDecoding/Medusa.

The evaluation/ folder is adapted from https://github.com/hemingkx/Spec-Bench.

The src/flash_attn/ folder is adapted from https://github.com/Dao-AILab/flash-attention/blob/v2.4.2/csrc/flash_attn.

Citation

@article{zhao2025fr,
 title={FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling},
 author={Zhao, Weilin and Pan, Tengyu and Han, Xu and Zhang, Yudi and Sun, Ao and Huang, Yuxiang and Zhang, Kaihuo and Zhao, Weilun and Li, Yuxuan and Wang, Jianyong and others},
 journal={arXiv preprint arXiv:2502.14856},
 year={2025}
}

About

[ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling
