FR-Spec: Frequency-Ranked Speculative Sampling

Introduction

This is the C/CUDA implementation of FR-Spec.

Surprisingly, the bottleneck of EAGLE-2 is the LM head.

Leveraging the 'long-tail' property of token distribution, we achieve a 1.12x speedup over EAGLE-2.

Our method is simple to implement, preserves generation quality, and requires no retraining.
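
The idea can be illustrated with a toy sketch. It assumes that the draft step scores only a frequency-ranked subset of the vocabulary with the LM head and then maps the winning position back to a full-vocabulary token id. All names, sizes, and weights below are hypothetical and exist only for illustration; this is not the repository's CUDA code.

```python
# Illustrative sketch only: toy sizes and random weights, not the repository's kernels.
import random


def lm_head_logits(hidden, weight_rows):
    """Dot the hidden state with each given row of the LM-head matrix."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight_rows]


def draft_next_token(hidden, lm_head, frequent_ids):
    """Score only the frequency-ranked subset, then map back to a full-vocab id."""
    # Only len(frequent_ids) rows are used instead of the whole vocabulary.
    subset_rows = [lm_head[i] for i in frequent_ids]
    logits = lm_head_logits(hidden, subset_rows)
    best = max(range(len(logits)), key=logits.__getitem__)
    return frequent_ids[best]


random.seed(0)
vocab_size, hidden_size = 1000, 16  # hypothetical toy dimensions
lm_head = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]
hidden = [random.gauss(0, 1) for _ in range(hidden_size)]
# Hypothetical stand-in for the ids of the most frequent tokens.
frequent_ids = list(range(0, vocab_size, 8))

token = draft_next_token(hidden, lm_head, frequent_ids)
print(token, token in frequent_ids)
```

In this sketch the draft step computes 125 dot products instead of 1000, which is where the saving on a large-vocabulary LM head would come from.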

👉 Read our paper

Decoding Speed

FR-Spec Architecture

Decoding speed (tokens/s) of FR-Spec and EAGLE-2 for Llama3-8B and Llama3.2-1B under different frameworks.

News

2025.07.15 Our subsequent work: Sparse FFN (BlockFFN) + FR-Spec (paper, code).

2025.06.23 Added support for Qwen2.

2025.06.08 Our subsequent work: Sparse ATTN (InfLLM v2) + FR-Spec (paper, code).

2025.05.29 Our subsequent work: a systematic analysis of Speculative + Quantization (paper, code).

2025.05.15 Accepted to ACL 2025 (main conference).

2025.03.03 Feature merged into SGLang (link).

2025.03.01 Implementation framework released.

2025.02.26 Token-frequency statistics released.

Installation from source

conda create -n fr-spec python==3.11 && conda activate fr-spec
# install pytorch for your platform, see https://pytorch.org
git clone https://github.com/thunlp/FR-Spec.git --recursive && cd FR-Spec
vim setup.py # change arch="80" to the compute capability code of your GPU, see https://developer.nvidia.com/cuda-gpus#compute
pip install .
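
If PyTorch is already installed, the value for arch can be looked up programmatically instead of from the NVIDIA table. The snippet below is a minimal sketch: the helper name arch_code is hypothetical, and it assumes the arch string is the compute capability's major and minor numbers concatenated (for example 8.0 becomes "80").

```python
def arch_code(major: int, minor: int) -> str:
    """Format a CUDA compute capability such as (8, 0) as "80"."""
    return f"{major}{minor}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; look up the compute capability manually.")
    else:
        if torch.cuda.is_available():
            # Returns a (major, minor) tuple for the current CUDA device.
            major, minor = torch.cuda.get_device_capability()
            print(f'arch="{arch_code(major, minor)}"')
        else:
            print("No CUDA device is visible to PyTorch.")
```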

Evaluation

Model Weights

Download the corresponding model weights and save them in the models folder.

Prepare FR-Spec vocabulary subset

You can download our processed token-frequency statistics:

Alternatively, you can generate your own token-frequency statistics with our script:

cd fr
python fr.py --model_name <model_name> --model_path <model_path> --num_lines <num_lines> --vocab_size <vocab_size>

model_name: The name of the model (e.g. llama3-8b-instruct).
model_path: The path to the model (e.g. meta-llama/Meta-Llama-3-8B-Instruct).
num_lines: Number of lines to process from the SlimPajama dataset. Defaults to 1000000.
vocab_size: A list of vocabulary sizes to process. Each size represents a subset of the most frequent tokens to keep. Default values are [8192, 16384, 32768, 65536].

An example command for generating token frequency statistics from 1 million lines of the SlimPajama dataset for the Llama-3-8B-Instruct model:

python fr.py --model_name llama3-8b-instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct --num_lines 1000000 --vocab_size <vocab_size>

The script analyzes token frequency distribution across num_lines of the SlimPajama corpus and saves the most frequent tokens (as specified by vocab_size) to the corresponding directory in fr-index. Copy the generated token-frequency files to the corresponding FR-Spec model folder to enable their use in your experiments.
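
The counting step described above can be sketched in a few lines. This is not the code in fr.py: the function name, the whitespace tokenizer, and the three-line corpus are hypothetical stand-ins for the model tokenizer and the SlimPajama data, chosen so the example runs without any downloads.

```python
from collections import Counter


def top_frequent_token_ids(lines, tokenize, vocab_sizes):
    """Count token ids over the lines and keep the most frequent ones per size."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    # Token ids ordered from most to least frequent.
    ranked = [token_id for token_id, _ in counts.most_common()]
    return {size: ranked[:size] for size in vocab_sizes}


# Hypothetical toy tokenizer: assigns ids to whitespace-separated words on first sight.
toy_vocab = {}


def toy_tokenize(text):
    return [toy_vocab.setdefault(word, len(toy_vocab)) for word in text.split()]


corpus = ["the cat sat", "the dog sat", "the end"]
subsets = top_frequent_token_ids(corpus, toy_tokenize, vocab_sizes=[1, 2])
print(subsets)
```

With a real tokenizer, the tokenize argument would return the model's token ids for each line, and vocab_sizes would hold values such as the script's defaults.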

🌟Welcome: We encourage you to upload your processed vocabulary for different models to HuggingFace (model name suffixed with FR-Spec).

Get Started

A simple example of using FR-Spec to generate text:

cd examples
python example_generate.py

Run Evaluation

All scripts for evaluation are located in the scripts folder. Here we use Llama-3-8B-Instruct as an example:

# 1. Run evaluations
bash scripts/<benchmark>/llama3-8b-instruct/run_baseline.sh
bash scripts/<benchmark>/llama3-8b-instruct/run_eagle.sh
bash scripts/<benchmark>/llama3-8b-instruct/run_eagle_fr_spec.sh
# 2. Evaluate speed
bash scripts/<benchmark>/llama3-8b-instruct/speed_up.sh
# 3. Check correctness (for human_eval and gsm8k only)
bash scripts/<benchmark>/llama3-8b-instruct/check_correctness.sh

Replace <benchmark> with one of: spec_bench, human_eval, or gsm8k.
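
For reference, a speedup figure of the kind reported above is a ratio of decoding speeds. The sketch below shows that arithmetic only; it is not what speed_up.sh runs, the function names are hypothetical, and the token counts and timings are made-up numbers rather than measurements.

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Decoding speed for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def speedup(baseline_tps: float, method_tps: float) -> float:
    """How many times faster the method decodes than the baseline."""
    return method_tps / baseline_tps


# Made-up numbers for illustration only.
baseline = tokens_per_second(100, 2.0)
method = tokens_per_second(100, 1.0)
print(f"{speedup(baseline, method):.2f}x")
```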

Contributors

Acknowledgment

Our experiments are based on https://github.com/SafeAILab/EAGLE and https://github.com/FasterDecoding/Medusa.

The evaluation/ folder is modified based on https://github.com/hemingkx/Spec-Bench.

The src/flash_attn/ folder is modified based on https://github.com/Dao-AILab/flash-attention/blob/v2.4.2/csrc/flash_attn.

Citation

@article{zhao2025fr,
 title={FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling},
 author={Zhao, Weilin and Pan, Tengyu and Han, Xu and Zhang, Yudi and Sun, Ao and Huang, Yuxiang and Zhang, Kaihuo and Zhao, Weilun and Li, Yuxuan and Wang, Jianyong and others},
 journal={arXiv preprint arXiv:2502.14856},
 year={2025}
}
