MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing expert-offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a toy sketch of the caching idea follows this list).
  • Supporting LLM acceleration techniques such as FlashAttention.
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency when serving MoEs in a resource-constrained GPU environment (compared with vLLM, HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).
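
To make the caching idea concrete, here is a small illustrative sketch. This is not MoE-Infinity's actual implementation: it is a toy, assuming a simple frequency-based policy, showing how traced expert activations could drive eviction decisions for a GPU-resident expert cache.

# Illustrative toy only, not MoE-Infinity's implementation: a GPU-side expert
# cache that evicts the least-frequently-activated expert, approximating the
# "activation-aware expert caching" idea described above.
from collections import Counter

class ActivationAwareExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity            # number of experts that fit in GPU memory
        self.resident = set()               # expert ids currently resident on the GPU
        self.activation_counts = Counter()  # traced activation frequency per expert

    def record_activation(self, expert_id: int) -> None:
        # expert activation tracing: count how often each expert is routed to
        self.activation_counts[expert_id] += 1

    def fetch(self, expert_id: int) -> None:
        # ensure an expert is resident, evicting the least-activated one if needed
        self.record_activation(expert_id)
        if expert_id in self.resident:
            return
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.activation_counts[e])
            self.resident.remove(victim)    # in practice: move expert weights back to host memory
        self.resident.add(expert_id)        # in practice: copy expert weights from host to GPU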

MoE-Infinity is easy to use: it exposes a simple HuggingFace-style API (see Usage and Examples below).

Note: the open-sourced MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which prioritizes extreme performance above all else. As a result, distributed inference is currently not supported in this open-sourced version.

Performance

Single GPU (NVIDIA A5000, 24 GB memory), per-token latency in seconds for generation on a mixed dataset drawn from LongBench, GSM8K, FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

Method                Switch-large-128   NLLB-MoE-54B   Mixtral-8x7b   DeepSeek-V2-Lite
MoE-Infinity          0.130              0.119          0.735          0.155
Accelerate            1.043              3.071          6.633          1.743
DeepSpeed             4.578              8.381          2.486          0.737
Mixtral Offloading    X                  X              1.752          X
Ollama                X                  X              0.903          1.250
vLLM                  X                  X              2.137          0.493

Installation

We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.

conda create -n moe-infinity python=3.9
conda activate moe-infinity
# installing from either PyPI or source automatically installs the dependencies in requirements.txt

Install from PyPI

# install stable release
pip install moe-infinity
# install nightly release
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity

Install from Source

git clone https://github.com/EfficientMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .
conda install -c conda-forge libstdcxx-ng=12 # assumes a conda environment; otherwise install the GCC 12 libstdc++ runtime via your system package manager

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command.

FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Post-installation, MoE-Infinity will automatically integrate with FlashAttention to enhance performance.
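
As a quick sanity check (optional), you can confirm the build is importable and recent enough; flash_attn is the package name installed by the command above.

import flash_attn

# MoE-Infinity expects FlashAttention >= 2.5.2
print(flash_attn.__version__)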

Usage and Examples

We provide a simple API for diverse setups, including single GPU, multiple GPUs, and multiple nodes. The following examples show how to use MoE-Infinity to run generation on a Huggingface LLM model.

Important Note

  • The offload_path must be unique for each MoE model. Reusing the same offload_path for different MoE models will result in unexpected behavior.
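
One simple way to satisfy this is to derive each offload_path from the model name, e.g. one sub-directory per checkpoint under a common root. The directory names below are just an illustration, not paths the library requires.

import os

offload_root = os.path.join(os.path.expanduser("~"), "moe-infinity")
# one sub-directory per MoE model, so no two models share an offload_path
config_deepseek = {"offload_path": os.path.join(offload_root, "deepseek-v2-lite-chat")}
config_mixtral = {"offload_path": os.path.join(offload_root, "mixtral-8x7b")}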

Sample Code of Huggingface LLM Inference

import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE
user_home = os.path.expanduser('~')
checkpoint = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
config = {
 "offload_path": os.path.join(user_home, "moe-infinity"),
 "device_memory_ratio": 0.75, # 75% of the device memory is used for caching, change the value according to your device memory size on OOM
}
model = MoE(checkpoint, config)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)

Running Inference

This command runs the script on selected GPUs.

CUDA_VISIBLE_DEVICES=0,1 python script.py

We provide a simple example to run inference on a Huggingface LLM model. The script will download the model checkpoint and run inference on the specified input text. The output will be printed to the console.

CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite-Chat" --offload_dir <your local path on SSD>

OpenAI-Compatible Server

Start the OpenAI-compatible server locally

python -m moe_infinity.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --offload-dir ./offload_dir

Query the model via /v1/completions. (We currently support only the required fields, i.e., "model" and "prompt".)

curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
 "prompt": "Hello, my name is"
 }'

You can also use the openai Python package to query the model.

pip install openai
python tests/test_oai_completions.py
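
For example, a minimal sketch using the openai client against the local server; this assumes the server above is running on localhost:8000, and the api_key value is only a placeholder.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # api_key is a placeholder
response = client.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    prompt="Hello, my name is",
)
print(response.choices[0].text)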

Query the model via /v1/chat/completions. (We currently only support the required fields, i.e., "model" and "messages").

curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
 "messages": [
 {"role": "system", "content": "You are a helpful assistant."},
 {"role": "user", "content": "Tell me a joke"}
 ]
 }'

You can also use the openai Python package to query the model.

pip install openai
python tests/test_oai_chat_completions.py
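
Similarly, a minimal sketch for the chat endpoint (same assumptions as the completions example above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # api_key is a placeholder
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
)
print(response.choices[0].message.content)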

Release Plan

We plan to release the following features in the coming months:

  • PyTorch is currently the default inference engine; we are working to support vLLM as an additional inference runtime, including KV cache offloading.
  • Supporting expert parallelism for distributed MoE inference.
  • More (We welcome contributors to join us!)

Citation

If you use MoE-Infinity for your research, please cite our paper:

@misc{moe-infinity,
  author        = {Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  title         = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix = {arXiv},
  eprint        = {2401.14361},
  year          = {2024}
}
