Fix OSS 20B vLLM example: add offline-serve workflow (no flash-infer sm7+) - Update run-vllm.md #2041
Summary
This PR updates run-vllm.md to include a tested offline-serve workflow for the OSS 20B model using vLLM. The added snippet replaces the previous example, which failed on GPUs with compute capability below 8.0 with a "Required flash-infer sm7+" error. With the new code, users can load the 20B checkpoint locally, without flash-infer, and obtain correct responses on common data-center GPUs such as the A100, H100, and V100 series.
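For reviewers, the workflow has roughly the shape sketched below. This is a minimal illustration of vLLM's offline-inference API, not the exact diff: the model ID (`openai/gpt-oss-20b`) and the `VLLM_ATTENTION_BACKEND` value are assumptions and may need adjusting for your vLLM version and GPU.

```python
import os

# Steer vLLM away from flash-infer before it is imported.
# "TRITON_ATTN_VLLM_V1" is an assumed backend name; check which
# backends your vLLM build supports (e.g. XFORMERS on pre-sm80 GPUs).
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"

from vllm import LLM, SamplingParams

# Offline serve: load the checkpoint in-process instead of starting
# an API server. "openai/gpt-oss-20b" is the assumed model ID.
llm = LLM(model="openai/gpt-oss-20b")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Explain what offline inference means in vLLM."], params
)

for out in outputs:
    print(out.outputs[0].text)
```

Running the script directly (`python offline_serve.py`) avoids the server startup path entirely, which is where the flash-infer requirement was tripped.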
Motivation
The current cookbook example for running OSS 20B via vLLM does not execute on many setups because:
- it relies on flash-infer, which is unavailable on GPUs with compute capability below 8.0, so loading fails with a "Required flash-infer sm7+" error.
By providing a drop-in replacement that disables flash-infer and switches to offline-serve, this PR:
- lets the 20B checkpoint load and generate correct responses on common data-center GPUs (A100-, H100-, and V100-series) without a flash-infer dependency.
For new content
When contributing new content, read through our contribution guidelines, and mark the following action items as completed:
We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.