
feat: vLLM backend #2010 (Draft)

gau-nernst wants to merge 93 commits into dev from thien/python_engine
Conversation

@gau-nernst commented Feb 21, 2025 (edited)

Describe Your Changes

High-level design

  • vLLM is an inference engine designed for large-scale serving (many GPUs)
  • cortex will spawn a vLLM subprocess and route requests to it (see the sketch below)
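
To illustrate the routing idea, here is a minimal Python sketch of forwarding a chat completion request to the vLLM subprocess. It assumes vLLM's OpenAI-compatible server is listening on its default port 8000, and the model name is just an example; the actual cortex implementation is C++ and its port and payload handling may differ.

```python
import json
import urllib.request

def forward_chat_completion(payload: dict, vllm_port: int = 8000) -> dict:
    # `vllm serve` exposes an OpenAI-compatible /v1/chat/completions endpoint
    req = urllib.request.Request(
        f"http://127.0.0.1:{vllm_port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Hypothetical request in the same shape cortex receives on its own OpenAI-compatible API
print(forward_chat_completion({
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}))
```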

cortex engines install vllm

  • Download uv to cortexcpp/python_engines/bin/uv if uv is not already installed
  • (via uv) Set up a venv at cortexcpp/python_engines/envs/vllm/<version>/.venv
  • (via uv) Install vllm and its dependencies (see the install-flow sketch below)
  • Known issues:
    • Progress streaming is not supported (since the download is done via uv instead of DownloadService).
    • The install is not async since we need to wait for the uv subprocess to finish (we may need a new SubprocessService in the future that handles an async WaitProcess()).
    • Hence, stopping and resuming the download also does not work.

Note:

  • All cached Python packages are stored in cortexcpp/python_engines/cache/uv, so that when we remove the python_engines folder we can be sure nothing is left behind.
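
A rough Python sketch of the install flow above. The real implementation lives in cortex's C++ code; the exact uv flags, the vllm version, and the cortexcpp data path used here are assumptions for illustration.

```python
import os
import subprocess

DATA_DIR = os.path.expanduser("~/cortexcpp/python_engines")  # assumed data location
UV_BIN = os.path.join(DATA_DIR, "bin", "uv")                 # uv downloaded here by cortex
VLLM_VERSION = "0.7.3"                                       # hypothetical version
VENV_DIR = os.path.join(DATA_DIR, "envs", "vllm", VLLM_VERSION, ".venv")

# Keep all cached wheels inside python_engines so removing the folder leaves nothing behind.
env = {**os.environ, "UV_CACHE_DIR": os.path.join(DATA_DIR, "cache", "uv")}

# 1. Create the venv for this vllm version.
subprocess.run([UV_BIN, "venv", VENV_DIR], env=env, check=True)

# 2. Install vllm and its dependencies into that venv (POSIX venv layout assumed).
subprocess.run(
    [UV_BIN, "pip", "install",
     "--python", os.path.join(VENV_DIR, "bin", "python"),
     f"vllm=={VLLM_VERSION}"],
    env=env,
    check=True,
)
```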

cortex models start <model>

  • Spawn vllm serve from the installed venv (see the sketch below)
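
A minimal sketch of what starting a model might look like, assuming the venv layout from the install step; the model name, port, and flags are examples rather than what cortex actually passes.

```python
import os
import subprocess

VENV_DIR = os.path.expanduser("~/cortexcpp/python_engines/envs/vllm/0.7.3/.venv")  # assumed path
vllm_bin = os.path.join(VENV_DIR, "bin", "vllm")

# Launch vLLM's OpenAI-compatible server as a child process.
proc = subprocess.Popen(
    [vllm_bin, "serve", "Qwen/Qwen2.5-0.5B-Instruct", "--port", "8000"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
# cortex would keep this process handle so `cortex models stop` and `cortex ps` can manage it later.
```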

TODO:

  • cortex engines install vllm (TODO: async install in a separate thread)
  • Set default engine variant
  • cortex engines load vllm
  • cortex engines list
  • cortex engines uninstall vllm: delete cortexcpp/python_engines/envs/vllm/<version>
  • cortex pull <model>
  • cortex models list
  • cortex models start <model>: spawn vllm serve
  • cortex models stop <model>
  • cortex ps
  • Chat completion
    • Non-streaming
    • Streaming (see the streaming sketch after this list)
  • Embeddings
  • cortex run
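
For the streaming item, the passthrough would relay vLLM's server-sent-event chunks back to the client. A hedged Python sketch, assuming the same local vLLM server, port, and example model as above:

```python
import json
import urllib.request

def stream_chat_completion(payload: dict, vllm_port: int = 8000):
    # Ask vLLM for a streaming response and yield its SSE lines one by one.
    req = urllib.request.Request(
        f"http://127.0.0.1:{vllm_port}/v1/chat/completions",
        data=json.dumps({**payload, "stream": True}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw_line in resp:                 # HTTPResponse is iterable line by line
            line = raw_line.decode("utf-8").strip()
            if line:
                yield line                    # e.g. "data: {...}" or "data: [DONE]"

# Relay each chunk to the caller as it arrives.
for chunk in stream_chat_completion({
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hi"}],
}):
    print(chunk)
```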

Fixes Issues

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

gau-nernst added 16 commits March 18, 2025 13:06
gau-nernst moved this from Icebox to In Progress in Menlo on Mar 20, 2025
Development

Successfully merging this pull request may close these issues.

vLLM backend for Cortex
