Deploy open models with a custom vLLM container
Although the various Vertex AI model serving options are sufficient for many use cases, you might need to use your own container images to serve models on Vertex AI. This document describes how to use a vLLM custom container image to serve models on Vertex AI on CPUs, GPUs, or TPUs. For more information about vLLM supported models, see the vLLM documentation.
The vLLM API server implements the OpenAI API protocol, but it does not support the Vertex AI request and response requirements. Therefore, you must use a Vertex AI Raw Inference Request to get inferences from models deployed to Vertex AI using a Prediction Endpoint. For more information about the Raw Prediction method in the Vertex AI Python SDK, see the Python SDK documentation.
You can source models from both Hugging Face and Cloud Storage. This approach offers flexibility, which lets you take advantage of the community-driven model hub (Hugging Face) and the optimized data transfer and security capabilities of Cloud Storage for internal model management or fine-tuned versions.
vLLM downloads the model from Hugging Face if a Hugging Face access token is provided. Otherwise, vLLM assumes the model is available on local disk. The custom container image also lets Vertex AI download the model from Cloud Storage in addition to Hugging Face.
Before you begin
In your Google Cloud project, enable the Vertex AI and Artifact Registry APIs.
gcloud services enable aiplatform.googleapis.com \
    artifactregistry.googleapis.com

Configure the Google Cloud CLI with your project ID and initialize the Vertex AI SDK.
PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION"

import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)

gcloud config set project PROJECT_ID

Create a Docker repository in Artifact Registry.
gcloud artifacts repositories create DOCKER_REPOSITORY \
    --repository-format=docker \
    --location=LOCATION \
    --description="Vertex AI Docker repository"

Optional: If you download models from Hugging Face, obtain a Hugging Face access token.
- Create a Hugging Face account if you don't have one.
- For gated models like Llama 3.2, request and receive access on Hugging Face before proceeding.
- Generate an access token:
  - Go to Your Profile > Settings > Access Tokens.
  - Select New Token.
  - Specify a name and a role of at least Read.
  - Select Generate a token.
- Save this token for the deployment steps.
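To confirm that the token can read the model before you build and deploy, you can optionally check it from Python with the huggingface_hub library. The following is a minimal sketch; the library call, the gated-model check, and the example model ID are assumptions for illustration, not part of the deployment steps.

# Optional: verify that the access token can read a gated model.
# Assumes the huggingface_hub package is installed (pip install huggingface_hub).
from huggingface_hub import model_info

hf_token = "your-hugging-face-auth-token"  # token generated in the previous steps

# Raises an error (for example, a gated-repo error) if the token lacks access.
info = model_info("meta-llama/Llama-3.2-3B", token=hf_token)
print(f"Token can access: {info.id}")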
Prepare container build files
The following Dockerfile builds the vLLM custom container image for GPUs,
TPUs, and CPUs. This custom container downloads models from Hugging Face or
Cloud Storage.
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Install gcloud SDK
RUN apt-get update && \
    apt-get install -y apt-utils git apt-transport-https gnupg ca-certificates curl \
    && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
    && apt-get update -y && apt-get install google-cloud-cli -y \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/vllm

# Copy entrypoint.sh to the container
COPY ./entrypoint.sh /workspace/vllm/vertexai/entrypoint.sh
RUN chmod +x /workspace/vllm/vertexai/entrypoint.sh

ENTRYPOINT ["/workspace/vllm/vertexai/entrypoint.sh"]
Build the custom container image using Cloud Build. The following
cloudbuild.yaml configuration file shows how to build the image for multiple
platforms using the same Dockerfile.
steps:
- name: 'gcr.io/cloud-builders/docker'
  automapSubstitutions: true
  script: |
    #!/usr/bin/env bash
    set -euo pipefail
    device_type_param=${_DEVICE_TYPE}
    device_type=${device_type_param,,}
    base_image=${_BASE_IMAGE}
    image_name="vllm-${_DEVICE_TYPE}"
    if [[ $device_type == "cpu" ]]; then
      echo "Quietly building open source vLLM CPU container image"
      git clone https://github.com/vllm-project/vllm.git
      cd vllm && DOCKER_BUILDKIT=1 docker build -t $base_image -f docker/Dockerfile.cpu . -q
      cd ..
    fi
    echo "Quietly building container image for: $device_type"
    docker build -t $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name --build-arg BASE_IMAGE=$base_image . -q
    docker push $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name
substitutions:
  _DEVICE_TYPE: gpu
  _BASE_IMAGE: vllm/vllm-openai
  _REPOSITORY: my-docker-repo
The files are available in the googlecloudplatform/vertex-ai-samples GitHub
repository. Clone the repository to use them:
git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git
Build and push the container image
Build the custom container image using Cloud Build by submitting the
cloudbuild.yaml file. Use substitutions to specify the target device type,
which can be GPU, TPU, or CPU, and the corresponding base image.
GPU
DEVICE_TYPE="gpu"
BASE_IMAGE="vllm/vllm-openai"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
TPU
DEVICE_TYPE="tpu"
BASE_IMAGE="vllm/vllm-tpu:nightly"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
CPU
DEVICE_TYPE="cpu"
BASE_IMAGE="vllm-cpu-base"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
After the build finishes, configure Docker to authenticate with Artifact Registry:
gcloud auth configure-docker LOCATION-docker.pkg.dev --quiet
Upload model to Model Registry and deploy
Upload your model to Vertex AI Model Registry, create an endpoint, and deploy the model by completing these steps. This example uses Llama 3.2 3B, but you can adapt it for other models.
Define model and deployment variables. Set the DOCKER_URI variable to the image that you built in the previous step (for example, for GPU):

DOCKER_URI = f"LOCATION-docker.pkg.dev/PROJECT_ID/DOCKER_REPOSITORY/vllm-gpu"

Define variables for the Hugging Face token and model properties. For example, for GPU deployment:
hf_token = "your-hugging-face-auth-token"
model_name = "gpu-llama3_2_3B-serve-vllm"
model_id = "meta-llama/Llama-3.2-3B"
machine_type = "g2-standard-8"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1

Upload the model to Model Registry.
The upload_model function varies slightly depending on the device type because of different vLLM arguments and environment variables.

from google.cloud import aiplatform

def upload_model_gpu(model_name, model_id, hf_token, accelerator_count, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
        "--gpu-memory-utilization=0.9",
        "--enable-prefix-caching",
        f"--tensor-parallel-size={accelerator_count}",
    ]
    env_vars = {
        "HF_TOKEN": hf_token,
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",
    }
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_tpu(model_name, model_id, hf_token, tpu_count, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
        "--enable-prefix-caching",
        f"--tensor-parallel-size={tpu_count}",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_cpu(model_name, model_id, hf_token, docker_uri):
    vllm_args = [
        "python3",
        "-m",
        "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        "--max-model-len=2048",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

# Example for GPU:
vertexai_model = upload_model_gpu(model_name, model_id, hf_token, accelerator_count, DOCKER_URI)

Create an endpoint.
endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

Deploy the model to the endpoint. Model deployment might take 20 to 30 minutes.
# Example for GPU:
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_accelerator_duty_cycle=60,
)

For TPUs, omit the accelerator_type and accelerator_count parameters, and use autoscaling_target_request_count_per_minute=60. For CPUs, omit the accelerator_type and accelerator_count parameters, and use autoscaling_target_cpu_utilization=60.
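As an illustration of those differences, the following sketch shows what the TPU and CPU deploy calls might look like. The machine types are examples only, not values prescribed by this guide; choose types that match your quota, and make sure the tensor parallel size you passed at upload time matches the chip or core count of the machine.

# Example for TPU (machine type is an assumption; ct5lp-hightpu-4t has 4 TPU chips,
# so upload the model with tpu_count=4):
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type="ct5lp-hightpu-4t",
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_request_count_per_minute=60,
)

# Example for CPU (machine type is an assumption):
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type="n2-standard-16",
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_cpu_utilization=60,
)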
Load models from Cloud Storage
The custom container downloads the model from a Cloud Storage location instead of downloading it from Hugging Face. When you use Cloud Storage:
- Set the model_id parameter in the upload_model function to a Cloud Storage URI, for example, gs://my-bucket/my-models/llama_3_2_3B.
- Omit the HF_TOKEN variable from env_vars when you call upload_model.
- When you call model.deploy, specify a service_account that has permissions to read from the Cloud Storage bucket.
Create an IAM Service Account for Cloud Storage access
If your model is in Cloud Storage, create a service account that the Vertex AI Prediction endpoint can use to access the model artifacts.
SERVICE_ACCOUNT_NAME = "vertexai-endpoint-sa"
SERVICE_ACCOUNT_EMAIL = f"{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
    --display-name="Vertex AI Endpoint Service Account"

# Grant storage read permission
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectViewer"
When you deploy, pass the service account email to the deploy method: service_account=SERVICE_ACCOUNT_EMAIL.
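Putting these steps together, a GPU deployment that loads the model from Cloud Storage might look like the following sketch. It reuses the variables, endpoint, and service account defined earlier in this document; the bucket path is a placeholder, and the argument values are examples rather than required settings.

from google.cloud import aiplatform

# Cloud Storage URI instead of a Hugging Face model ID (placeholder path).
model_id = "gs://my-bucket/my-models/llama_3_2_3B"

vllm_args = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--host=0.0.0.0", "--port=8080",
    f"--model={model_id}",
    "--max-model-len=2048",
    "--gpu-memory-utilization=0.9",
    "--enable-prefix-caching",
    f"--tensor-parallel-size={accelerator_count}",
]

# No HF_TOKEN in env_vars because the model is read from Cloud Storage.
vertexai_model = aiplatform.Model.upload(
    display_name=model_name,
    serving_container_image_uri=DOCKER_URI,
    serving_container_args=vllm_args,
    serving_container_ports=[8080],
    serving_container_predict_route="/v1/completions",
    serving_container_health_route="/health",
    serving_container_environment_variables={
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",
    },
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
    serving_container_deployment_timeout=1800,
)

# Deploy with the service account that can read the bucket.
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_accelerator_duty_cycle=60,
    service_account=SERVICE_ACCOUNT_EMAIL,
)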
Get predictions using endpoint
After you successfully deploy the model to the endpoint, verify the model
response using raw_predict.
import json

PROMPT = "Distance of moon from earth is "
request_body = json.dumps(
    {
        "prompt": PROMPT,
        "temperature": 0.0,
    },
)

raw_response = endpoint.raw_predict(
    body=request_body, headers={"Content-Type": "application/json"}
)

assert raw_response.status_code == 200

result = json.loads(raw_response.text)
for choice in result["choices"]:
    print(choice)
Example output:
{
  "index": 0,
  "text": "384,400 km. The moon is 1/4 of the earth's",
  "logprobs": null,
  "finish_reason": "length",
  "stop_reason": null,
  "prompt_logprobs": null
}
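Because the container serves the OpenAI-compatible completions route, you can include other standard completion parameters in the request body. The following sketch is illustrative only; the parameter values are assumptions, and parameter support can vary by vLLM version.

# Example request with additional sampling parameters (illustrative values).
request_body = json.dumps(
    {
        "prompt": PROMPT,
        "temperature": 0.7,
        "max_tokens": 128,
        "top_p": 0.95,
    },
)

raw_response = endpoint.raw_predict(
    body=request_body, headers={"Content-Type": "application/json"}
)
print(json.loads(raw_response.text)["choices"][0]["text"])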
What's next
- Choose an open model serving option
- Use open models using Model as a Service (MaaS)
- Deploy open models from Model Garden
- Deploy open models with prebuilt containers