How-To Configure a ClusterServingRuntime v1.3.1

Prerequisite: Access to the Hybrid Manager UI with AI Factory enabled. See /edb-postgres-ai/1.3/hybrid-manager/ai-factory/.

This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models — specifying container image, resource settings, environment variables, and supported model formats.

For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving — see Model Serving in Hybrid Manager.

Goal

Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.

Estimated time

5–10 minutes.

What you will accomplish

  • Define a ClusterServingRuntime YAML manifest.
  • Apply it to your Kubernetes cluster.
  • Enable reusable serving configuration for one or more models.

What this unlocks

  • Supports consistent deployment of models using a standard runtime definition.
  • Allows for centralized control over serving images and resource profiles.
  • Completes a required step for deploying NVIDIA NIM containers with KServe.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • Access to container image registry with the desired model server image.
  • NVIDIA GPU node pool configured (if using GPU-based models).
  • (If required) Kubernetes secret containing API keys (for example, an NGC API key from build.nvidia.com); see the example command after this list.
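
The runtime manifest in step 1 reads its NVIDIA API key from a secret named nvidia-nim-secrets with a key NGC_API_KEY. The following is a minimal sketch of creating that secret; the placeholder value and the target namespace are assumptions, and the secret must exist in the namespace where your InferenceService pods will run:

kubectl create secret generic nvidia-nim-secrets \
  --namespace default \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>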

For background concepts, see Model Serving in Hybrid Manager.

Steps

1. Create ClusterServingRuntime YAML

Create a file named ClusterServingRuntime.yaml.

Example:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Key fields explained:

  • containers.image: The model server container image (e.g., an NVIDIA NIM image).
  • resources: CPU, memory, and (for GPU-backed models) GPU requests and limits; see the GPU example after this list.
  • NGC_API_KEY: Environment variable populated from a Kubernetes secret, used to authenticate NVIDIA NIM models.
  • supportedModelFormats: The model format names this runtime can serve. An InferenceService selects this runtime by referencing one of these names in its modelFormat field.
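
The example above requests CPU and memory only. If the model runs on the NVIDIA GPU node pool from the prerequisites, the container also needs a GPU request. A minimal sketch, assuming the NVIDIA device plugin exposes GPUs as the nvidia.com/gpu resource and that one GPU is sufficient for this model:

      resources:
        limits:
          cpu: "12"
          memory: 64Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "12"
          memory: 64Gi
          nvidia.com/gpu: "1"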

2. Apply the ClusterServingRuntime

Run:

kubectl apply -f ClusterServingRuntime.yaml
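
If the manifest is accepted, kubectl confirms creation with output similar to the following (the name comes from metadata.name in your manifest):

clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created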

3. Verify deployed ClusterServingRuntime

Run:

kubectl get ClusterServingRuntime

Output:

NAME                                      AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3    1m

You can inspect full details with:

kubectl get ClusterServingRuntime <name> -o yaml
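
To list just the model format names the runtime advertises (you will reference one of them in step 4), you can, for example, query the spec with a JSONPath expression; the runtime name below is the one from the example manifest:

kubectl get clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'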

4. Reference runtime in InferenceService

When you create your InferenceService, reference this runtime:

runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct

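For context, here is a minimal sketch of where these fields sit in an InferenceService manifest. The metadata (my-nim-llama, the default namespace) is a placeholder, and model source details are omitted; the full deployment flow is covered in the guide linked below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-nim-llama        # placeholder name
  namespace: default        # must contain the nvidia-nim-secrets secret
spec:
  predictor:
    model:
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
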
See Deploy an NVIDIA NIM container with KServe.

Notes

  • Runtimes are reusable — you can deploy multiple models referencing the same ClusterServingRuntime.
  • Use meaningful names and version fields in supportedModelFormats for traceability.
  • You can update a runtime by editing and re-applying the YAML, as shown after this list.
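
For example, assuming the file and runtime name used in this guide, either re-apply the edited manifest or edit the live object directly:

kubectl apply -f ClusterServingRuntime.yaml
kubectl edit clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3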

Next steps