How-To Configure a ClusterServingRuntime v1.3.1

Prerequisite: Access to the Hybrid Manager UI with AI Factory enabled. See /edb-postgres-ai/1.3/hybrid-manager/ai-factory/.

This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models — specifying container image, resource settings, environment variables, and supported model formats.

For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving — see Model Serving in Hybrid Manager.

Goal

Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.

Estimated time

5–10 minutes.

What you will accomplish

  • Define a ClusterServingRuntime YAML manifest.
  • Apply it to your Kubernetes cluster.
  • Enable reusable serving configuration for one or more models.

What this unlocks

  • Supports consistent deployment of models using a standard runtime definition.
  • Allows for centralized control over serving images and resource profiles.
  • Completes a required step for deploying NVIDIA NIM containers with KServe.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • Access to container image registry with the desired model server image.
  • NVIDIA GPU node pool configured (if using GPU-based models).
  • (If required) Kubernetes secret containing API keys (for example, an NGC API key from build.nvidia.com); see the example command after this list.
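
The runtime manifest in step 1 reads its NVIDIA API key from a secret named nvidia-nim-secrets with a key NGC_API_KEY. The following is a minimal sketch of creating that secret; the placeholder value and the target namespace are assumptions, and the secret must exist in the namespace where your InferenceService pods will run:

kubectl create secret generic nvidia-nim-secrets \
  --namespace default \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>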

For background concepts, see Model Serving in Hybrid Manager.

Steps

1. Create ClusterServingRuntime YAML

Create a file named ClusterServingRuntime.yaml.

Example:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Key fields explained:

  • containers.image: The model server container image (e.g., an NVIDIA NIM image).
  • resources: CPU, memory, and (for GPU-backed models) GPU requests and limits; see the GPU example after this list.
  • NGC_API_KEY: Environment variable populated from a Kubernetes secret, used to authenticate NVIDIA NIM models.
  • supportedModelFormats: The model format names this runtime can serve. An InferenceService selects this runtime by referencing one of these names in its modelFormat field.
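
The example above requests CPU and memory only. If the model runs on the NVIDIA GPU node pool from the prerequisites, the container also needs a GPU request. A minimal sketch, assuming the NVIDIA device plugin exposes GPUs as the nvidia.com/gpu resource and that one GPU is sufficient for this model:

      resources:
        limits:
          cpu: "12"
          memory: 64Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "12"
          memory: 64Gi
          nvidia.com/gpu: "1"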

2. Apply the ClusterServingRuntime

Run:

kubectl apply -f ClusterServingRuntime.yaml
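
If the manifest is accepted, kubectl confirms creation with output similar to the following (the name comes from metadata.name in your manifest):

clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created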

3. Verify deployed ClusterServingRuntime

Run:

kubectl get ClusterServingRuntime

Output:

NAME                                      AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3    1m

You can inspect full details with:

kubectl get ClusterServingRuntime <name> -o yaml
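
To list just the model format names the runtime advertises (you will reference one of them in step 4), you can, for example, query the spec with a JSONPath expression; the runtime name below is the one from the example manifest:

kubectl get clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'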

4. Reference runtime in InferenceService

When you create your InferenceService, reference this runtime:

runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct

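For context, here is a minimal sketch of where these fields sit in an InferenceService manifest. The metadata (my-nim-llama, the default namespace) is a placeholder, and model source details are omitted; the full deployment flow is covered in the guide linked below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-nim-llama        # placeholder name
  namespace: default        # must contain the nvidia-nim-secrets secret
spec:
  predictor:
    model:
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
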
See Deploy an NVIDIA NIM container with KServe.

Notes

  • Runtimes are reusable — you can deploy multiple models referencing the same ClusterServingRuntime.
  • Use meaningful names and version fields in supportedModelFormats for traceability.
  • You can update a runtime by editing and re-applying the YAML, as shown after this list.
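
For example, assuming the file and runtime name used in this guide, either re-apply the edited manifest or edit the live object directly:

kubectl apply -f ClusterServingRuntime.yaml
kubectl edit clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3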

Next steps