Model Serving Explained v1.3.1

Model Serving in AI Factory lets you deploy AI models as scalable, production-grade inference services running on your Kubernetes infrastructure.

It provides a Kubernetes-native architecture based on KServe, giving your models the ability to serve predictions and embeddings over network-accessible APIs.

AI Factory Model Serving is optimized to support enterprise-class AI workloads with:

  • GPU-accelerated infrastructure
  • Flexible scaling
  • Integrated observability
  • Sovereign AI alignment — models run under your governance
  • Seamless integration with Gen AI Builder, Knowledge Bases, and other AI Factory pipelines

Before you start

Prerequisites for understanding Model Serving:

  • Familiarity with Kubernetes basics
  • Understanding of KServe and InferenceService
  • Awareness of the Model Library → Model Serving workflow in AI Factory
  • Understanding of Sovereign AI principles — models running under your governance

How Model Serving works

Core stack

Layer                                Purpose
AI Factory                           Provides infrastructure and Model Serving APIs
Hybrid Manager Kubernetes Cluster    Hosts model-serving workloads
KServe                               Manages the model serving lifecycle and APIs
InferenceService                     Deployed model resource
Model Library                        Manages model image versions
GPU Nodes                            Run high-performance model serving pods
User Applications                    Call model endpoints via REST/gRPC

Key components

  • InferenceService — Kubernetes CRD representing a deployed model (a minimal creation sketch follows this list).
  • ServingRuntime / ClusterServingRuntime — Define reusable runtime configurations.
  • Model containers — Currently focused on NVIDIA NIM containers in AI Factory 1.2.
  • Observability — Integrated Prometheus-compatible metrics, Kubernetes logging.
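
The sketch below illustrates the InferenceService CRD by creating a minimal one with the Kubernetes Python client. The service name, namespace, runtime name, and storage URI are placeholder assumptions rather than values shipped with AI Factory; take the actual runtime and image references from your Model Library and Hybrid Manager configuration.

    # Minimal sketch: create a KServe InferenceService via the Kubernetes API.
    # All names, the namespace, and the storageUri are illustrative placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

    inference_service = {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "demo-embedder", "namespace": "models"},
        "spec": {
            "predictor": {
                "model": {
                    "runtime": "nim-runtime",            # a ServingRuntime or ClusterServingRuntime name
                    "modelFormat": {"name": "custom"},   # format label the runtime advertises
                    "storageUri": "oci://registry.example.com/models/embedder:1.0",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="models",
        plural="inferenceservices",
        body=inference_service,
    )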

Supported models

AI Factory Model Serving currently supports NVIDIA NIM containers for:

Model Type          Example Usage
Text Completion     LLM agents, Assistants
Text Embeddings     Knowledge Bases, RAG
Text Reranking      RAG pipelines
Image Embeddings    Multi-modal search
Image OCR           Document extraction

See: Supported Models
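
As a hedged illustration of calling one of these model types, the sketch below requests text embeddings over REST. The endpoint URL is a placeholder, and the /v1/embeddings route and payload assume the OpenAI-compatible convention that many NIM text-embedding containers follow; confirm the exact path and schema in the documentation for the model you deploy.

    # Hedged sketch: request embeddings from a deployed model endpoint.
    # The URL is a placeholder; the route and payload assume an
    # OpenAI-compatible embeddings API, which a given container may vary.
    import requests

    ENDPOINT = "https://demo-embedder.models.example.com/v1/embeddings"

    response = requests.post(
        ENDPOINT,
        json={"model": "demo-embedder", "input": ["What is KServe?"]},
        timeout=30,
    )
    response.raise_for_status()
    vector = response.json()["data"][0]["embedding"]
    print(len(vector), vector[:5])  # dimensionality and the first few components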


Deployment architecture

Applications → Model Endpoints (REST/gRPC) → KServe → GPU-enabled Kubernetes → Model Containers

  • Each model is isolated in its own InferenceService.
  • KServe manages:
      • Model lifecycle (start, stop, update)
      • Scaling (including scale-to-zero)
      • Endpoint routing (REST/gRPC); the sketch after this list shows how to look up a model's endpoint
  • GPU resources are provisioned and scheduled via Hybrid Manager integration.
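
The sketch below shows one way to find where KServe routed a deployed model: read the InferenceService object and print the URL and readiness conditions from its status. The service name and namespace are the same placeholders used earlier on this page.

    # Hedged sketch: find the endpoint URL KServe assigned to an InferenceService.
    # "demo-embedder" and the "models" namespace are illustrative placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    isvc = client.CustomObjectsApi().get_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="models",
        plural="inferenceservices",
        name="demo-embedder",
    )
    status = isvc.get("status", {})
    print("URL:", status.get("url"))  # route managed by KServe
    for condition in status.get("conditions", []):
        print(condition.get("type"), condition.get("status"))  # readiness details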

Patterns of use

Gen AI Builder

  • LLM endpoints power Assistants and Agents (an example request follows this list).
  • Embedding models support hybrid RAG pipelines.
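
For example, an Assistant-style call to a text-completion endpoint might look like the sketch below. It assumes the deployed NIM container exposes an OpenAI-compatible /v1 API, so the standard openai client can be pointed at it by overriding base_url; the URL, model name, and credential handling are placeholders.

    # Hedged sketch: chat completion against a deployed LLM endpoint, assuming
    # an OpenAI-compatible API. URL, model name, and credentials are placeholders.
    from openai import OpenAI

    llm = OpenAI(
        base_url="https://demo-llm.models.example.com/v1",
        api_key="not-used",  # substitute whatever auth your deployment enforces
    )

    reply = llm.chat.completions.create(
        model="demo-llm",
        messages=[{"role": "user", "content": "Summarize what KServe does."}],
        max_tokens=128,
    )
    print(reply.choices[0].message.content)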

Knowledge Bases

  • Embedding models serve vectorization needs.
  • Retrieval and reranking models power semantic search pipelines.

Custom applications

  • Business applications can consume InferenceService endpoints for:
      • Real-time predictions
      • Image analysis
      • Text processing

Best practices

  • Deploy models via the Model Library → Model Serving flow to ensure governance.
  • Use ClusterServingRuntime for reusable runtime configs.
  • Monitor GPU utilization and model latency closely.
  • Test scale-to-zero configurations before relying on them in production (see the sketch after this list).
  • Ensure Model Library tags are versioned and documented.
  • Regularly audit deployed InferenceServices as part of Sovereign AI governance.
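
One such scale-to-zero test is sketched below: it patches the placeholder InferenceService from earlier on this page so its predictor can scale down to zero replicas. Scale-to-zero generally depends on KServe's serverless (Knative-based) deployment mode, so confirm how your Hybrid Manager cluster is configured before adopting it.

    # Hedged sketch: allow the predictor to scale to zero replicas when idle.
    # "demo-embedder" and the "models" namespace are illustrative placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="models",
        plural="inferenceservices",
        name="demo-embedder",
        body={"spec": {"predictor": {"minReplicas": 0}}},
    )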

Summary

Model Serving in AI Factory provides a robust, scalable architecture for serving production AI models:

  • Kubernetes-native serving with KServe
  • GPU acceleration and optimized serving runtimes
  • Integrated observability and governance
  • Tight integration with AI Factory components: Gen AI Builder, Knowledge Bases, custom AI pipelines

Model Serving helps you implement Sovereign AI — with your models, on your infrastructure, under your control.


Next steps


Model Serving gives you a powerful foundation for building intelligent applications and data products — securely, scalably, and under your governance — as part of EDB Postgres® AI.