ArchieIndian/cortex



Run inference at scale

Cortex is an open source platform for large-scale inference workloads.


Model serving infrastructure

  • Supports deploying TensorFlow, PyTorch, sklearn and other models as realtime or batch APIs.
  • Ensures high availability with availability zones and automated instance restarts.
  • Runs inference on on-demand instances or spot instances with on-demand backups.
  • Autoscales to handle production workloads with support for overprovisioning.

Configure a cluster

# cluster.yaml
region: us-east-1
instance_type: g4dn.xlarge
min_instances: 10
max_instances: 100
spot: true

Spin up on your AWS or GCP account

$ cortex cluster up --config cluster.yaml
○ configuring autoscaling ✓
○ configuring networking ✓
○ configuring logging ✓
cortex is ready!

Reproducible deployments

  • Package dependencies, code, and configuration for reproducible deployments.
  • Configure compute, autoscaling, and networking for each API.
  • Integrate with your data science platform or CI/CD system.
  • Deploy custom Docker images or use the pre-built defaults.
  • Test locally before deploying to a cluster.

Define an API

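class PythonPredictor:
    def __init__(self, config):
        # Imported lazily, when the predictor is constructed.
        from transformers import pipeline

        # Load the model once at startup, not on every request.
        self.model = pipeline(task="text-generation")

    def predict(self, payload):
        # payload is the parsed JSON request body, e.g. {"text": "hello world"}.
        return self.model(payload["text"])[0]


requirements = ["tensorflow", "transformers"]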

Configure an API

api_spec = {
    "name": "text-generator",
    "kind": "RealtimeAPI",
    "compute": {
        "gpu": 1,
        "mem": "8Gi"
    },
    "autoscaling": {
        "min_replicas": 1,
        "max_replicas": 10
    }
}
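
Because the API spec is a plain Python dictionary, it can be sanity-checked before deployment. The sketch below is illustrative only: `check_api_spec` is a hypothetical helper, and the rules it applies (non-empty `name` and `kind`, `min_replicas` not above `max_replicas`) are assumptions made for this example, not Cortex's own validation logic.

```python
def check_api_spec(spec):
    """Return a list of problems found in an API spec dict (empty if none).

    Hypothetical helper: the rules below are assumptions for illustration.
    """
    problems = []

    # Assumed rule: every spec carries a non-empty name and kind.
    for key in ("name", "kind"):
        if not spec.get(key):
            problems.append(f"missing field: {key}")

    # Assumed rule: replica bounds are non-negative and ordered.
    scaling = spec.get("autoscaling", {})
    min_replicas = scaling.get("min_replicas", 1)
    max_replicas = scaling.get("max_replicas", min_replicas)
    if min_replicas < 0:
        problems.append("min_replicas must not be negative")
    if min_replicas > max_replicas:
        problems.append("min_replicas exceeds max_replicas")

    return problems


api_spec = {
    "name": "text-generator",
    "kind": "RealtimeAPI",
    "compute": {"gpu": 1, "mem": "8Gi"},
    "autoscaling": {"min_replicas": 1, "max_replicas": 10},
}

print(check_api_spec(api_spec))
```

A check like this can run in a CI job so that a malformed spec is caught before it reaches the cluster.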

Scalable machine learning APIs

  • Scale to handle production workloads with request-based autoscaling.
  • Stream performance metrics and logs to any monitoring tool.
  • Serve many models efficiently with multi-model caching.
  • Use rolling updates to update APIs without downtime.
  • Configure traffic splitting for A/B testing.
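
To give a feel for what request-based autoscaling means, here is a minimal sketch of one common approach: pick a replica count that keeps the number of in-flight requests per replica near a target, then clamp it to the configured bounds. The function name and the `target_per_replica` parameter are assumptions for this example; the source does not describe the exact formula Cortex uses.

```python
import math


def desired_replicas(in_flight_requests, target_per_replica, min_replicas, max_replicas):
    """Return a replica count that keeps per-replica load near the target.

    Illustrative only: a generic request-based scaling rule, not Cortex's algorithm.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")

    # Replicas needed so that no replica exceeds the target load.
    needed = math.ceil(in_flight_requests / target_per_replica)

    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


# 35 in-flight requests at a target of 4 per replica needs ceil(8.75) = 9 replicas.
print(desired_replicas(35, 4, min_replicas=1, max_replicas=10))  # prints 9
```

With the `min_replicas` and `max_replicas` values from the API spec above (1 and 10), an idle API would stay at 1 replica and a traffic spike would be capped at 10.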

Deploy to your cluster

import cortex
cx = cortex.client("aws")
cx.create_api(api_spec, predictor=PythonPredictor, requirements=requirements)
# creating https://example.com/text-generator

Consume your API

$ curl https://example.com/text-generator -X POST -H "Content-Type: application/json" -d '{"text": "hello world"}'
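
The same request can be made from Python. The sketch below uses only the standard library and mirrors the `curl` call above: a POST with a JSON body and a `Content-Type: application/json` header. The helper names `build_request` and `generate` are made up for this example, and the shape of the response depends on what the predictor returns, so it is parsed as JSON and returned as-is.

```python
import json
import urllib.request


def build_request(url, text):
    """Build a POST request carrying {"text": ...} as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(url, text, timeout=30.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, once the API is live at the URL printed by the deploy step:
# result = generate("https://example.com/text-generator", "hello world")
```

Splitting request construction from sending keeps the payload easy to inspect or unit-test without touching the network.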

Get started
