A Flask-based microservice for generating text and image embeddings using SentenceTransformers models. Supports text-only models (default) and multimodal CLIP models for cross-modal search. Optimized for offline operation with models bundled in the Docker image.
- Offline Operation: Models are pre-downloaded during build, no internet required at runtime
- Fast Cold Starts: Models bundled in image eliminate download time
- Configurable Models: Use any SentenceTransformers model via build arguments
- Multimodal Support: Optional CLIP models for image + text embeddings in unified vector space
- ~7,500 tokens/sec: Optimized throughput on CPU with cached tokenizer
- Batch API: Process up to 100 texts or 20 images in a single request
- Google Cloud Run Ready: Optimized for serverless deployment
The service supports customizing the embedding model at build time:
| Argument | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `multi-qa-MiniLM-L6-cos-v1` | SentenceTransformers model for generating embeddings |
| `TOKENIZER_MODEL` | `sentence-transformers/multi-qa-MiniLM-L6-cos-v1` | HuggingFace tokenizer model (should match embedding model) |
| Model | Size | Use Case |
|---|---|---|
| `multi-qa-MiniLM-L6-cos-v1` (default) | ~90MB | Question answering, semantic search |
| `all-MiniLM-L6-v2` | ~80MB | General purpose, fast inference |
| `all-mpnet-base-v2` | ~420MB | High quality, slower inference |
| `paraphrase-multilingual-MiniLM-L12-v2` | ~470MB | Multilingual support (50+ languages) |
| Model | Size | Embedding Dim | Notes |
|---|---|---|---|
| `clip-ViT-L-14` | ~890MB | 768 | Best accuracy (75.4% ImageNet) |
| `clip-ViT-B-16` | ~587MB | 512 | Good balance |
| `clip-ViT-B-32` | ~338MB | 512 | Fastest CLIP model |
See SentenceTransformers documentation for more models.
```bash
docker build -t vector-embedder-microservice .

# Using all-MiniLM-L6-v2 (general purpose)
docker build \
  --build-arg EMBEDDING_MODEL=all-MiniLM-L6-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
  -t vector-embedder-microservice .

# Using all-mpnet-base-v2 (higher quality)
docker build \
  --build-arg EMBEDDING_MODEL=all-mpnet-base-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/all-mpnet-base-v2 \
  -t vector-embedder-microservice .

# Using multilingual model
docker build \
  --build-arg EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 \
  -t vector-embedder-microservice .
```
Build with a CLIP model for cross-modal image/text search:
```bash
# Build multimodal image with CLIP
docker build \
  --build-arg EMBEDDING_MODEL=clip-ViT-L-14 \
  --build-arg TOKENIZER_MODEL= \
  -t vector-embedder-microservice-multimodal .
```
Note: CLIP models don't need a separate tokenizer (leave `TOKENIZER_MODEL` empty). Memory requirements increase to ~2GB.
Public Docker images are automatically built and published to GitHub Container Registry (ghcr.io) whenever changes are pushed to the main branch.
```bash
# Pull the latest image (no authentication needed for public images)
docker pull ghcr.io/OWNER/REPO:latest

# Run the container
docker run -d \
  -e PORT=5001 \
  -e VECTOR_EMBEDDER_API_KEY=your-api-key \
  -p 5001:5001 \
  ghcr.io/OWNER/REPO:latest
```
Replace OWNER/REPO with your GitHub username and repository name (e.g., jman/vectorembeddermicroservice).
- `latest` - Latest build from main branch
- `main` - Same as `latest`
- `v1.0.0` - Specific version tags
- `v1.0` - Minor version tags
- `v1` - Major version tags
- `main-sha-abc1234` - Specific commit SHA
After the first build, you need to make the package public:
- Go to your GitHub repository
- Click Packages in the right sidebar
- Click on your package name
- Click Package settings (bottom of right sidebar)
- Scroll to Danger Zone
- Click Change visibility → Public
- Type the package name to confirm
Once public, anyone can pull the image without authentication.
The Docker image is automatically built and published by GitHub Actions:
Automatic triggers:
- Push to main → Builds `latest` and `main` tags
- Git tags (e.g., `v1.0.0`) → Builds versioned tags
- Pull requests → Builds image but doesn't push
Manual builds: You can trigger a manual build with custom model selection:
- Go to Actions tab in GitHub
- Click Build and Publish Docker Image workflow
- Click Run workflow
- Optionally specify custom embedding and tokenizer models
- Click Run workflow
Creating versioned releases:
```bash
# Tag a version
git tag v1.0.0
git push origin v1.0.0

# This automatically builds and publishes:
# - ghcr.io/OWNER/REPO:v1.0.0
# - ghcr.io/OWNER/REPO:v1.0
# - ghcr.io/OWNER/REPO:v1
# - ghcr.io/OWNER/REPO:latest (if on main branch)
```
To build images with different embedding models:
Via GitHub Actions (recommended):
- Go to Actions → Build and Publish Docker Image
- Click Run workflow
- Set custom model parameters:
  - Embedding model: `all-mpnet-base-v2`
  - Tokenizer model: `sentence-transformers/all-mpnet-base-v2`
- Run workflow
This creates a tagged image with your custom model that you can reference by commit SHA.
You have two options for deploying to Google Cloud Run:
Deploy directly from the public ghcr.io image:
```bash
gcloud run deploy vector-embedder-microservice \
  --image ghcr.io/OWNER/REPO:latest \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --allow-unauthenticated \
  --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key
```
This pulls the pre-built image from GitHub Container Registry; no build required.
If you prefer to use Google's registry:
```bash
# Build locally
docker build -t vector-embedder-microservice .

# Tag for Google Artifact Registry
docker tag vector-embedder-microservice \
  us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice

# Push to registry
docker push us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice

# Deploy to Cloud Run
gcloud run deploy vector-embedder-microservice \
  --image us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key
```
Replace YOUR-PROJECT-ID with your Google Cloud project ID.
Minimum:
- Memory: 512MB
- CPU: 1 vCPU
- Disk: 600MB
- Concurrency: 8 requests

Recommended:
- Memory: 1GB
- CPU: 2 vCPU
- Disk: 600MB-1GB (depending on model)
- Concurrency: 2 requests (2 workers × 1 thread; see the worker-configuration sketch below)

Multimodal (CLIP) models:
- Memory: 2GB+
- CPU: 4 vCPU
- Disk: 1GB+
- Concurrency: 4 requests (4 workers × 1 thread)
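The worker and thread counts above map directly onto a WSGI server configuration. A minimal sketch, assuming the service runs under gunicorn (the file name and values below are illustrative, mirroring the recommended text-model tier):

```python
# gunicorn.conf.py - illustrative sketch; assumes the service is served by gunicorn.
# Values mirror the recommended text-model tier above (2 workers x 1 thread).
bind = "0.0.0.0:5001"
workers = 2      # one model copy per worker process
threads = 1      # keep workers single-threaded for CPU-bound inference
timeout = 240    # allow time for the model to load on a cold start
```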
Benchmarked on a single CPU thread (Intel/AMD circa 2024) with the default multi-qa-MiniLM-L6-cos-v1 model:
| Text Size | Throughput | Latency |
|---|---|---|
| 200 tokens | ~7,500 tok/s | 27ms |
| 500 tokens | ~7,500 tok/s | 67ms |
| 1000 tokens (2 chunks) | ~7,700 tok/s | 130ms |
Key factors:
- Throughput is consistent at ~7,500 tokens/sec regardless of text length
- Latency scales linearly with token count
- Texts >512 tokens are split into chunks (each chunk = one model forward pass); see the chunking sketch below
- Use `/embeddings/batch` for multiple texts to reduce per-request overhead
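To make the chunking behavior concrete, here is a minimal sketch of the chunk-and-average approach, assuming a SentenceTransformers model with a matching HuggingFace tokenizer (illustrative only, not the service's exact code):

```python
# Illustrative sketch of chunked embedding (not the service's exact implementation).
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import numpy as np

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

def embed_long_text(text: str, max_tokens: int = 512) -> np.ndarray:
    # Tokenize once, then split the token ids into <=512-token chunks.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
    # One forward pass per chunk, then average into a single vector.
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)
```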
Run `python benchmark_embeddings.py` for detailed benchmarks on your hardware.
```bash
# Run container without network access
docker run --network none \
  -e PORT=5001 \
  -e VECTOR_EMBEDDER_API_KEY=test123 \
  -p 5001:5001 \
  vector-embedder-microservice

# Test the endpoint
curl -X POST http://localhost:5001/embeddings \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test123" \
  -d '{"text": "This is a test sentence"}'
```
Health endpoints for container orchestration and monitoring. No authentication required.
Liveness Probe: GET /health
Returns 200 immediately if the server is running. Use for Kubernetes/Cloud Run liveness probes.
```bash
curl http://localhost:5001/health
```

```json
{"status": "ok"}
```

Readiness Probe: GET /health/ready
Returns 200 when the model is loaded and ready to serve requests. Returns 503 if still initializing.
```bash
curl http://localhost:5001/health/ready
```

```json
{
  "status": "ready",
  "model": "multi-qa-MiniLM-L6-cos-v1",
  "supports_images": false
}
```

Cloud Run Configuration:
```bash
# In your Cloud Run service YAML or via gcloud:
--startup-cpu-boost \
--startup-probe-path=/health/ready \
--startup-probe-initial-delay=0 \
--startup-probe-timeout=240 \
--startup-probe-period=10 \
--liveness-probe-path=/health \
--liveness-probe-initial-delay=0 \
--liveness-probe-timeout=5 \
--liveness-probe-period=30
```
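Outside of Cloud Run, the same readiness endpoint can gate traffic in custom tooling. A minimal polling sketch (URL and timing values are placeholders):

```python
# Poll /health/ready until the model has loaded (values are placeholders).
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:5001", timeout: int = 240) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health/ready", timeout=5).status_code == 200:
                return  # model loaded, safe to send traffic
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(2)
    raise TimeoutError("service did not become ready in time")

wait_until_ready()
```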
Endpoint: POST /embeddings
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "text": "Your text to embed"
}
```

Response:

```json
{
  "embeddings": [[0.123, -0.456, 0.789, ...]]
}
```

Endpoint: POST /embeddings/batch
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "texts": ["First text", "Second text", "Third text"]
}
```

Response:

```json
{
  "embeddings": [[[...]], [[...]], [[...]]]
}
```

Note: Output order matches input order — `embeddings[i]` corresponds to `texts[i]`.
Limits: Maximum 100 texts per request (configurable via MAX_TEXTS_PER_BATCH).
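For reference, calling the two text endpoints from Python might look like this (base URL and API key are placeholders):

```python
# Minimal client sketch for the text endpoints (values are placeholders).
import requests

BASE_URL = "http://localhost:5001"
HEADERS = {"Content-Type": "application/json", "X-API-Key": "your-api-key"}

# Single text
resp = requests.post(f"{BASE_URL}/embeddings", headers=HEADERS,
                     json={"text": "Your text to embed"})
resp.raise_for_status()
vector = resp.json()["embeddings"][0]

# Batch of texts (up to 100 per request)
resp = requests.post(f"{BASE_URL}/embeddings/batch", headers=HEADERS,
                     json={"texts": ["First text", "Second text", "Third text"]})
resp.raise_for_status()
vectors = resp.json()["embeddings"]  # order matches the input order
```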
Endpoint: POST /embeddings/image
Requires a multimodal model (e.g., `clip-ViT-L-14`). Returns 501 if using a text-only model.
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "image": "<base64-encoded-image>"
}
```

Response:

```json
{
  "embeddings": [[0.123, -0.456, 0.789, ...]]
}
```

Endpoint: POST /embeddings/image/batch
Request Body:
```json
{
  "images": ["<base64-img1>", "<base64-img2>", "<base64-img3>"]
}
```

Response:

```json
{
  "embeddings": [[[...]], [[...]], [[...]]]
}
```

Limits: Maximum 20 images per request (configurable via `MAX_IMAGES_PER_BATCH`). Maximum 10MB per image (configurable via `MAX_IMAGE_SIZE`).
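Calling the image endpoint from Python follows the same pattern; the file path and API key below are placeholders:

```python
# Minimal image-embedding client sketch (file path and API key are placeholders).
import base64
import requests

BASE_URL = "http://localhost:5001"
HEADERS = {"Content-Type": "application/json", "X-API-Key": "your-api-key"}

with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(f"{BASE_URL}/embeddings/image", headers=HEADERS,
                     json={"image": image_b64})
resp.raise_for_status()
image_vector = resp.json()["embeddings"][0]
```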
When using CLIP models, text and image embeddings are in the same vector space. You can:
- Search images using text queries
- Search text using image queries
- Compare similarity between any text/image combination
Example: Embed text "a cat on a couch" and an image of a cat on a couch — they'll have high cosine similarity.
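Similarity between the returned vectors can be computed client-side with plain cosine similarity. A minimal sketch (the vectors shown are dummy values standing in for real API responses):

```python
# Cosine similarity between two embedding vectors returned by the service.
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would come from /embeddings and /embeddings/image
# (see the client sketches above); dummy values are used here.
text_vec = [0.12, 0.30, 0.51]
image_vec = [0.10, 0.28, 0.52]
print(cosine_similarity(text_vec, image_vec))  # higher for matching text/image pairs
```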
When working with documents that contain both text and images:
1. Embed each modality separately
   - Text → `POST /embeddings`
   - Images → `POST /embeddings/image`
   - Store both vectors linked to the same document for maximum flexibility
2. Store all vectors in the same index — Text and image embeddings share the same vector space and are directly comparable.
3. For combined document embeddings, you have options:
   - Separate vectors (recommended): Store text and image embeddings as distinct vectors, both linked to the document. Allows querying by either modality.
   - Averaged embedding: Combine via `(text_emb + image_emb) / 2`. Simple but may dilute signal.
   - Weighted combination: Use `0.7 * text_emb + 0.3 * image_emb` (adjust weights based on which modality matters more for your use case); see the sketch below.
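A small sketch of the combination strategies above; the 0.7/0.3 weights come from the example, while re-normalizing the result is an added assumption to keep cosine similarities on a comparable scale:

```python
# Combining a text and an image embedding into a single document vector.
# The 0.7/0.3 weights are the example values from the list above; re-normalizing
# afterwards is an added assumption so cosine similarities stay comparable.
import numpy as np

def combine(text_emb, image_emb, text_weight: float = 0.7) -> np.ndarray:
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = np.asarray(image_emb, dtype=float)
    combined = text_weight * text_emb + (1.0 - text_weight) * image_emb
    return combined / np.linalg.norm(combined)

averaged = combine([0.1, 0.5, 0.2], [0.3, 0.1, 0.4], text_weight=0.5)  # simple average
weighted = combine([0.1, 0.5, 0.2], [0.3, 0.1, 0.4])                   # 0.7 / 0.3 split
```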
Cross-modal similarity example (measured with clip-ViT-B-32):
| Text Query | Image | Cosine Similarity |
|---|---|---|
| "a solid red color" | Red square | 0.294 |
| "a solid red color" | Blue square | 0.233 |
| "a solid blue color" | Blue square | 0.293 |
| "a solid blue color" | Red square | 0.237 |
Matching text-image pairs show ~25% higher similarity than non-matching pairs, enabling effective cross-modal retrieval.
| Variable | Required | Default | Description |
|---|---|---|---|
| `VECTOR_EMBEDDER_API_KEY` | No | `abc123` | API key for authentication |
| `PORT` | No | `5001` | Port to run the service on |
| `MAX_BATCH_SIZE` | No | `8` | Chunks per `encode()` call. Lower for large models to reduce memory. |
| `MAX_TEXTS_PER_BATCH` | No | `100` | Maximum texts allowed per `/embeddings/batch` request. |
| `MAX_IMAGES_PER_BATCH` | No | `20` | Maximum images allowed per `/embeddings/image/batch` request. |
| `MAX_IMAGE_SIZE` | No | `10485760` | Maximum image size in bytes (default 10MB). |
| `EMBEDDING_MODEL` | No | From build arg | Override model at runtime (not recommended) |
| `TOKENIZER_MODEL` | No | From build arg | Override tokenizer at runtime (not recommended) |
Note: `EMBEDDING_MODEL` and `TOKENIZER_MODEL` are set during build. Only override them at runtime if the desired models are already cached in the image.
The project includes comprehensive unit tests for both the embedding logic and API endpoints.
Install development dependencies:

```bash
pip install -r requirements-dev.txt
```

Run all tests:

```bash
pytest
```

Run tests with coverage report:

```bash
pytest --cov=. --cov-report=html
```

Run specific test file:

```bash
pytest test_embeddings.py
pytest test_main.py
```
View coverage report:
After running tests with coverage, open htmlcov/index.html in your browser to see a detailed coverage report.
- `test_embeddings.py`: Tests for embedding generation and text chunking logic
  - Model loading configuration
  - Text chunking with various lengths
  - Embedding generation and averaging
  - Edge cases and error handling
- `test_image_embeddings.py`: Tests for image embedding functions
  - Image decoding and validation
  - Image embedding generation
  - Batch image processing
  - Multimodal model detection
- `test_main.py`: Tests for Flask API endpoints (an illustrative example is sketched after this list)
  - Authentication and API key validation
  - Request/response format validation
  - Error handling (missing fields, invalid data)
  - HTTP method validation
  - Image endpoint tests (501 for text-only models)
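For orientation, an endpoint test along these lines might look as follows (illustrative only: the module name `main` and the exact rejection status code are assumptions, and the repository's real fixtures may differ):

```python
# Illustrative sketch of an endpoint test (the real tests in test_main.py may differ;
# "main" as the Flask module name is an assumption).
import pytest
from main import app

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_embeddings_requires_api_key(client):
    # A request without X-API-Key should be rejected.
    resp = client.post("/embeddings", json={"text": "hello"})
    assert resp.status_code in (401, 403)
```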
Tests run automatically on:
- Push to main/develop branches
- Pull requests to main/develop
- Manual workflow dispatch
The CI pipeline:
- Tests against Python 3.9, 3.10, and 3.11
- Runs linting checks with flake8
- Executes full test suite with pytest
- Generates and uploads coverage reports
- Archives coverage HTML report as artifact
View test results in the Actions tab of the GitHub repository.