Vector Embedder Microservice

A Flask-based microservice for generating text and image embeddings using SentenceTransformers models. Supports text-only models (default) and multimodal CLIP models for cross-modal search. Optimized for offline operation with models bundled in the Docker image.

Features

  • Offline Operation: Models are pre-downloaded during the build; no internet access is required at runtime
  • Fast Cold Starts: Bundling models in the image eliminates download time at startup
  • Configurable Models: Use any SentenceTransformers model via build arguments
  • Multimodal Support: Optional CLIP models for image + text embeddings in unified vector space
  • ~7,500 tokens/sec: Optimized throughput on CPU with cached tokenizer
  • Batch API: Process up to 100 texts or 20 images in a single request
  • Google Cloud Run Ready: Optimized for serverless deployment

Build Arguments

The service supports customizing the embedding model at build time:

Available Build Arguments

| Argument | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | multi-qa-MiniLM-L6-cos-v1 | SentenceTransformers model for generating embeddings |
| TOKENIZER_MODEL | sentence-transformers/multi-qa-MiniLM-L6-cos-v1 | HuggingFace tokenizer model (should match the embedding model) |

Popular Model Options

Text-Only Models

| Model | Size | Use Case |
|---|---|---|
| multi-qa-MiniLM-L6-cos-v1 (default) | ~90MB | Question answering, semantic search |
| all-MiniLM-L6-v2 | ~80MB | General purpose, fast inference |
| all-mpnet-base-v2 | ~420MB | High quality, slower inference |
| paraphrase-multilingual-MiniLM-L12-v2 | ~470MB | Multilingual support (50+ languages) |

Multimodal Models (Image + Text)

| Model | Size | Embedding Dim | Notes |
|---|---|---|---|
| clip-ViT-L-14 | ~890MB | 768 | Best accuracy (75.4% ImageNet) |
| clip-ViT-B-16 | ~587MB | 512 | Good balance |
| clip-ViT-B-32 | ~338MB | 512 | Fastest CLIP model |

See SentenceTransformers documentation for more models.

Building the Image

Default Build (multi-qa-MiniLM-L6-cos-v1)

docker build -t vector-embedder-microservice .

Custom Model Build

# Using all-MiniLM-L6-v2 (general purpose)
docker build \
 --build-arg EMBEDDING_MODEL=all-MiniLM-L6-v2 \
 --build-arg TOKENIZER_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
 -t vector-embedder-microservice .
# Using all-mpnet-base-v2 (higher quality)
docker build \
 --build-arg EMBEDDING_MODEL=all-mpnet-base-v2 \
 --build-arg TOKENIZER_MODEL=sentence-transformers/all-mpnet-base-v2 \
 -t vector-embedder-microservice .
# Using multilingual model
docker build \
 --build-arg EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2 \
 --build-arg TOKENIZER_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 \
 -t vector-embedder-microservice .

Multimodal Build (CLIP)

Build with a CLIP model for cross-modal image/text search:

# Build multimodal image with CLIP
docker build \
 --build-arg EMBEDDING_MODEL=clip-ViT-L-14 \
 --build-arg TOKENIZER_MODEL= \
 -t vector-embedder-microservice-multimodal .

Note: CLIP models don't need a separate tokenizer (set TOKENIZER_MODEL to an empty value). The memory requirement increases to ~2GB.

Using Pre-built Public Images

Public Docker images are automatically built and published to GitHub Container Registry (ghcr.io) whenever changes are pushed to the main branch.

Pull and Run from ghcr.io

# Pull the latest image (no authentication needed for public images)
docker pull ghcr.io/OWNER/REPO:latest
# Run the container
docker run -d \
 -e PORT=5001 \
 -e VECTOR_EMBEDDER_API_KEY=your-api-key \
 -p 5001:5001 \
 ghcr.io/OWNER/REPO:latest

Replace OWNER/REPO with your GitHub username and repository name (e.g., jman/vectorembeddermicroservice).

Available Tags

  • latest - Latest build from main branch
  • main - Same as latest
  • v1.0.0 - Specific version tags
  • v1.0 - Minor version tags
  • v1 - Major version tags
  • main-sha-abc1234 - Specific commit SHA

Making the Image Public

After the first build, you need to make the package public:

  1. Go to your GitHub repository
  2. Click Packages in the right sidebar
  3. Click on your package name
  4. Click Package settings (bottom of right sidebar)
  5. Scroll to Danger Zone
  6. Click Change visibility → Public
  7. Type the package name to confirm

Once public, anyone can pull the image without authentication.

Automated Builds

The Docker image is automatically built and published by GitHub Actions:

Automatic triggers:

  • Push to main → Builds latest and main tags
  • Git tags (e.g., v1.0.0) → Builds versioned tags
  • Pull requests → Builds image but doesn't push

Manual builds: You can trigger a manual build with custom model selection:

  1. Go to Actions tab in GitHub
  2. Click Build and Publish Docker Image workflow
  3. Click Run workflow
  4. Optionally specify custom embedding and tokenizer models
  5. Click Run workflow

Creating versioned releases:

# Tag a version
git tag v1.0.0
git push origin v1.0.0
# This automatically builds and publishes:
# - ghcr.io/OWNER/REPO:v1.0.0
# - ghcr.io/OWNER/REPO:v1.0
# - ghcr.io/OWNER/REPO:v1
# - ghcr.io/OWNER/REPO:latest (if on main branch)

Building Custom Model Variants

To build images with different embedding models:

Via GitHub Actions (recommended):

  1. Go to Actions → Build and Publish Docker Image
  2. Click Run workflow
  3. Set custom model parameters:
    • Embedding model: all-mpnet-base-v2
    • Tokenizer model: sentence-transformers/all-mpnet-base-v2
  4. Run workflow

This creates a tagged image with your custom model that you can reference by commit SHA.

Deploying to Google Cloud Run

You have two options for deploying to Google Cloud Run:

Option 1: Deploy from GitHub Container Registry (Easiest)

Deploy directly from the public ghcr.io image:

gcloud run deploy vector-embedder-microservice \
 --image ghcr.io/OWNER/REPO:latest \
 --region us-central1 \
 --memory 1Gi \
 --cpu 2 \
 --allow-unauthenticated \
 --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key

This pulls the pre-built image from GitHub Container Registry; no build step is required.

Option 2: Build and Push to Google Artifact Registry

If you prefer to use Google's registry:

# Build locally
docker build -t vector-embedder-microservice .
# Tag for Google Artifact Registry
docker tag vector-embedder-microservice \
 us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice
# Push to registry
docker push us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice
# Deploy to Cloud Run
gcloud run deploy vector-embedder-microservice \
 --image us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice \
 --region us-central1 \
 --memory 1Gi \
 --cpu 2 \
 --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key

Replace YOUR-PROJECT-ID with your Google Cloud project ID.

Resource Requirements

Minimum (default model)

  • Memory: 512MB
  • CPU: 1 vCPU
  • Disk: 600MB
  • Concurrency: 8 requests

Recommended (production)

  • Memory: 1GB
  • CPU: 2 vCPU
  • Disk: 600MB-1GB (depending on model)
  • Concurrency: 2 requests (2 workers × 1 thread)

High Performance (larger models)

  • Memory: 2GB+
  • CPU: 4 vCPU
  • Disk: 1GB+
  • Concurrency: 4 requests (4 workers × 1 thread)

Performance

Benchmarked on a single CPU thread (Intel/AMD circa 2024) with the default multi-qa-MiniLM-L6-cos-v1 model:

| Text Size | Throughput | Latency |
|---|---|---|
| 200 tokens | ~7,500 tok/s | 27ms |
| 500 tokens | ~7,500 tok/s | 67ms |
| 1000 tokens (2 chunks) | ~7,700 tok/s | 130ms |

Key factors:

  • Throughput is consistent at ~7,500 tokens/sec regardless of text length
  • Latency scales linearly with token count
  • Texts >512 tokens are split into chunks (each chunk = one model forward pass)
  • Use /embeddings/batch for multiple texts to reduce per-request overhead

Run python benchmark_embeddings.py for detailed benchmarks on your hardware.

Testing Offline Capability

# Run container without network access
docker run --network none \
 -e PORT=5001 \
 -e VECTOR_EMBEDDER_API_KEY=test123 \
 -p 5001:5001 \
 vector-embedder-microservice
# Test the endpoint
curl -X POST http://localhost:5001/embeddings \
 -H "Content-Type: application/json" \
 -H "X-API-Key: test123" \
 -d '{"text": "This is a test sentence"}'

API Usage

Health Checks

Health endpoints for container orchestration and monitoring. No authentication required.

Liveness Probe: GET /health

Returns 200 immediately if the server is running. Use for Kubernetes/Cloud Run liveness probes.

curl http://localhost:5001/health
{"status": "ok"}

Readiness Probe: GET /health/ready

Returns 200 when the model is loaded and ready to serve requests. Returns 503 if still initializing.

curl http://localhost:5001/health/ready
{
 "status": "ready",
 "model": "multi-qa-MiniLM-L6-cos-v1",
 "supports_images": false
}
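
For local scripts and tests, a minimal Python sketch that waits for readiness before sending traffic (the base URL is a placeholder for your own deployment, and the timing values are arbitrary assumptions):

import time
import requests

BASE_URL = "http://localhost:5001" # adjust to your deployment

def wait_until_ready(timeout_seconds=240, poll_interval=5):
    """Poll /health/ready until the model is loaded or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            resp = requests.get(f"{BASE_URL}/health/ready", timeout=5)
            if resp.status_code == 200:
                # e.g. {"status": "ready", "model": "...", "supports_images": false}
                return resp.json()
        except requests.ConnectionError:
            pass # container may still be starting
        time.sleep(poll_interval)
    raise TimeoutError("Service did not become ready in time")

print(wait_until_ready())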

Cloud Run Configuration:

# In your Cloud Run service YAML or via gcloud:
--startup-cpu-boost \
--startup-probe-path=/health/ready \
--startup-probe-initial-delay=0 \
--startup-probe-timeout=240 \
--startup-probe-period=10 \
--liveness-probe-path=/health \
--liveness-probe-initial-delay=0 \
--liveness-probe-timeout=5 \
--liveness-probe-period=30

Generate Embeddings

Endpoint: POST /embeddings

Headers:

  • Content-Type: application/json
  • X-API-Key: <your-api-key>

Request Body:

{
 "text": "Your text to embed"
}

Response:

{
 "embeddings": [[0.123, -0.456, 0.789, ...]]
}
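
For programmatic access, a minimal Python sketch using the requests library (the base URL, port, and API key are placeholders for your own deployment):

import requests

BASE_URL = "http://localhost:5001" # adjust to your deployment
API_KEY = "your-api-key" # value of VECTOR_EMBEDDER_API_KEY

resp = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"X-API-Key": API_KEY},
    json={"text": "Your text to embed"},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["embeddings"][0] # the embedding vector for the text
print(len(vector)) # embedding dimension (depends on the model)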

Batch Embeddings

Endpoint: POST /embeddings/batch

Headers:

  • Content-Type: application/json
  • X-API-Key: <your-api-key>

Request Body:

{
 "texts": ["First text", "Second text", "Third text"]
}

Response:

{
 "embeddings": [[[...]], [[...]], [[...]]]
}

Note: Output order matches input order — embeddings[i] corresponds to texts[i].

Limits: Maximum 100 texts per request (configurable via MAX_TEXTS_PER_BATCH).
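
A Python sketch for batching that also splits larger inputs into requests of at most 100 texts to stay under the default MAX_TEXTS_PER_BATCH limit (URL and key are placeholders):

import requests

BASE_URL = "http://localhost:5001"
API_KEY = "your-api-key"
MAX_TEXTS_PER_BATCH = 100 # default server-side limit

def embed_texts(texts):
    """Embed a list of texts, batching requests to respect the server limit."""
    embeddings = []
    for start in range(0, len(texts), MAX_TEXTS_PER_BATCH):
        batch = texts[start:start + MAX_TEXTS_PER_BATCH]
        resp = requests.post(
            f"{BASE_URL}/embeddings/batch",
            headers={"X-API-Key": API_KEY},
            json={"texts": batch},
            timeout=120,
        )
        resp.raise_for_status()
        # Output order matches input order, so results can be extended directly
        embeddings.extend(resp.json()["embeddings"])
    return embeddings

vectors = embed_texts(["First text", "Second text", "Third text"])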

Image Embeddings (Multimodal Only)

Endpoint: POST /embeddings/image

Requires a multimodal model (e.g., clip-ViT-L-14). Returns 501 if using text-only model.

Headers:

  • Content-Type: application/json
  • X-API-Key: <your-api-key>

Request Body:

{
 "image": "<base64-encoded-image>"
}

Response:

{
 "embeddings": [[0.123, -0.456, 0.789, ...]]
}
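
A minimal Python sketch for image embedding, assuming the container was built with a CLIP model (the file path, base URL, and API key are placeholders):

import base64
import requests

BASE_URL = "http://localhost:5001"
API_KEY = "your-api-key"

# Read and base64-encode the image file
with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{BASE_URL}/embeddings/image",
    headers={"X-API-Key": API_KEY},
    json={"image": image_b64},
    timeout=60,
)
resp.raise_for_status() # a text-only build returns 501 here
image_vector = resp.json()["embeddings"][0]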

Batch Image Embeddings (Multimodal Only)

Endpoint: POST /embeddings/image/batch

Request Body:

{
 "images": ["<base64-img1>", "<base64-img2>", "<base64-img3>"]
}

Response:

{
 "embeddings": [[[...]], [[...]], [[...]]]
}

Limits: Maximum 20 images per request (configurable via MAX_IMAGES_PER_BATCH). Maximum 10MB per image (configurable via MAX_IMAGE_SIZE).

Cross-Modal Search

When using CLIP models, text and image embeddings are in the same vector space. You can:

  • Search images using text queries
  • Search text using image queries
  • Compare similarity between any text/image combination

Example: Embed text "a cat on a couch" and an image of a cat on a couch — they'll have high cosine similarity.
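
As a sketch of cross-modal scoring, cosine similarity between a text vector and an image vector can be computed client-side; the vectors below are placeholders standing in for responses from /embeddings and /embeddings/image on a CLIP build, and numpy is assumed for the math:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholders: in practice these come from POST /embeddings ("a cat on a couch")
# and POST /embeddings/image (a photo of a cat on a couch).
text_vector = [0.1, 0.3, -0.2]
image_vector = [0.2, 0.25, -0.1]

print(cosine_similarity(text_vector, image_vector))

Matching pairs score noticeably higher than non-matching ones; see the measured examples below.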

Best Practices for Mixed Text and Image Data

When working with documents that contain both text and images:

1. Embed each modality separately

  • Text → POST /embeddings
  • Images → POST /embeddings/image
  • Store both vectors linked to the same document for maximum flexibility

2. Store all vectors in the same index — Text and image embeddings share the same vector space and are directly comparable.

3. For combined document embeddings, you have options:

  • Separate vectors (recommended): Store text and image embeddings as distinct vectors, both linked to the document. Allows querying by either modality.
  • Averaged embedding: Combine via (text_emb + image_emb) / 2. Simple but may dilute signal.
  • Weighted combination: Use 0.7 * text_emb + 0.3 * image_emb (adjust weights based on which modality matters more for your use case).
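
A minimal Python sketch of the averaged and weighted combinations above (the placeholder vectors and the re-normalization step are assumptions about downstream indexing, not documented service behavior):

import numpy as np

def combine_embeddings(text_emb, image_emb, text_weight=0.7):
    """Combine a text and an image embedding into one document vector.
    text_weight=0.5 gives the simple average; 0.7/0.3 favors text."""
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = np.asarray(image_emb, dtype=float)
    combined = text_weight * text_emb + (1.0 - text_weight) * image_emb
    # Re-normalizing keeps cosine similarity on the same scale as the
    # original vectors (an assumption, adjust for your vector index).
    return combined / np.linalg.norm(combined)

doc_vector = combine_embeddings([0.1, 0.3, -0.2], [0.2, 0.25, -0.1])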

Cross-modal similarity example (measured with clip-ViT-B-32):

| Text Query | Image | Cosine Similarity |
|---|---|---|
| "a solid red color" | Red square | 0.294 |
| "a solid red color" | Blue square | 0.233 |
| "a solid blue color" | Blue square | 0.293 |
| "a solid blue color" | Red square | 0.237 |

Matching text-image pairs show ~25% higher similarity than non-matching pairs, enabling effective cross-modal retrieval.

Configuration

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| VECTOR_EMBEDDER_API_KEY | No | abc123 | API key for authentication |
| PORT | No | 5001 | Port to run the service on |
| MAX_BATCH_SIZE | No | 8 | Chunks per encode() call. Lower this for large models to reduce memory use. |
| MAX_TEXTS_PER_BATCH | No | 100 | Maximum texts allowed per /embeddings/batch request |
| MAX_IMAGES_PER_BATCH | No | 20 | Maximum images allowed per /embeddings/image/batch request |
| MAX_IMAGE_SIZE | No | 10485760 | Maximum image size in bytes (default 10MB) |
| EMBEDDING_MODEL | No | From build arg | Override the model at runtime (not recommended) |
| TOKENIZER_MODEL | No | From build arg | Override the tokenizer at runtime (not recommended) |

Note: EMBEDDING_MODEL and TOKENIZER_MODEL are set during build. Only override at runtime if you have the desired models already cached in the image.

Development

Running Tests Locally

The project includes comprehensive unit tests for both the embedding logic and API endpoints.

Install development dependencies:

pip install -r requirements-dev.txt

Run all tests:

pytest

Run tests with coverage report:

pytest --cov=. --cov-report=html

Run specific test file:

pytest test_embeddings.py
pytest test_main.py

View coverage report: After running tests with coverage, open htmlcov/index.html in your browser to see a detailed coverage report.

Test Structure

  • test_embeddings.py: Tests for embedding generation and text chunking logic

    • Model loading configuration
    • Text chunking with various lengths
    • Embedding generation and averaging
    • Edge cases and error handling
  • test_image_embeddings.py: Tests for image embedding functions

    • Image decoding and validation
    • Image embedding generation
    • Batch image processing
    • Multimodal model detection
  • test_main.py: Tests for Flask API endpoints

    • Authentication and API key validation
    • Request/response format validation
    • Error handling (missing fields, invalid data)
    • HTTP method validation
    • Image endpoint tests (501 for text-only models)

Continuous Integration

Tests run automatically on:

  • Push to main/develop branches
  • Pull requests to main/develop
  • Manual workflow dispatch

The CI pipeline:

  1. Tests against Python 3.9, 3.10, and 3.11
  2. Runs linting checks with flake8
  3. Executes full test suite with pytest
  4. Generates and uploads coverage reports
  5. Archives coverage HTML report as artifact

View test results in the Actions tab of the GitHub repository.
