A Flask-based microservice for generating text and image embeddings using SentenceTransformers models. Supports text-only models (default) and multimodal CLIP models for cross-modal search. Optimized for offline operation with models bundled in the Docker image.
- Offline Operation: Models are pre-downloaded during build, no internet required at runtime
- Fast Cold Starts: Models bundled in image eliminate download time
- Configurable Models: Use any SentenceTransformers model via build arguments
- Multimodal Support: Optional CLIP models for image + text embeddings in unified vector space
- ~7,500 tokens/sec: Optimized throughput on CPU with cached tokenizer
- Batch API: Process up to 100 texts or 20 images in a single request
- Google Cloud Run Ready: Optimized for serverless deployment
The service supports customizing the embedding model at build time:
| Argument | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `multi-qa-MiniLM-L6-cos-v1` | SentenceTransformers model for generating embeddings |
| `TOKENIZER_MODEL` | `sentence-transformers/multi-qa-MiniLM-L6-cos-v1` | HuggingFace tokenizer model (should match embedding model) |
| Model | Size | Use Case |
|---|---|---|
| `multi-qa-MiniLM-L6-cos-v1` (default) | ~90MB | Question answering, semantic search |
| `all-MiniLM-L6-v2` | ~80MB | General purpose, fast inference |
| `all-mpnet-base-v2` | ~420MB | High quality, slower inference |
| `paraphrase-multilingual-MiniLM-L12-v2` | ~470MB | Multilingual support (50+ languages) |
| Model | Size | Embedding Dim | Notes |
|---|---|---|---|
| `clip-ViT-L-14` | ~890MB | 768 | Best accuracy (75.4% ImageNet) |
| `clip-ViT-B-16` | ~587MB | 512 | Good balance |
| `clip-ViT-B-32` | ~338MB | 512 | Fastest CLIP model |
See SentenceTransformers documentation for more models.
```bash
docker build -t vector-embedder-microservice .

# Using all-MiniLM-L6-v2 (general purpose)
docker build \
  --build-arg EMBEDDING_MODEL=all-MiniLM-L6-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
  -t vector-embedder-microservice .

# Using all-mpnet-base-v2 (higher quality)
docker build \
  --build-arg EMBEDDING_MODEL=all-mpnet-base-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/all-mpnet-base-v2 \
  -t vector-embedder-microservice .

# Using multilingual model
docker build \
  --build-arg EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2 \
  --build-arg TOKENIZER_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 \
  -t vector-embedder-microservice .
```
Build with a CLIP model for cross-modal image/text search:
```bash
# Build multimodal image with CLIP
docker build \
  --build-arg EMBEDDING_MODEL=clip-ViT-L-14 \
  --build-arg TOKENIZER_MODEL= \
  -t vector-embedder-microservice-multimodal .
```
Note: CLIP models don't need a separate tokenizer (leave `TOKENIZER_MODEL` empty). Memory requirements increase to ~2GB.
Public Docker images are automatically built and published to GitHub Container Registry (ghcr.io) whenever changes are pushed to the main branch.
```bash
# Pull the latest image (no authentication needed for public images)
docker pull ghcr.io/OWNER/REPO:latest

# Run the container
docker run -d \
  -e PORT=5001 \
  -e VECTOR_EMBEDDER_API_KEY=your-api-key \
  -p 5001:5001 \
  ghcr.io/OWNER/REPO:latest
```
Replace OWNER/REPO with your GitHub username and repository name (e.g., jman/vectorembeddermicroservice).
- `latest` - Latest build from main branch
- `main` - Same as `latest`
- `v1.0.0` - Specific version tags
- `v1.0` - Minor version tags
- `v1` - Major version tags
- `main-sha-abc1234` - Specific commit SHA
After the first build, you need to make the package public:
- Go to your GitHub repository
- Click Packages in the right sidebar
- Click on your package name
- Click Package settings (bottom of right sidebar)
- Scroll to Danger Zone
- Click Change visibility → Public
- Type the package name to confirm
Once public, anyone can pull the image without authentication.
The Docker image is automatically built and published by GitHub Actions:
Automatic triggers:
- Push to main → Builds `latest` and `main` tags
- Git tags (e.g., `v1.0.0`) → Builds versioned tags
- Pull requests → Builds image but doesn't push
Manual builds: You can trigger a manual build with custom model selection:
- Go to Actions tab in GitHub
- Click Build and Publish Docker Image workflow
- Click Run workflow
- Optionally specify custom embedding and tokenizer models
- Click Run workflow
Creating versioned releases:
```bash
# Tag a version
git tag v1.0.0
git push origin v1.0.0

# This automatically builds and publishes:
# - ghcr.io/OWNER/REPO:v1.0.0
# - ghcr.io/OWNER/REPO:v1.0
# - ghcr.io/OWNER/REPO:v1
# - ghcr.io/OWNER/REPO:latest (if on main branch)
```
To build images with different embedding models:
Via GitHub Actions (recommended):
- Go to Actions → Build and Publish Docker Image
- Click Run workflow
- Set custom model parameters:
  - Embedding model: `all-mpnet-base-v2`
  - Tokenizer model: `sentence-transformers/all-mpnet-base-v2`
- Run workflow
This creates a tagged image with your custom model that you can reference by commit SHA.
You have two options for deploying to Google Cloud Run:
Deploy directly from the public ghcr.io image:
```bash
gcloud run deploy vector-embedder-microservice \
  --image ghcr.io/OWNER/REPO:latest \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --allow-unauthenticated \
  --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key
```
This pulls the pre-built image from GitHub Container Registry; no build required.
If you prefer to use Google's registry:
```bash
# Build locally
docker build -t vector-embedder-microservice .

# Tag for Google Artifact Registry
docker tag vector-embedder-microservice \
  us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice

# Push to registry
docker push us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice

# Deploy to Cloud Run
gcloud run deploy vector-embedder-microservice \
  --image us-central1-docker.pkg.dev/YOUR-PROJECT-ID/models/vector-embedder-microservice \
  --region us-central1 \
  --memory 1Gi \
  --cpu 2 \
  --set-env-vars VECTOR_EMBEDDER_API_KEY=your-api-key
```
Replace YOUR-PROJECT-ID with your Google Cloud project ID.
Minimum:
- Memory: 512MB
- CPU: 1 vCPU
- Disk: 600MB
- Concurrency: 8 requests

Recommended:
- Memory: 1GB
- CPU: 2 vCPU
- Disk: 600MB-1GB (depending on model)
- Concurrency: 2 requests (2 workers × 1 thread; see the worker-configuration sketch below)

Multimodal (CLIP) models:
- Memory: 2GB+
- CPU: 4 vCPU
- Disk: 1GB+
- Concurrency: 4 requests (4 workers × 1 thread)
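The worker and thread counts above map directly onto a WSGI server configuration. A minimal sketch, assuming the service runs under gunicorn (the file name and values below are illustrative, mirroring the recommended text-model tier):

```python
# gunicorn.conf.py - illustrative sketch; assumes the service is served by gunicorn.
# Values mirror the recommended text-model tier above (2 workers x 1 thread).
bind = "0.0.0.0:5001"
workers = 2      # one model copy per worker process
threads = 1      # keep workers single-threaded for CPU-bound inference
timeout = 240    # allow time for the model to load on a cold start
```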
Benchmarked on a single CPU thread (Intel/AMD circa 2024) with the default multi-qa-MiniLM-L6-cos-v1 model:
| Text Size | Throughput | Latency |
|---|---|---|
| 200 tokens | ~7,500 tok/s | 27ms |
| 500 tokens | ~7,500 tok/s | 67ms |
| 1000 tokens (2 chunks) | ~7,700 tok/s | 130ms |
Key factors:
- Throughput is consistent at ~7,500 tokens/sec regardless of text length
- Latency scales linearly with token count
- Texts >512 tokens are split into chunks (each chunk = one model forward pass); see the chunking sketch below
- Use `/embeddings/batch` for multiple texts to reduce per-request overhead
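To make the chunking behavior concrete, here is a minimal sketch of the chunk-and-average approach, assuming a SentenceTransformers model with a matching HuggingFace tokenizer (illustrative only, not the service's exact code):

```python
# Illustrative sketch of chunked embedding (not the service's exact implementation).
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import numpy as np

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

def embed_long_text(text: str, max_tokens: int = 512) -> np.ndarray:
    # Tokenize once, then split the token ids into <=512-token chunks.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
    # One forward pass per chunk, then average into a single vector.
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)
```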
Run `python benchmark_embeddings.py` for detailed benchmarks on your hardware.
```bash
# Run container without network access
docker run --network none \
  -e PORT=5001 \
  -e VECTOR_EMBEDDER_API_KEY=test123 \
  -p 5001:5001 \
  vector-embedder-microservice

# Test the endpoint
curl -X POST http://localhost:5001/embeddings \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test123" \
  -d '{"text": "This is a test sentence"}'
```
Health endpoints for container orchestration and monitoring. No authentication required.
Liveness Probe: GET /health
Returns 200 immediately if the server is running. Use for Kubernetes/Cloud Run liveness probes.
```bash
curl http://localhost:5001/health
```

```json
{"status": "ok"}
```

Readiness Probe: GET /health/ready
Returns 200 when the model is loaded and ready to serve requests. Returns 503 if still initializing.
```bash
curl http://localhost:5001/health/ready
```

```json
{
  "status": "ready",
  "model": "multi-qa-MiniLM-L6-cos-v1",
  "supports_images": false
}
```

Cloud Run Configuration:
```bash
# In your Cloud Run service YAML or via gcloud:
--startup-cpu-boost \
--startup-probe-path=/health/ready \
--startup-probe-initial-delay=0 \
--startup-probe-timeout=240 \
--startup-probe-period=10 \
--liveness-probe-path=/health \
--liveness-probe-initial-delay=0 \
--liveness-probe-timeout=5 \
--liveness-probe-period=30
```
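Outside of Cloud Run, the same readiness endpoint can gate traffic in custom tooling. A minimal polling sketch (URL and timing values are placeholders):

```python
# Poll /health/ready until the model has loaded (values are placeholders).
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:5001", timeout: int = 240) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health/ready", timeout=5).status_code == 200:
                return  # model loaded, safe to send traffic
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(2)
    raise TimeoutError("service did not become ready in time")

wait_until_ready()
```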
Endpoint: POST /embeddings
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "text": "Your text to embed"
}
```

Response:

```json
{
  "embeddings": [[0.123, -0.456, 0.789, ...]]
}
```

Endpoint: POST /embeddings/batch
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "texts": ["First text", "Second text", "Third text"]
}
```

Response:

```json
{
  "embeddings": [[[...]], [[...]], [[...]]]
}
```

Note: Output order matches input order — `embeddings[i]` corresponds to `texts[i]`.
Limits: Maximum 100 texts per request (configurable via MAX_TEXTS_PER_BATCH).
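For reference, calling the two text endpoints from Python might look like this (base URL and API key are placeholders):

```python
# Minimal client sketch for the text endpoints (values are placeholders).
import requests

BASE_URL = "http://localhost:5001"
HEADERS = {"Content-Type": "application/json", "X-API-Key": "your-api-key"}

# Single text
resp = requests.post(f"{BASE_URL}/embeddings", headers=HEADERS,
                     json={"text": "Your text to embed"})
resp.raise_for_status()
vector = resp.json()["embeddings"][0]

# Batch of texts (up to 100 per request)
resp = requests.post(f"{BASE_URL}/embeddings/batch", headers=HEADERS,
                     json={"texts": ["First text", "Second text", "Third text"]})
resp.raise_for_status()
vectors = resp.json()["embeddings"]  # order matches the input order
```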
Endpoint: POST /embeddings/image
Requires a multimodal model (e.g., `clip-ViT-L-14`). Returns 501 if using a text-only model.
Headers:
- `Content-Type: application/json`
- `X-API-Key: <your-api-key>`
Request Body:
```json
{
  "image": "<base64-encoded-image>"
}
```

Response:

```json
{
  "embeddings": [[0.123, -0.456, 0.789, ...]]
}
```

Endpoint: POST /embeddings/image/batch
Request Body:
```json
{
  "images": ["<base64-img1>", "<base64-img2>", "<base64-img3>"]
}
```

Response:

```json
{
  "embeddings": [[[...]], [[...]], [[...]]]
}
```

Limits: Maximum 20 images per request (configurable via `MAX_IMAGES_PER_BATCH`). Maximum 10MB per image (configurable via `MAX_IMAGE_SIZE`).
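Calling the image endpoint from Python follows the same pattern; the file path and API key below are placeholders:

```python
# Minimal image-embedding client sketch (file path and API key are placeholders).
import base64
import requests

BASE_URL = "http://localhost:5001"
HEADERS = {"Content-Type": "application/json", "X-API-Key": "your-api-key"}

with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(f"{BASE_URL}/embeddings/image", headers=HEADERS,
                     json={"image": image_b64})
resp.raise_for_status()
image_vector = resp.json()["embeddings"][0]
```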
When using CLIP models, text and image embeddings are in the same vector space. You can:
- Search images using text queries
- Search text using image queries
- Compare similarity between any text/image combination
Example: Embed text "a cat on a couch" and an image of a cat on a couch — they'll have high cosine similarity.
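Similarity between the returned vectors can be computed client-side with plain cosine similarity. A minimal sketch (the vectors shown are dummy values standing in for real API responses):

```python
# Cosine similarity between two embedding vectors returned by the service.
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would come from /embeddings and /embeddings/image
# (see the client sketches above); dummy values are used here.
text_vec = [0.12, 0.30, 0.51]
image_vec = [0.10, 0.28, 0.52]
print(cosine_similarity(text_vec, image_vec))  # higher for matching text/image pairs
```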
When working with documents that contain both text and images:
1. Embed each modality separately
   - Text → `POST /embeddings`
   - Images → `POST /embeddings/image`
   - Store both vectors linked to the same document for maximum flexibility
2. Store all vectors in the same index — Text and image embeddings share the same vector space and are directly comparable.
3. For combined document embeddings, you have options:
   - Separate vectors (recommended): Store text and image embeddings as distinct vectors, both linked to the document. Allows querying by either modality.
   - Averaged embedding: Combine via `(text_emb + image_emb) / 2`. Simple but may dilute signal.
   - Weighted combination: Use `0.7 * text_emb + 0.3 * image_emb` (adjust weights based on which modality matters more for your use case); see the sketch below.
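A small sketch of the combination strategies above; the 0.7/0.3 weights come from the example, while re-normalizing the result is an added assumption to keep cosine similarities on a comparable scale:

```python
# Combining a text and an image embedding into a single document vector.
# The 0.7/0.3 weights are the example values from the list above; re-normalizing
# afterwards is an added assumption so cosine similarities stay comparable.
import numpy as np

def combine(text_emb, image_emb, text_weight: float = 0.7) -> np.ndarray:
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = np.asarray(image_emb, dtype=float)
    combined = text_weight * text_emb + (1.0 - text_weight) * image_emb
    return combined / np.linalg.norm(combined)

averaged = combine([0.1, 0.5, 0.2], [0.3, 0.1, 0.4], text_weight=0.5)  # simple average
weighted = combine([0.1, 0.5, 0.2], [0.3, 0.1, 0.4])                   # 0.7 / 0.3 split
```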
Cross-modal similarity example (measured with clip-ViT-B-32):
| Text Query | Image | Cosine Similarity |
|---|---|---|
| "a solid red color" | Red square | 0.294 |
| "a solid red color" | Blue square | 0.233 |
| "a solid blue color" | Blue square | 0.293 |
| "a solid blue color" | Red square | 0.237 |
Matching text-image pairs show ~25% higher similarity than non-matching pairs, enabling effective cross-modal retrieval.
| Variable | Required | Default | Description |
|---|---|---|---|
| `VECTOR_EMBEDDER_API_KEY` | No | `abc123` | API key for authentication |
| `PORT` | No | `5001` | Port to run the service on |
| `MAX_BATCH_SIZE` | No | `8` | Chunks per `encode()` call. Lower for large models to reduce memory. |
| `MAX_TEXTS_PER_BATCH` | No | `100` | Maximum texts allowed per `/embeddings/batch` request. |
| `MAX_IMAGES_PER_BATCH` | No | `20` | Maximum images allowed per `/embeddings/image/batch` request. |
| `MAX_IMAGE_SIZE` | No | `10485760` | Maximum image size in bytes (default 10MB). |
| `EMBEDDING_MODEL` | No | From build arg | Override model at runtime (not recommended) |
| `TOKENIZER_MODEL` | No | From build arg | Override tokenizer at runtime (not recommended) |
Note: `EMBEDDING_MODEL` and `TOKENIZER_MODEL` are set during build. Only override them at runtime if the desired models are already cached in the image.
The project includes comprehensive unit tests for both the embedding logic and API endpoints.
Install development dependencies:

```bash
pip install -r requirements-dev.txt
```

Run all tests:

```bash
pytest
```

Run tests with coverage report:

```bash
pytest --cov=. --cov-report=html
```

Run specific test file:

```bash
pytest test_embeddings.py
pytest test_main.py
```
View coverage report:
After running tests with coverage, open htmlcov/index.html in your browser to see a detailed coverage report.
- `test_embeddings.py`: Tests for embedding generation and text chunking logic
  - Model loading configuration
  - Text chunking with various lengths
  - Embedding generation and averaging
  - Edge cases and error handling
- `test_image_embeddings.py`: Tests for image embedding functions
  - Image decoding and validation
  - Image embedding generation
  - Batch image processing
  - Multimodal model detection
- `test_main.py`: Tests for Flask API endpoints (an illustrative example is sketched after this list)
  - Authentication and API key validation
  - Request/response format validation
  - Error handling (missing fields, invalid data)
  - HTTP method validation
  - Image endpoint tests (501 for text-only models)
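For orientation, an endpoint test along these lines might look as follows (illustrative only: the module name `main` and the exact rejection status code are assumptions, and the repository's real fixtures may differ):

```python
# Illustrative sketch of an endpoint test (the real tests in test_main.py may differ;
# "main" as the Flask module name is an assumption).
import pytest
from main import app

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_embeddings_requires_api_key(client):
    # A request without X-API-Key should be rejected.
    resp = client.post("/embeddings", json={"text": "hello"})
    assert resp.status_code in (401, 403)
```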
Tests run automatically on:
- Push to main/develop branches
- Pull requests to main/develop
- Manual workflow dispatch
The CI pipeline:
- Tests against Python 3.9, 3.10, and 3.11
- Runs linting checks with flake8
- Executes full test suite with pytest
- Generates and uploads coverage reports
- Archives coverage HTML report as artifact
View test results in the Actions tab of the GitHub repository.