The multimodal embeddings model generates 1408-dimension vectors based on the
input you provide, which can include a combination of image, text, and video
data. The embedding vectors can then be used for subsequent tasks like image
classification or video content moderation.
The image embedding vector and text embedding vector are in the same semantic
space with the same dimensionality. Consequently, these vectors can be used
interchangeably for use cases like searching images by text or searching videos
by image.
For text-only embedding use cases, we recommend using the Vertex AI
text-embeddings API instead. For example, the text-embeddings API might be
better for text-based semantic search, clustering, long-form document analysis,
and other text retrieval or question-answering use cases. For more information,
see Get text embeddings.
Supported models
You can get multimodal embeddings by using the following model:
multimodalembedding
Best practices
Consider the following input aspects when using the multimodal embeddings model:
Text in images - The model can distinguish text in images, similar to
optical character recognition (OCR). If you need to distinguish between a
description of the image content and the text within an image, consider
using prompt engineering to specify your target content.
For example: instead of just "cat", specify "picture of a cat" or
"the text 'cat'", depending on your use case.
Embedding similarities - The dot product of embeddings isn't a
calibrated probability. The dot product is a similarity metric and might have
different score distributions for different use cases. Consequently, avoid
using a fixed value threshold to measure quality. Instead, use ranking
approaches for retrieval, or use sigmoid for classification.
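For example, the following minimal Python sketch (using NumPy; the embeddings, and the sigmoid parameters a and b, are placeholders you would supply or fit on your own labeled data) contrasts ranking by dot product with a calibrated sigmoid for classification:

import numpy as np

def rank_by_similarity(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = np.asarray(candidate_embeddings) @ np.asarray(query_embedding)
    # Rank candidates instead of applying a fixed threshold to raw scores.
    return np.argsort(scores)[::-1]

def classify(score, a=1.0, b=0.0):
    """Map a raw dot-product score to (0, 1); a and b must be fit to your own data."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))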
API usage
API limits
The following limits apply when you use the multimodalembedding model for
text, image, and video embeddings:
Text and image data
Maximum number of API requests per minute per project: 120 to 600, depending on the region.
Maximum text length: 32 tokens (approximately 32 words). If the input exceeds 32 tokens, the model internally shortens the input to this length.
Maximum image size: 20 MB. To avoid increased network latency, use smaller images (see the sketch after these limits). Additionally, the model resizes images to 512 x 512 pixel resolution, so you don't need to provide higher-resolution images.
Video data
Audio supported: No. The model doesn't consider audio content when generating video embeddings.
Video formats: AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, and WMV.
Maximum video length (Cloud Storage): No limit. However, only 2 minutes of content can be analyzed at a time.
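Because the model resizes images to 512 x 512 anyway, one way to reduce upload latency is to downscale images before sending them. The following is a minimal sketch that assumes the Pillow library; it returns base64-encoded bytes, which the API accepts through the bytesBase64Encoded image field shown in the Go sample later on this page:

import base64
import io

from PIL import Image

def image_to_base64(path, max_size=(512, 512)):
    """Downscale an image and return base64-encoded JPEG bytes for an embedding request."""
    with Image.open(path) as img:
        img.thumbnail(max_size)  # preserves aspect ratio; never upscales
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")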
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get 300ドル in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the
resourcemanager.projects.create permission. Learn how to grant
roles.
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
Select the tab for how you plan to use the samples on this page:
Java
To use the Java samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Node.js
To use the Node.js samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Python
To use the Python samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Optional. Review pricing for this
feature. Pricing for embeddings depends on the type of data you send
(such as image or text), and also depends on the mode you use for certain
data types (such as Video Plus, Video Standard, or Video Essential).
If requests exceed your project's quota for the multimodalembedding model, they
might fail with an error like the following:
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for
aiplatform.googleapis.com/online_prediction_requests_per_base_model with base
model: multimodalembedding. Please submit a quota increase request.
If this is the first time you receive this error, use the Google Cloud console
to request a quota adjustment for your project. Filter for the quota named in
the error message before requesting your adjustment.
If you have already sent a quota adjustment request, wait before sending another
request. If you need to further increase the quota, repeat the quota adjustment
request with your justification for a sustained quota adjustment.
Specify lower-dimension embeddings
By default, an embedding request returns a 1408-dimension float vector for each
data type. You can also specify lower-dimension embeddings (128, 256, or 512
float vectors) for text and image data. This option lets you optimize for
latency and storage, or for quality, based on how you plan to use the
embeddings. Lower-dimension embeddings reduce storage needs and latency for
subsequent embedding tasks (like search or recommendation), while
higher-dimension embeddings offer greater accuracy for the same tasks.
REST
To request lower-dimension embeddings, add the parameters.dimension field.
The parameter accepts one of the following values: 128, 256, 512, or
1408. The response includes an embedding of that dimension.
Before using any of the request data,
make the following replacements:
LOCATION: Your project's region. For example,
us-central1, europe-west2, or asia-northeast3. For a list
of available regions, see
Generative AI on Vertex AI locations.
TEXT: The target text to get embeddings for. For example,
a cat.
EMBEDDING_DIMENSION: The number of embedding dimensions. Lower values offer decreased
latency when using these embeddings for subsequent tasks, while higher values offer better
accuracy. Available values: 128,
256, 512, and 1408 (default).
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
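For example, a minimal text-only request body, following the same instances and parameters schema used in the code samples below, might look like this (saved as request.json):

{
  "instances": [
    {
      "text": "TEXT"
    }
  ],
  "parameters": {
    "dimension": EMBEDDING_DIMENSION
  }
}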
Python

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

# TODO(developer): Try different dimensions: 128, 256, 512, 1408
embedding_dimension = 128

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

embeddings = model.get_embeddings(
    image=image,
    contextual_text="Colosseum",
    dimension=embedding_dimension,
)

print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [0.0622573346, -0.0406507477, 0.0260440577, ...]
# Text Embedding: [0.27469793, -0.146258667, 0.0222803634, ...]
Go
import (
    "context"
    "encoding/json"
    "fmt"
    "io"

    aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
    aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
    "google.golang.org/api/option"
    "google.golang.org/protobuf/encoding/protojson"
    "google.golang.org/protobuf/types/known/structpb"
)

// generateWithLowerDimension shows how to generate lower-dimensional embeddings for text and image inputs.
func generateWithLowerDimension(w io.Writer, project, location string) error {
    // location = "us-central1"
    ctx := context.Background()
    apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
    client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
    if err != nil {
        return fmt.Errorf("failed to construct API client: %w", err)
    }
    defer client.Close()

    model := "multimodalembedding@001"
    endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

    // This is the input to the model's prediction call. For schema, see:
    // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
    instance, err := structpb.NewValue(map[string]any{
        "image": map[string]any{
            // Image input can be provided either as a Google Cloud Storage URI or as
            // base64-encoded bytes using the "bytesBase64Encoded" field.
            "gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
        },
        "text": "Colosseum",
    })
    if err != nil {
        return fmt.Errorf("failed to construct request payload: %w", err)
    }

    // TODO(developer): Try different dimensions: 128, 256, 512, 1408
    outputDimensionality := 128
    params, err := structpb.NewValue(map[string]any{
        "dimension": outputDimensionality,
    })
    if err != nil {
        return fmt.Errorf("failed to construct request params: %w", err)
    }

    req := &aiplatformpb.PredictRequest{
        Endpoint: endpoint,
        // The model supports only 1 instance per request.
        Instances:  []*structpb.Value{instance},
        Parameters: params,
    }

    resp, err := client.Predict(ctx, req)
    if err != nil {
        return fmt.Errorf("failed to generate embeddings: %w", err)
    }

    instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
    if err != nil {
        return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
    }

    // For response schema, see:
    // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
    var instanceEmbeddings struct {
        ImageEmbeddings []float32 `json:"imageEmbedding"`
        TextEmbeddings  []float32 `json:"textEmbedding"`
    }
    if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
        return fmt.Errorf("failed to unmarshal JSON: %w", err)
    }

    imageEmbedding := instanceEmbeddings.ImageEmbeddings
    textEmbedding := instanceEmbeddings.TextEmbeddings

    fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
    fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
    // Example response:
    // Text Embedding (length=128): [0.27469793 -0.14625867 0.022280363 ... ]
    // Image Embedding (length=128): [0.06225733 -0.040650766 0.02604402 ... ]

    return nil
}
Send an embedding request (image and text)
Use the following code samples to send an embedding request with image and text
data. The samples show how to send a request with both data types, but you can
also use the service with an individual data type.
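For reference, here is a minimal Python sketch adapted from the lower-dimension sample above, but without the dimension parameter; the project ID, image URI, and contextual text are placeholders:

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

# Omit image or contextual_text to send only one data type.
embeddings = model.get_embeddings(image=image, contextual_text="Colosseum")
print(f"Image embedding: {embeddings.image_embedding}")
print(f"Text embedding: {embeddings.text_embedding}")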
Send an embedding request (video, image, and text)
When sending an embedding request, you can specify an input video alone, or
you can specify a combination of video, image, and text data.
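As an illustration, the following is a minimal Python sketch of a combined request. It assumes the Vertex AI SDK's Video and VideoSegmentConfig classes and the video_embeddings response field, used alongside the classes from the samples above; the Cloud Storage paths and text are placeholders:

import vertexai
from vertexai.vision_models import (
    Image,
    MultiModalEmbeddingModel,
    Video,
    VideoSegmentConfig,
)

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file("gs://your-bucket/your-image.png")
video = Video.load_from_file("gs://your-bucket/your-video.mp4")

embeddings = model.get_embeddings(
    image=image,
    video=video,
    contextual_text="your description",
    # Embed only the first two minutes of the video.
    video_segment_config=VideoSegmentConfig(end_offset_sec=120),
)

print(f"Image embedding: {embeddings.image_embedding}")
print(f"Text embedding: {embeddings.text_embedding}")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video segment {video_embedding.start_offset_sec}-"
        f"{video_embedding.end_offset_sec}s: {video_embedding.embedding}"
    )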
Video embedding modes
There are three modes you can use with video embeddings: Essential, Standard, or
Plus. The mode corresponds to the density of the embeddings generated, which can
be specified by the interval_sec config in the request. For each video
interval of interval_sec length, an embedding is generated. The minimum
video interval length is 4 seconds. Interval lengths greater than 120 seconds
might negatively affect the quality of the generated embeddings.
Pricing for video embedding depends on the mode you use. For more information,
see pricing.
The following list summarizes the three modes you can use for video embeddings:
Essential: a maximum of 4 embeddings per minute; minimum video embedding interval of 15 seconds (intervalSec >= 15).
Standard: a maximum of 8 embeddings per minute; minimum video embedding interval of 8 seconds (8 <= intervalSec < 15).
Plus: a maximum of 15 embeddings per minute; minimum video embedding interval of 4 seconds (4 <= intervalSec < 8).
Video embeddings best practices
Consider the following when you send video embedding requests:
To generate a single embedding for the first two minutes of an input video
of any length, use the following videoSegmentConfig setting:
request.json:
// other request body content"videoSegmentConfig":{"intervalSec":120}// other request body content
To generate embeddings for a video longer than two minutes, you can send
multiple requests that specify the start and end times in the
videoSegmentConfig:
request1.json:
// other request body content"videoSegmentConfig":{"startOffsetSec":0,"endOffsetSec":120}// other request body content
request2.json:
// other request body content"videoSegmentConfig":{"startOffsetSec":120,"endOffsetSec":240}// other request body content
Get video embeddings
Use the following sample to get embeddings for video content alone.
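The following minimal Python sketch uses the same SDK classes assumed in the combined sample above; the bucket path is a placeholder, and interval_sec=8 corresponds to the Standard mode described earlier:

import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
video = Video.load_from_file("gs://your-bucket/your-video.mp4")

embeddings = model.get_embeddings(
    video=video,
    # One embedding per 8-second interval (Standard mode).
    video_segment_config=VideoSegmentConfig(interval_sec=8),
)

for video_embedding in embeddings.video_embeddings:
    print(
        f"Video segment {video_embedding.start_offset_sec}-"
        f"{video_embedding.end_offset_sec}s: {video_embedding.embedding}"
    )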
For information about text-only use cases (text-based semantic search,
clustering, long-form document analysis, and other text retrieval or
question-answering use cases), read
Get text embeddings.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025年10月16日 UTC."],[],[]]