The multimodal embeddings model generates 1408-dimension vectors based on the
input you provide, which can include a combination of image, text, and video
data. The embedding vectors can then be used for subsequent tasks like image
classification or video content moderation.
The image embedding vector and text embedding vector are in the same semantic
space with the same dimensionality. Consequently, these vectors can be used
interchangeably for use cases like searching images by text or searching videos
by image.
For text-only embedding use cases, we recommend using the Vertex AI
text-embeddings API instead. For example, the text-embeddings API might be
better for text-based semantic search, clustering, long-form document analysis,
and other text retrieval or question-answering use cases. For more information,
see Get text embeddings.
Supported models
You can get multimodal embeddings by using the following model:
multimodalembedding
Best practices
Consider the following input aspects when using the multimodal embeddings model:
Text in images - The model can distinguish text in images, similar to
optical character recognition (OCR). If you need to distinguish between a
description of the image content and the text within an image, consider
using prompt engineering to specify your target content.
For example: instead of just "cat", specify "picture of a cat" or
"the text 'cat'", depending on your use case.
Embedding similarities - The dot product of embeddings isn't a
calibrated probability. The dot product is a similarity metric and might have
different score distributions for different use cases. Consequently, avoid
using a fixed value threshold to measure quality. Instead, use ranking
approaches for retrieval, or use sigmoid for classification.
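For example, the following minimal Python sketch (using NumPy; the embeddings, and the sigmoid parameters a and b, are placeholders you would supply or fit on your own labeled data) contrasts ranking by dot product with a calibrated sigmoid for classification:

import numpy as np

def rank_by_similarity(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = np.asarray(candidate_embeddings) @ np.asarray(query_embedding)
    # Rank candidates instead of applying a fixed threshold to raw scores.
    return np.argsort(scores)[::-1]

def classify(score, a=1.0, b=0.0):
    """Map a raw dot-product score to (0, 1); a and b must be fit to your own data."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))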
API usage
API limits
The following limits apply when you use the multimodalembedding model for
text, image, and video embeddings:
Text and image data
Maximum number of API requests per minute per project: 120 to 600, depending on the region.
Maximum text length: 32 tokens (approximately 32 words). If the input exceeds 32 tokens, the model internally shortens the input to this length.
Maximum image size: 20 MB. To avoid increased network latency, use smaller images (see the sketch after these limits). Additionally, the model resizes images to 512 x 512 pixel resolution, so you don't need to provide higher-resolution images.
Video data
Audio supported: No. The model doesn't consider audio content when generating video embeddings.
Video formats: AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, and WMV.
Maximum video length (Cloud Storage): No limit. However, only 2 minutes of content can be analyzed at a time.
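Because the model resizes images to 512 x 512 anyway, one way to reduce upload latency is to downscale images before sending them. The following is a minimal sketch that assumes the Pillow library; it returns base64-encoded bytes, which the API accepts through the bytesBase64Encoded image field shown in the Go sample later on this page:

import base64
import io

from PIL import Image

def image_to_base64(path, max_size=(512, 512)):
    """Downscale an image and return base64-encoded JPEG bytes for an embedding request."""
    with Image.open(path) as img:
        img.thumbnail(max_size)  # preserves aspect ratio; never upscales
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")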
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get 300ドル in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the
resourcemanager.projects.create permission. Learn how to grant
roles.
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
Select the tab for how you plan to use the samples on this page:
Java
To use the Java samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Node.js
To use the Node.js samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Python
To use the Python samples on this page in a local
development environment, install and initialize the gcloud CLI, and
then set up Application Default Credentials with your user credentials.
Optional. Review pricing for this
feature. Pricing for embeddings depends on the type of data you send
(such as image or text), and also depends on the mode you use for certain
data types (such as Video Plus, Video Standard, or Video Essential).
If requests exceed your project's quota for the multimodalembedding model, they
might fail with an error like the following:
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for
aiplatform.googleapis.com/online_prediction_requests_per_base_model with base
model: multimodalembedding. Please submit a quota increase request.
If this is the first time you receive this error, use the Google Cloud console
to request a quota adjustment for your project. Filter for the quota named in
the error message before requesting your adjustment.
If you have already sent a quota adjustment request, wait before sending another
request. If you need to further increase the quota, repeat the quota adjustment
request with your justification for a sustained quota adjustment.
Specify lower-dimension embeddings
By default, an embedding request returns a 1408-dimension float vector for each
data type. You can also specify lower-dimension embeddings (128, 256, or 512
float vectors) for text and image data. This option lets you optimize for
latency and storage, or for quality, based on how you plan to use the
embeddings. Lower-dimension embeddings reduce storage needs and latency for
subsequent embedding tasks (like search or recommendation), while
higher-dimension embeddings offer greater accuracy for the same tasks.
REST
To request lower-dimension embeddings, add the parameters.dimension field.
The parameter accepts one of the following values: 128, 256, 512, or
1408. The response includes an embedding of that dimension.
Before using any of the request data,
make the following replacements:
LOCATION: Your project's region. For example,
us-central1, europe-west2, or asia-northeast3. For a list
of available regions, see
Generative AI on Vertex AI locations.
TEXT: The target text to get embeddings for. For example,
a cat.
EMBEDDING_DIMENSION: The number of embedding dimensions. Lower values offer decreased
latency when using these embeddings for subsequent tasks, while higher values offer better
accuracy. Available values: 128,
256, 512, and 1408 (default).
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
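For example, a minimal text-only request body, following the same instances and parameters schema used in the code samples below, might look like this (saved as request.json):

{
  "instances": [
    {
      "text": "TEXT"
    }
  ],
  "parameters": {
    "dimension": EMBEDDING_DIMENSION
  }
}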
Python

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

# TODO(developer): Try different dimensions: 128, 256, 512, 1408
embedding_dimension = 128

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

embeddings = model.get_embeddings(
    image=image,
    contextual_text="Colosseum",
    dimension=embedding_dimension,
)

print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [0.0622573346, -0.0406507477, 0.0260440577, ...]
# Text Embedding: [0.27469793, -0.146258667, 0.0222803634, ...]
Go
import (
    "context"
    "encoding/json"
    "fmt"
    "io"

    aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
    aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
    "google.golang.org/api/option"
    "google.golang.org/protobuf/encoding/protojson"
    "google.golang.org/protobuf/types/known/structpb"
)

// generateWithLowerDimension shows how to generate lower-dimensional embeddings for text and image inputs.
func generateWithLowerDimension(w io.Writer, project, location string) error {
    // location = "us-central1"
    ctx := context.Background()
    apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
    client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
    if err != nil {
        return fmt.Errorf("failed to construct API client: %w", err)
    }
    defer client.Close()

    model := "multimodalembedding@001"
    endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

    // This is the input to the model's prediction call. For schema, see:
    // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
    instance, err := structpb.NewValue(map[string]any{
        "image": map[string]any{
            // Image input can be provided either as a Google Cloud Storage URI or as
            // base64-encoded bytes using the "bytesBase64Encoded" field.
            "gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
        },
        "text": "Colosseum",
    })
    if err != nil {
        return fmt.Errorf("failed to construct request payload: %w", err)
    }

    // TODO(developer): Try different dimensions: 128, 256, 512, 1408
    outputDimensionality := 128
    params, err := structpb.NewValue(map[string]any{
        "dimension": outputDimensionality,
    })
    if err != nil {
        return fmt.Errorf("failed to construct request params: %w", err)
    }

    req := &aiplatformpb.PredictRequest{
        Endpoint: endpoint,
        // The model supports only 1 instance per request.
        Instances:  []*structpb.Value{instance},
        Parameters: params,
    }

    resp, err := client.Predict(ctx, req)
    if err != nil {
        return fmt.Errorf("failed to generate embeddings: %w", err)
    }

    instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
    if err != nil {
        return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
    }

    // For response schema, see:
    // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
    var instanceEmbeddings struct {
        ImageEmbeddings []float32 `json:"imageEmbedding"`
        TextEmbeddings  []float32 `json:"textEmbedding"`
    }
    if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
        return fmt.Errorf("failed to unmarshal JSON: %w", err)
    }

    imageEmbedding := instanceEmbeddings.ImageEmbeddings
    textEmbedding := instanceEmbeddings.TextEmbeddings

    fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
    fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
    // Example response:
    // Text Embedding (length=128): [0.27469793 -0.14625867 0.022280363 ... ]
    // Image Embedding (length=128): [0.06225733 -0.040650766 0.02604402 ... ]

    return nil
}
Send an embedding request (image and text)
Use the following code samples to send an embedding request with image and text
data. The samples show how to send a request with both data types, but you can
also use the service with an individual data type.
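For reference, here is a minimal Python sketch adapted from the lower-dimension sample above, but without the dimension parameter; the project ID, image URI, and contextual text are placeholders:

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

# Omit image or contextual_text to send only one data type.
embeddings = model.get_embeddings(image=image, contextual_text="Colosseum")
print(f"Image embedding: {embeddings.image_embedding}")
print(f"Text embedding: {embeddings.text_embedding}")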
Send an embedding request (video, image, and text)
When sending an embedding request, you can specify an input video alone, or
you can specify a combination of video, image, and text data.
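As an illustration, the following is a minimal Python sketch of a combined request. It assumes the Vertex AI SDK's Video and VideoSegmentConfig classes and the video_embeddings response field, used alongside the classes from the samples above; the Cloud Storage paths and text are placeholders:

import vertexai
from vertexai.vision_models import (
    Image,
    MultiModalEmbeddingModel,
    Video,
    VideoSegmentConfig,
)

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file("gs://your-bucket/your-image.png")
video = Video.load_from_file("gs://your-bucket/your-video.mp4")

embeddings = model.get_embeddings(
    image=image,
    video=video,
    contextual_text="your description",
    # Embed only the first two minutes of the video.
    video_segment_config=VideoSegmentConfig(end_offset_sec=120),
)

print(f"Image embedding: {embeddings.image_embedding}")
print(f"Text embedding: {embeddings.text_embedding}")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video segment {video_embedding.start_offset_sec}-"
        f"{video_embedding.end_offset_sec}s: {video_embedding.embedding}"
    )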
Video embedding modes
There are three modes you can use with video embeddings: Essential, Standard, or
Plus. The mode corresponds to the density of the embeddings generated, which can
be specified by the interval_sec config in the request. For each video
interval of interval_sec length, an embedding is generated. The minimum
video interval length is 4 seconds. Interval lengths greater than 120 seconds
might negatively affect the quality of the generated embeddings.
Pricing for video embedding depends on the mode you use. For more information,
see pricing.
The following list summarizes the three modes you can use for video embeddings:
Essential: a maximum of 4 embeddings per minute; minimum video embedding interval of 15 seconds (intervalSec >= 15).
Standard: a maximum of 8 embeddings per minute; minimum video embedding interval of 8 seconds (8 <= intervalSec < 15).
Plus: a maximum of 15 embeddings per minute; minimum video embedding interval of 4 seconds (4 <= intervalSec < 8).
Video embeddings best practices
Consider the following when you send video embedding requests:
To generate a single embedding for the first two minutes of an input video
of any length, use the following videoSegmentConfig setting:
request.json:
// other request body content"videoSegmentConfig":{"intervalSec":120}// other request body content
To generate embeddings for a video longer than two minutes, you can send
multiple requests that specify the start and end times in the
videoSegmentConfig:
request1.json:
// other request body content"videoSegmentConfig":{"startOffsetSec":0,"endOffsetSec":120}// other request body content
request2.json:
// other request body content"videoSegmentConfig":{"startOffsetSec":120,"endOffsetSec":240}// other request body content
Get video embeddings
Use the following sample to get embeddings for video content alone.
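The following minimal Python sketch uses the same SDK classes assumed in the combined sample above; the bucket path is a placeholder, and interval_sec=8 corresponds to the Standard mode described earlier:

import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
video = Video.load_from_file("gs://your-bucket/your-video.mp4")

embeddings = model.get_embeddings(
    video=video,
    # One embedding per 8-second interval (Standard mode).
    video_segment_config=VideoSegmentConfig(interval_sec=8),
)

for video_embedding in embeddings.video_embeddings:
    print(
        f"Video segment {video_embedding.start_offset_sec}-"
        f"{video_embedding.end_offset_sec}s: {video_embedding.embedding}"
    )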
For information about text-only use cases (text-based semantic search,
clustering, long-form document analysis, and other text retrieval or
question-answering use cases), read
Get text embeddings.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025年10月16日 UTC."],[],[]]