Multimodal embeddings API
The Multimodal embeddings API generates vectors based on the input you provide, which can include a combination of image, text, and video data. The embedding vectors can then be used for subsequent tasks like image classification or video content moderation.
For additional conceptual information, see Multimodal embeddings.
Supported Models:
| Model | Code |
|---|---|
| Embeddings for Multimodal | multimodalembedding@001 |
Example syntax
Syntax to send a multimodal embeddings API request.
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
    "instances": [
      ...
    ]
  }'
Python
from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
model.get_embeddings(...)
Parameter list
See examples for implementation details.
Request Body
{
  "instances": [
    {
      "text": string,
      "image": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "mimeType": string
      },
      "video": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "videoSegmentConfig": {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "intervalSec": integer
        }
      },
      "parameters": {
        "dimension": integer
      }
    }
  ]
}
| Parameters | |
|---|---|
| `image` | Optional: The image to generate embeddings for. |
| `text` | Optional: The text to generate embeddings for. |
| `video` | Optional: The video segment to generate embeddings for. |
| `dimension` | Optional: The dimension of the embedding, included in the response. Only applies to text and image input. Accepted values: `128`, `256`, `512`, or `1408` (default). |
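For example, the following sketch uses the Vertex AI SDK for Python to request lower-dimensional text and image embeddings by passing the `dimension` argument to `get_embeddings` (the project ID is a placeholder; the image is a public sample file):

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Replace with your project ID.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
embeddings = model.get_embeddings(
    image=Image.load_from_file(
        "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
    ),
    contextual_text="Colosseum",
    dimension=128,  # Accepted values: 128, 256, 512, or 1408 (default)
)

# Both vectors are returned with 128 dimensions.
print(len(embeddings.image_embedding), len(embeddings.text_embedding))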
Image
| Parameters | |
|---|---|
| `bytesBase64Encoded` | Optional: Image bytes encoded in a base64 string. Must be one of `bytesBase64Encoded` or `gcsUri`. |
| `gcsUri` | Optional: The Cloud Storage location of the image to perform the embedding on. Must be one of `bytesBase64Encoded` or `gcsUri`. |
| `mimeType` | Optional: The MIME type of the content of the image. Supported values: `image/jpeg` and `image/png`. |
Video
| Parameters | |
|---|---|
| `bytesBase64Encoded` | Optional: Video bytes encoded in a base64 string. Must be one of `bytesBase64Encoded` or `gcsUri`. |
| `gcsUri` | Optional: The Cloud Storage location of the video on which to perform the embedding. Must be one of `bytesBase64Encoded` or `gcsUri`. |
| `videoSegmentConfig` | Optional: The video segment configuration. |
VideoSegmentConfig
| Parameters | |
|---|---|
| `startOffsetSec` | Optional: The start offset of the video segment, in seconds. If not specified, it defaults to 0. |
| `endOffsetSec` | Optional: The end offset of the video segment, in seconds. If not specified, it defaults to 120 seconds after the start offset, or to the end of the video if the video is shorter. |
| `intervalSec` | Optional: The interval of the video, in seconds, for which embeddings are generated. The minimum value for `intervalSec` is 4. If not specified, it defaults to 16. |
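As an illustration, the following sketch uses the Vertex AI SDK for Python to request embeddings for 10-second segments covering the first 60 seconds of a video. The bucket path is a placeholder, and it assumes `VideoSegmentConfig` accepts `start_offset_sec`, `end_offset_sec`, and `interval_sec` keyword arguments (only `end_offset_sec` appears in the samples later on this page):

import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

# TODO(developer): Replace with your project ID and video.
vertexai.init(project="your-project-id", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
embeddings = model.get_embeddings(
    video=Video.load_from_file("gs://my-bucket/embeddings/supermarket-video.mp4"),
    video_segment_config=VideoSegmentConfig(
        start_offset_sec=0,
        end_offset_sec=60,
        interval_sec=10,
    ),
)

# One embedding per 10-second segment: [0, 10), [10, 20), ..., [50, 60).
for segment in embeddings.video_embeddings:
    print(segment.start_offset_sec, segment.end_offset_sec, len(segment.embedding))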
Response body
{
  "predictions": [
    {
      "textEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "imageEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "embedding": [
            float,
            // array of 1408 float values
            float
          ]
        }
      ]
    }
  ],
  "deployedModelId": string
}
| Response element | Description |
|---|---|
| `imageEmbedding` | 128, 256, 512, or 1408 dimension list of floats. |
| `textEmbedding` | 128, 256, 512, or 1408 dimension list of floats. |
| `videoEmbeddings` | 1408 dimension list of floats, with the start and end time (in seconds) of the video segment that the embeddings are generated for. |
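Embeddings are returned as plain lists of floats, so you can compare them with standard similarity measures. As a minimal sketch (the `response` dict below is a hypothetical, truncated stand-in for a parsed prediction, not real model output), the following computes the cosine similarity between a text embedding and an image embedding of the same dimension:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical parsed response; real vectors have 128, 256, 512, or 1408 values.
response = {
    "predictions": [
        {
            "textEmbedding": [0.010, -0.004, 0.006],
            "imageEmbedding": [0.003, -0.002, 0.015],
        }
    ]
}

prediction = response["predictions"][0]
similarity = cosine_similarity(prediction["textEmbedding"], prediction["imageEmbedding"])
print(f"Text-image similarity: {similarity:.4f}")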
Examples
Basic use case
Generate embeddings from image
Use the following sample to generate embeddings for an image.
REST
Before using any of the request data, make the following replacements:
- LOCATION: Your project's region. For example, `us-central1`, `europe-west2`, or `asia-northeast3`. For a list of available regions, see Generative AI on Vertex AI locations.
- PROJECT_ID: Your Google Cloud project ID.
- TEXT: The target text to get embeddings for. For example, `a cat`.
- B64_ENCODED_IMG: The target image to get embeddings for. The image must be specified as a base64-encoded byte string.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
Request JSON body:
{
"instances": [
{
"text": "TEXT",
"image": {
"bytesBase64Encoded": "B64_ENCODED_IMG"
}
}
]
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"
PowerShell
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content
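You should receive a JSON response similar to the following: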
{
"predictions": [
{
"textEmbedding": [
0.010477379,
-0.00399621,
0.00576670747,
[...]
-0.00823613815,
-0.0169572588,
-0.00472954148
],
"imageEmbedding": [
0.00262696808,
-0.00198890246,
0.0152047109,
-0.0103145819,
[...]
0.0324628279,
0.0284924973,
0.011650892,
-0.00452344026
]
}
],
"deployedModelId": "DEPLOYED_MODEL_ID"
}
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
import vertexai

from vertexai.vision_models import Image, MultiModalEmbeddingModel
# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
"gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)
embeddings = model.get_embeddings(
image=image,
contextual_text="Colosseum",
dimension=1408,
)
print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123147098, 0.0727171078, ...]
# Text Embedding: [0.00230263756, 0.0278981831, ...]
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';
// const baseImagePath = 'YOUR_BASE_IMAGE_PATH';
// const textPrompt = 'YOUR_TEXT_PROMPT';
const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction service client
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects.
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};
const publisher = 'google';
const model = 'multimodalembedding@001';

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function predictImageFromImageAndText() {
  // Configure the parent resource
  const endpoint = `projects/${project}/locations/${location}/publishers/${publisher}/models/${model}`;

  const fs = require('fs');
  const imageFile = fs.readFileSync(baseImagePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const prompt = {
    text: textPrompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(prompt);
  const instances = [instanceValue];

  const parameter = {
    sampleCount: 1,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  console.log('Get image embedding response');
  const predictions = response.predictions;
  console.log('\tPredictions :');
  for (const prediction of predictions) {
    console.log(`\t\tPrediction : ${JSON.stringify(prediction)}`);
  }
}

await predictImageFromImageAndText();

Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import com.google.cloud.aiplatform.v1beta1.EndpointName;
import com.google.cloud.aiplatform.v1beta1.PredictResponse;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictImageFromImageAndTextSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String textPrompt = "YOUR_TEXT_PROMPT";
    String baseImagePath = "YOUR_BASE_IMAGE_PATH";

    Map<String, Object> parameters = new HashMap<String, Object>();
    parameters.put("sampleCount", 1);
    String location = "us-central1";
    String publisher = "google";
    String model = "multimodalembedding@001";

    predictImageFromImageAndText(
        project, location, publisher, model, textPrompt, baseImagePath, parameters);
  }

  // Generate embeddings for the given image and text inputs
  public static void predictImageFromImageAndText(
      String project,
      String location,
      String publisher,
      String model,
      String textPrompt,
      String baseImagePath,
      Map<String, Object> parameters)
      throws IOException {
    final String endpoint = String.format("%s-aiplatform.googleapis.com:443", location);
    final PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder().setEndpoint(endpoint).build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings)) {
      final EndpointName endpointName =
          EndpointName.ofProjectLocationPublisherModelName(project, location, publisher, model);

      // Convert the image to Base64
      byte[] imageData = Base64.getEncoder().encode(Files.readAllBytes(Paths.get(baseImagePath)));
      String encodedImage = new String(imageData, StandardCharsets.UTF_8);

      JsonObject jsonInstance = new JsonObject();
      jsonInstance.addProperty("text", textPrompt);
      JsonObject jsonImage = new JsonObject();
      jsonImage.addProperty("bytesBase64Encoded", encodedImage);
      jsonInstance.add("image", jsonImage);

      Value instanceValue = stringToValue(jsonInstance.toString());
      List<Value> instances = new ArrayList<>();
      instances.add(instanceValue);

      Gson gson = new Gson();
      String gsonString = gson.toJson(parameters);
      Value parameterValue = stringToValue(gsonString);

      PredictResponse predictResponse =
          predictionServiceClient.predict(endpointName, instances, parameterValue);
      System.out.println("Predict Response");
      System.out.println(predictResponse);
      for (Value prediction : predictResponse.getPredictionsList()) {
        System.out.format("\tPrediction: %s\n", prediction);
      }
    }
  }

  // Convert a Json string to a protobuf.Value
  static Value stringToValue(String value) throws InvalidProtocolBufferException {
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(value, builder);
    return builder.build();
  }
}

Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
  "context"
  "encoding/json"
  "fmt"
  "io"

  aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
  aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
  "google.golang.org/api/option"
  "google.golang.org/protobuf/encoding/protojson"
  "google.golang.org/protobuf/types/known/structpb"
)

// generateForTextAndImage shows how to use the multimodal model to generate embeddings for
// text and image inputs.
func generateForTextAndImage(w io.Writer, project, location string) error {
  // location = "us-central1"
  ctx := context.Background()

  apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
  client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
  if err != nil {
    return fmt.Errorf("failed to construct API client: %w", err)
  }
  defer client.Close()

  model := "multimodalembedding@001"
  endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

  // This is the input to the model's prediction call. For schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
  instance, err := structpb.NewValue(map[string]any{
    "image": map[string]any{
      // Image input can be provided either as a Google Cloud Storage URI or as
      // base64-encoded bytes using the "bytesBase64Encoded" field.
      "gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
    },
    "text": "Colosseum",
  })
  if err != nil {
    return fmt.Errorf("failed to construct request payload: %w", err)
  }

  req := &aiplatformpb.PredictRequest{
    Endpoint: endpoint,
    // The model supports only 1 instance per request.
    Instances: []*structpb.Value{instance},
  }

  resp, err := client.Predict(ctx, req)
  if err != nil {
    return fmt.Errorf("failed to generate embeddings: %w", err)
  }

  instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
  if err != nil {
    return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
  }
  // For response schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
  var instanceEmbeddings struct {
    ImageEmbeddings []float32 `json:"imageEmbedding"`
    TextEmbeddings  []float32 `json:"textEmbedding"`
  }
  if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
    return fmt.Errorf("failed to unmarshal JSON: %w", err)
  }

  imageEmbedding := instanceEmbeddings.ImageEmbeddings
  textEmbedding := instanceEmbeddings.TextEmbeddings

  fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
  fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
  // Example response:
  // Text embedding (length=1408): [0.0023026613 0.027898183 -0.011858357 ... ]
  // Image embedding (length=1408): [-0.012314269 0.07271844 0.00020170923 ... ]

  return nil
}
Generate embeddings from video
Use the following sample to generate embeddings for video content.
REST
The following example uses a video located in Cloud Storage. You can also use the `video.bytesBase64Encoded` field to provide a base64-encoded string representation of the video.
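For instance, the following sketch (using a hypothetical local file path) base64-encodes a local video with Python's standard library and writes a request body you could pass to the curl command shown later in this section:

import base64
import json

# Hypothetical local file path.
with open("supermarket-video.mp4", "rb") as f:
    encoded_video = base64.b64encode(f.read()).decode("utf-8")

request_body = {
    "instances": [
        {
            "video": {
                "bytesBase64Encoded": encoded_video
            }
        }
    ]
}

# Save the body as request.json for use with curl -d @request.json.
with open("request.json", "w") as f:
    json.dump(request_body, f)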
Before using any of the request data, make the following replacements:
- LOCATION: Your project's region. For example,
us-central1,europe-west2, orasia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations. - PROJECT_ID: Your Google Cloud project ID.
- VIDEO_URI: The Cloud Storage URI of the target video to get embeddings for.
For example,
gs://my-bucket/embeddings/supermarket-video.mp4.You can also provide the video as a base64-encoded byte string:
[...] "video": { "bytesBase64Encoded": "B64_ENCODED_VIDEO" } [...] videoSegmentConfig(START_SECOND, END_SECOND, INTERVAL_SECONDS). Optional. The specific video segments (in seconds) the embeddings are generated for.For example:
[...] "videoSegmentConfig": { "startOffsetSec": 10, "endOffsetSec": 60, "intervalSec": 10 } [...]Using this config specifies video data from 10 seconds to 60 seconds and generates embeddings for the following 10 second video intervals: [10, 20), [20, 30), [30, 40), [40, 50), [50, 60). This video interval (
"intervalSec": 10) falls in the Standard video embedding mode, and the user is charged at the Standard mode pricing rate.If you omit
videoSegmentConfig, the service uses the following default values:"videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }. This video interval ("intervalSec": 16) falls in the Essential video embedding mode, and the user is charged at the Essential mode pricing rate.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
Request JSON body:
{
"instances": [
{
"video": {
"gcsUri": "VIDEO_URI",
"videoSegmentConfig": {
"startOffsetSec": START_SECOND,
"endOffsetSec": END_SECOND,
"intervalSec": INTERVAL_SECONDS
}
}
}
]
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"
PowerShell
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content
Response (7 second video, no videoSegmentConfig specified):
{
"predictions": [
{
"videoEmbeddings": [
{
"endOffsetSec": 7,
"embedding": [
-0.0045467657,
0.0258095954,
0.0146885719,
0.00945400633,
[...]
-0.0023291884,
-0.00493789,
0.00975185353,
0.0168156829
],
"startOffsetSec": 0
}
]
}
],
"deployedModelId": "DEPLOYED_MODEL_ID"
}

Response (59 second video, with the following video segment config: "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 60, "intervalSec": 10 }):
{
"predictions": [
{
"videoEmbeddings": [
{
"endOffsetSec": 10,
"startOffsetSec": 0,
"embedding": [
-0.00683252793,
0.0390476175,
[...]
0.00657121744,
0.013023301
]
},
{
"startOffsetSec": 10,
"endOffsetSec": 20,
"embedding": [
-0.0104404651,
0.0357737206,
[...]
0.00509833824,
0.0131902946
]
},
{
"startOffsetSec": 20,
"embedding": [
-0.0113538112,
0.0305239167,
[...]
-0.00195809244,
0.00941874553
],
"endOffsetSec": 30
},
{
"embedding": [
-0.00299320649,
0.0322436653,
[...]
-0.00993082579,
0.00968887936
],
"startOffsetSec": 30,
"endOffsetSec": 40
},
{
"endOffsetSec": 50,
"startOffsetSec": 40,
"embedding": [
-0.00591270532,
0.0368893594,
[...]
-0.00219071587,
0.0042470959
]
},
{
"embedding": [
-0.00458270218,
0.0368121453,
[...]
-0.00317760976,
0.00595594104
],
"endOffsetSec": 59,
"startOffsetSec": 50
}
]
}
],
"deployedModelId": "DEPLOYED_MODEL_ID"
}
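The number of videoEmbeddings entries you receive (and the number of segments you are billed for) depends on the segment configuration and the video length. The following sketch estimates that count; the formula is an assumption derived from the examples above, not an official API guarantee:

import math

def estimate_segment_count(video_length_sec, start_offset_sec=0, end_offset_sec=120, interval_sec=16):
    """Estimate how many video embeddings a videoSegmentConfig produces.

    Defaults mirror the documented service defaults. The formula is inferred
    from the examples on this page, not an official guarantee.
    """
    effective_end = min(end_offset_sec, video_length_sec)
    return max(0, math.ceil((effective_end - start_offset_sec) / interval_sec))

# 59-second video with startOffsetSec 0, endOffsetSec 60, intervalSec 10 -> 6 segments (as above).
print(estimate_segment_count(59, 0, 60, 10))

# 7-second video with the default config -> 1 segment (as above).
print(estimate_segment_count(7))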
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
import vertexai

from vertexai.vision_models import MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig
# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
embeddings = model.get_embeddings(
video=Video.load_from_file(
"gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
),
video_segment_config=VideoSegmentConfig(end_offset_sec=1),
)
# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
print(
f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
)
print(f"Embedding: {video_embedding.embedding}")
# Example response:
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0123456789, ...]
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
  "context"
  "encoding/json"
  "fmt"
  "io"
  "time"

  aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
  aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
  "google.golang.org/api/option"
  "google.golang.org/protobuf/encoding/protojson"
  "google.golang.org/protobuf/types/known/structpb"
)

// generateForVideo shows how to use the multimodal model to generate embeddings for video input.
func generateForVideo(w io.Writer, project, location string) error {
  // location = "us-central1"

  // The default context timeout may be not enough to process a video input.
  ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
  defer cancel()

  apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
  client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
  if err != nil {
    return fmt.Errorf("failed to construct API client: %w", err)
  }
  defer client.Close()

  model := "multimodalembedding@001"
  endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

  // This is the input to the model's prediction call. For schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
  instances, err := structpb.NewValue(map[string]any{
    "video": map[string]any{
      // Video input can be provided either as a Google Cloud Storage URI or as base64-encoded
      // bytes using the "bytesBase64Encoded" field.
      "gcsUri": "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4",
      "videoSegmentConfig": map[string]any{
        "startOffsetSec": 1,
        "endOffsetSec":   5,
      },
    },
  })
  if err != nil {
    return fmt.Errorf("failed to construct request payload: %w", err)
  }

  req := &aiplatformpb.PredictRequest{
    Endpoint: endpoint,
    // The model supports only 1 instance per request.
    Instances: []*structpb.Value{instances},
  }

  resp, err := client.Predict(ctx, req)
  if err != nil {
    return fmt.Errorf("failed to generate embeddings: %w", err)
  }

  instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
  if err != nil {
    return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
  }
  // For response schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
  var instanceEmbeddings struct {
    VideoEmbeddings []struct {
      Embedding      []float32 `json:"embedding"`
      StartOffsetSec float64   `json:"startOffsetSec"`
      EndOffsetSec   float64   `json:"endOffsetSec"`
    } `json:"videoEmbeddings"`
  }
  if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
    return fmt.Errorf("failed to unmarshal json: %w", err)
  }
  // Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
  // each processed segment).
  videoEmbedding := instanceEmbeddings.VideoEmbeddings[0]

  fmt.Fprintf(w, "Video embedding (seconds: %.f-%.f; length=%d): %v\n",
    videoEmbedding.StartOffsetSec,
    videoEmbedding.EndOffsetSec,
    len(videoEmbedding.Embedding),
    videoEmbedding.Embedding,
  )
  // Example response:
  // Video embedding (seconds: 1-5; length=1408): [-0.016427778 0.032878537 -0.030755188 ... ]

  return nil
}
Advanced use case
Use the following sample to get embeddings for video, text, and image content.
For video embedding, you can specify the video segment and embedding density.
REST
The following example uses image, text, and video data. You can use any combination of these data types in your request body.
This sample uses a video located in Cloud Storage. You can also use the `video.bytesBase64Encoded` field to provide a base64-encoded string representation of the video.
Before using any of the request data, make the following replacements:
- LOCATION: Your project's region. For example, `us-central1`, `europe-west2`, or `asia-northeast3`. For a list of available regions, see Generative AI on Vertex AI locations.
- PROJECT_ID: Your Google Cloud project ID.
- TEXT: The target text to get embeddings for. For example, `a cat`.
- IMAGE_URI: The Cloud Storage URI of the target image to get embeddings for. For example, `gs://my-bucket/embeddings/supermarket-img.png`. You can also provide the image as a base64-encoded byte string:
  `[...] "image": { "bytesBase64Encoded": "B64_ENCODED_IMAGE" } [...]`
- VIDEO_URI: The Cloud Storage URI of the target video to get embeddings for. For example, `gs://my-bucket/embeddings/supermarket-video.mp4`. You can also provide the video as a base64-encoded byte string:
  `[...] "video": { "bytesBase64Encoded": "B64_ENCODED_VIDEO" } [...]`
- videoSegmentConfig (START_SECOND, END_SECOND, INTERVAL_SECONDS): Optional. The specific video segments (in seconds) the embeddings are generated for. For example:
  `[...] "videoSegmentConfig": { "startOffsetSec": 10, "endOffsetSec": 60, "intervalSec": 10 } [...]`
  Using this config specifies video data from 10 seconds to 60 seconds and generates embeddings for the following 10-second video intervals: [10, 20), [20, 30), [30, 40), [40, 50), [50, 60). This video interval (`"intervalSec": 10`) falls in the Standard video embedding mode, and the user is charged at the Standard mode pricing rate.
  If you omit `videoSegmentConfig`, the service uses the following default values: `"videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }`. This video interval (`"intervalSec": 16`) falls in the Essential video embedding mode, and the user is charged at the Essential mode pricing rate.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
Request JSON body:
{
"instances": [
{
"text": "TEXT",
"image": {
"gcsUri": "IMAGE_URI"
},
"video": {
"gcsUri": "VIDEO_URI",
"videoSegmentConfig": {
"startOffsetSec": START_SECOND,
"endOffsetSec": END_SECOND,
"intervalSec": INTERVAL_SECONDS
}
}
}
]
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"
PowerShell
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content
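You should receive a JSON response similar to the following: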
{
"predictions": [
{
"textEmbedding": [
0.0105433334,
-0.00302835181,
0.00656806398,
0.00603460241,
[...]
0.00445805816,
0.0139605571,
-0.00170318608,
-0.00490092579
],
"videoEmbeddings": [
{
"startOffsetSec": 0,
"endOffsetSec": 7,
"embedding": [
-0.00673126569,
0.0248149596,
0.0128901172,
0.0107588246,
[...]
-0.00180952181,
-0.0054573305,
0.0117037306,
0.0169312079
]
}
],
"imageEmbedding": [
-0.00728622358,
0.031021487,
-0.00206603738,
0.0273937676,
[...]
-0.00204976718,
0.00321615417,
0.0121978866,
0.0193375275
]
}
],
"deployedModelId": "DEPLOYED_MODEL_ID"
}
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
import vertexai

from vertexai.vision_models import Image, MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig
# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
"gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)
video = Video.load_from_file(
"gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
)
embeddings = model.get_embeddings(
image=image,
video=video,
video_segment_config=VideoSegmentConfig(end_offset_sec=1),
contextual_text="Cars on Highway",
)
print(f"Image Embedding: {embeddings.image_embedding}")
# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
print(
f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
)
print(f"Embedding: {video_embedding.embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123144267, 0.0727186054, 0.000201397663, ...]
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0345234685, ...]
# Text Embedding: [-0.0207006838, -0.00251058186, ...]
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
  "context"
  "encoding/json"
  "fmt"
  "io"
  "time"

  aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
  aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
  "google.golang.org/api/option"
  "google.golang.org/protobuf/encoding/protojson"
  "google.golang.org/protobuf/types/known/structpb"
)

// generateForImageTextAndVideo shows how to use the multimodal model to generate embeddings for
// image, text and video data.
func generateForImageTextAndVideo(w io.Writer, project, location string) error {
  // location = "us-central1"

  // The default context timeout may be not enough to process a video input.
  ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
  defer cancel()

  apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
  client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
  if err != nil {
    return fmt.Errorf("failed to construct API client: %w", err)
  }
  defer client.Close()

  model := "multimodalembedding@001"
  endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

  // This is the input to the model's prediction call. For schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
  instance, err := structpb.NewValue(map[string]any{
    "text": "Domestic cats in natural conditions",
    "image": map[string]any{
      // Image and video inputs can be provided either as a Google Cloud Storage URI or as
      // base64-encoded bytes using the "bytesBase64Encoded" field.
      "gcsUri": "gs://cloud-samples-data/generative-ai/image/320px-Felis_catus-cat_on_snow.jpg",
    },
    "video": map[string]any{
      "gcsUri": "gs://cloud-samples-data/video/cat.mp4",
    },
  })
  if err != nil {
    return fmt.Errorf("failed to construct request payload: %w", err)
  }

  req := &aiplatformpb.PredictRequest{
    Endpoint: endpoint,
    // The model supports only 1 instance per request.
    Instances: []*structpb.Value{instance},
  }

  resp, err := client.Predict(ctx, req)
  if err != nil {
    return fmt.Errorf("failed to generate embeddings: %w", err)
  }

  instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
  if err != nil {
    return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
  }
  // For response schema, see:
  // https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
  var instanceEmbeddings struct {
    ImageEmbeddings []float32 `json:"imageEmbedding"`
    TextEmbeddings  []float32 `json:"textEmbedding"`
    VideoEmbeddings []struct {
      Embedding      []float32 `json:"embedding"`
      StartOffsetSec float64   `json:"startOffsetSec"`
      EndOffsetSec   float64   `json:"endOffsetSec"`
    } `json:"videoEmbeddings"`
  }
  if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
    return fmt.Errorf("failed to unmarshal JSON: %w", err)
  }

  imageEmbedding := instanceEmbeddings.ImageEmbeddings
  textEmbedding := instanceEmbeddings.TextEmbeddings
  // Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
  // each processed segment).
  videoEmbedding := instanceEmbeddings.VideoEmbeddings[0].Embedding

  fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
  fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
  fmt.Fprintf(w, "Video embedding (length=%d): %v\n", len(videoEmbedding), videoEmbedding)
  // Example response:
  // Image embedding (length=1408): [-0.01558477 0.0258355 0.016342038 ... ]
  // Text embedding (length=1408): [-0.005894961 0.008349559 0.015355394 ... ]
  // Video embedding (length=1408): [-0.018867437 0.013997682 0.0012682161 ... ]

  return nil
}
What's next
For detailed documentation, see the following: