Gen AI evaluation service API
The Gen AI evaluation service lets you evaluate your large language models (LLMs) across several metrics with your own criteria. You can provide inference-time inputs, LLM responses and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.
Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory
computed metrics, such as rouge, bleu, and tool function-call metrics.
PointwiseMetric and PairwiseMetric are generic model-based metrics that
you can customize with your own criteria.
Because the service takes the prediction results directly from models as input,
the evaluation service can perform both inference and subsequent evaluation on
all models supported by
Vertex AI.
For more information on evaluating a model, see Gen AI evaluation service overview.
Limitations
The following are limitations of the evaluation service:
- The evaluation service might have a propagation delay on your first call.
- Most model-based metrics consume gemini-2.0-flash quota because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute these model-based metrics.
- Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.
Example syntax
Syntax to send an evaluation call.
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances \
  -d '{
    "pointwise_metric_input": {
      "metric_spec": {
        ...
      },
      "instance": {
        ...
      }
    }
  }'
Python
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

uri = f'https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
Parameter list
| Parameter | Description |
|---|---|
| `exact_match_input` | Optional: Input to assess if the prediction matches the reference exactly. |
| `bleu_input` | Optional: Input to compute BLEU score by comparing the prediction against the reference. |
| `rouge_input` | Optional: Input to compute ROUGE scores by comparing the prediction against the reference. |
| `fluency_input` | Optional: Input to assess a single response's language mastery. |
| `coherence_input` | Optional: Input to assess a single response's ability to provide a coherent, easy-to-follow reply. |
| `safety_input` | Optional: Input to assess a single response's level of safety. |
| `groundedness_input` | Optional: Input to assess a single response's ability to provide or reference information included only in the input text. |
| `fulfillment_input` | Optional: Input to assess a single response's ability to completely fulfill instructions. |
| `summarization_quality_input` | Optional: Input to assess a single response's overall ability to summarize text. |
| `pairwise_summarization_quality_input` | Optional: Input to compare two responses' overall summarization quality. |
| `summarization_helpfulness_input` | Optional: Input to assess a single response's ability to provide a summarization, which contains the details necessary to substitute the original text. |
| `summarization_verbosity_input` | Optional: Input to assess a single response's ability to provide a succinct summarization. |
| `question_answering_quality_input` | Optional: Input to assess a single response's overall ability to answer questions, given a body of text to reference. |
| `pairwise_question_answering_quality_input` | Optional: Input to compare two responses' overall ability to answer questions, given a body of text to reference. |
| `question_answering_relevance_input` | Optional: Input to assess a single response's ability to respond with relevant information when asked a question. |
| `question_answering_helpfulness_input` | Optional: Input to assess a single response's ability to provide key details when answering a question. |
| `question_answering_correctness_input` | Optional: Input to assess a single response's ability to correctly answer a question. |
| `pointwise_metric_input` | Optional: Input for a generic pointwise evaluation. |
| `pairwise_metric_input` | Optional: Input for a generic pairwise evaluation. |
| `tool_call_valid_input` | Optional: Input to assess a single response's ability to predict a valid tool call. |
| `tool_name_match_input` | Optional: Input to assess a single response's ability to predict a tool call with the right tool name. |
| `tool_parameter_key_match_input` | Optional: Input to assess a single response's ability to predict a tool call with correct parameter names. |
| `tool_parameter_kv_match_input` | Optional: Input to assess a single response's ability to predict a tool call with correct parameter names and values. |
| `comet_input` | Optional: Input to evaluate using COMET. |
| `metricx_input` | Optional: Input to evaluate using MetricX. |
ExactMatchInput
{ "exact_match_input":{ "metric_spec":{}, "instances":[ { "prediction":string, "reference":string } ] } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instances` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instances.prediction` | Optional: LLM response. |
| `instances.reference` | Optional: Golden LLM response for reference. |
ExactMatchResults
{ "exact_match_results":{ "exact_match_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `exact_match_metric_values` | Repeated. Evaluation results per instance input. |
| `exact_match_metric_values.score` | `float`. One of the following: `0`: the prediction doesn't match the reference exactly; `1`: the prediction matches the reference exactly. |
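For reference, the following is a minimal sketch of an `exact_match_input` request, modeled on the Python example syntax shown earlier in this page; the project ID, region, and sample predictions are placeholder values.

```python
import json

from google import auth
from google.auth.transport import requests as google_auth_requests

# Placeholder values -- replace with your own project and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

# Each instance is scored 1 if the prediction matches the reference exactly, 0 otherwise.
data = {
    "exact_match_input": {
        "metric_spec": {},
        "instances": [
            {"prediction": "Paris is the capital of France.",
             "reference": "Paris is the capital of France."},
            {"prediction": "Lyon is the capital of France.",
             "reference": "Paris is the capital of France."},
        ],
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
print(json.dumps(result.json(), indent=2))
```

The response should contain `exact_match_results` with one score per instance.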
BleuInput
{ "bleu_input":{ "metric_spec":{ "use_effective_order":bool }, "instances":[ { "prediction":string, "reference":string } ] } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `metric_spec.use_effective_order` | Optional: Whether to take into account n-gram orders without any match. |
| `instances` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instances.prediction` | Optional: LLM response. |
| `instances.reference` | Optional: Golden LLM response for reference. |
BleuResults
{ "bleu_results":{ "bleu_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `bleu_metric_values` | Repeated. Evaluation results per instance input. |
| `bleu_metric_values.score` | `float`. BLEU score in the range `[0, 1]`, where higher scores indicate the prediction is more similar to the reference. |
RougeInput
{ "rouge_input":{ "metric_spec":{ "rouge_type":string, "use_stemmer":bool, "split_summaries":bool }, "instances":[ { "prediction":string, "reference":string } ] } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `metric_spec.rouge_type` | Optional: Acceptable values include `rougen` (for example, `rouge1` through `rouge9`) to score based on n-gram overlap, `rougeL` to score based on the longest common subsequence, and `rougeLsum` to split the text into sentences before computing `rougeL`. |
| `metric_spec.use_stemmer` | Optional: Whether Porter stemmer should be used to strip word suffixes to improve matching. |
| `metric_spec.split_summaries` | Optional: Whether to add newlines between sentences for `rougeLsum`. |
| `instances` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instances.prediction` | Optional: LLM response. |
| `instances.reference` | Optional: Golden LLM response for reference. |
RougeResults
{ "rouge_results":{ "rouge_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `rouge_metric_values` | Repeated. Evaluation results per instance input. |
| `rouge_metric_values.score` | `float`. ROUGE score in the range `[0, 1]`, where higher scores indicate the prediction is more similar to the reference. |
FluencyInput
{ "fluency_input":{ "metric_spec":{}, "instance":{ "prediction":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response. |
| `instance.prediction` | Optional: LLM response. |
FluencyResult
{ "fluency_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Fluency score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
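As a concrete illustration, the following sketch sends a `fluency_input` request using the same REST pattern as the example syntax above; the project, region, and sample prediction are placeholder values, and the call consumes judge-model quota as described in the limitations.

```python
import json

from google import auth
from google.auth.transport import requests as google_auth_requests

# Placeholder values -- replace with your own project and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

# Model-based metrics such as fluency take a single instance rather than a list.
data = {
    "fluency_input": {
        "metric_spec": {},
        "instance": {
            "prediction": (
                "The weather forecast says it will be sunny tomorrow, "
                "so we are planning a picnic in the park."
            )
        },
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
# The response contains fluency_result with score, explanation, and confidence.
print(json.dumps(result.json(), indent=2))
```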
CoherenceInput
{ "coherence_input":{ "metric_spec":{}, "instance":{ "prediction":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response. |
| `instance.prediction` | Optional: LLM response. |
CoherenceResult
{ "coherence_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Coherence score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
SafetyInput
{ "safety_input":{ "metric_spec":{}, "instance":{ "prediction":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response. |
| `instance.prediction` | Optional: LLM response. |
SafetyResult
{ "safety_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Safety score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
GroundednessInput
{ "groundedness_input":{ "metric_spec":{}, "instance":{ "prediction":string, "context":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: `GroundednessSpec`. Metric spec, defining the metric's behavior. |
| `instance` | Optional: `GroundednessInstance`. Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: `string`. LLM response. |
| `instance.context` | Optional: `string`. Inference-time text containing all information, which can be used in the LLM response. |
GroundednessResult
{ "groundedness_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Groundedness score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
FulfillmentInput
{ "fulfillment_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
FulfillmentResult
{ "fulfillment_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Fulfillment score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
SummarizationQualityInput
{ "summarization_quality_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationQualityResult
{ "summarization_quality_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Summarization quality score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
PairwiseSummarizationQualityInput
{ "pairwise_summarization_quality_input":{ "metric_spec":{}, "instance":{ "baseline_prediction":string, "prediction":string, "instruction":string, "context":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.baseline_prediction` | Optional: Baseline model LLM response. |
| `instance.prediction` | Optional: Candidate model LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseSummarizationQualityResult
{ "pairwise_summarization_quality_result":{ "pairwise_choice":PairwiseChoice, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `pairwise_choice` | `PairwiseChoice`. The response judged to have the better summarization quality: `BASELINE`, `CANDIDATE`, or `TIE`. |
| `explanation` | `string`. Justification for the choice. |
| `confidence` | `float`. Confidence score for the result. |
SummarizationHelpfulnessInput
{ "summarization_helpfulness_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationHelpfulnessResult
{ "summarization_helpfulness_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Summarization helpfulness score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
SummarizationVerbosityInput
{ "summarization_verbosity_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationVerbosityResult
{ "summarization_verbosity_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Summarization verbosity score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
QuestionAnsweringQualityInput
{ "question_answering_quality_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringQualityResult
{ "question_answering_quality_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Question answering quality score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
PairwiseQuestionAnsweringQualityInput
{ "pairwise_question_answering_quality_input":{ "metric_spec":{}, "instance":{ "baseline_prediction":string, "prediction":string, "instruction":string, "context":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.baseline_prediction` | Optional: Baseline model LLM response. |
| `instance.prediction` | Optional: Candidate model LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseQuestionAnsweringQualityResult
{ "pairwise_question_answering_quality_result":{ "pairwise_choice":PairwiseChoice, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `pairwise_choice` | `PairwiseChoice`. The response judged to answer the question better: `BASELINE`, `CANDIDATE`, or `TIE`. |
| `explanation` | `string`. Justification for the choice. |
| `confidence` | `float`. Confidence score for the result. |
QuestionAnsweringRelevanceInput
{ "question_answering_quality_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string } } }
| Parameters | |
|---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringRelevancyResult
{ "question_answering_relevancy_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Question answering relevance score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
QuestionAnsweringHelpfulnessInput
{ "question_answering_helpfulness_input":{ "metric_spec":{}, "instance":{ "prediction":string, "instruction":string, "context":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringHelpfulnessResult
{ "question_answering_helpfulness_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Question answering helpfulness score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
QuestionAnsweringCorrectnessInput
{ "question_answering_correctness_input":{ "metric_spec":{ "use_reference":bool }, "instance":{ "prediction":string, "reference":string, "instruction":string, "context":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `metric_spec.use_reference` | Optional: Whether the reference is used in the evaluation. |
| `instance` | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
| `instance.prediction` | Optional: LLM response. |
| `instance.reference` | Optional: Golden LLM response for reference. |
| `instance.instruction` | Optional: Instruction used at inference time. |
| `instance.context` | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringCorrectnessResult
{ "question_answering_correctness_result":{ "score":float, "explanation":string, "confidence":float } }
| Output | Description |
|---|---|
| `score` | `float`. Question answering correctness score. |
| `explanation` | `string`. Justification for the score. |
| `confidence` | `float`. Confidence score for the result. |
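The sketch below shows how a `question_answering_correctness_input` payload might set `use_reference` in the metric spec; the field values are illustrative placeholders that follow the schema above, and the payload can be sent with the same AuthorizedSession pattern shown in the example syntax section.

```python
# Illustrative payload only; all field values are placeholders.
data = {
    "question_answering_correctness_input": {
        "metric_spec": {
            # Set to True so the judge model compares the response against the reference.
            "use_reference": True
        },
        "instance": {
            "prediction": "The Great Barrier Reef is located off the coast of Queensland, Australia.",
            "reference": "The Great Barrier Reef lies off the coast of Queensland, Australia.",
            "instruction": "Where is the Great Barrier Reef located?",
            "context": (
                "The Great Barrier Reef, the world's largest coral reef system, "
                "is located off the coast of Queensland, Australia."
            ),
        },
    }
}
```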
PointwiseMetricInput
{ "pointwise_metric_input":{ "metric_spec":{ "metric_prompt_template":string }, "instance":{ "json_instance":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Required: Metric spec, defining the metric's behavior. |
| `metric_spec.metric_prompt_template` | Required: A prompt template defining the metric. It is rendered by the key-value pairs in `instance.json_instance`. |
| `instance` | Required: Evaluation input, consisting of `json_instance`. |
| `instance.json_instance` | Optional: The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`. |
PointwiseMetricResult
{ "pointwise_metric_result":{ "score":float, "explanation":string, } }
| Output | Description |
|---|---|
| `score` | `float`. Score for the pointwise metric. |
| `explanation` | `string`. Justification for the score. |
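To make the generic pointwise metric concrete, here is a hedged sketch of a `pointwise_metric_input` payload; the prompt template wording and the `response` key inside `json_instance` are invented for illustration and can be anything your template references.

```python
import json

# Illustrative payload; the template text and json_instance keys are examples only.
# The keys in json_instance must match the placeholders in metric_prompt_template.
data = {
    "pointwise_metric_input": {
        "metric_spec": {
            "metric_prompt_template": (
                "Rate the helpfulness of the following response on a scale of 1 to 5, "
                "then explain your rating.\n\nResponse: {response}"
            )
        },
        "instance": {
            # json_instance is a JSON-serialized string of key-value pairs.
            "json_instance": json.dumps(
                {"response": "To reset your password, open Settings and select Security."}
            )
        },
    }
}
# Send `data` to the :evaluateInstances endpoint as shown in the example syntax section;
# the response contains pointwise_metric_result with a score and an explanation.
```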
PairwiseMetricInput
{ "pairwise_metric_input":{ "metric_spec":{ "metric_prompt_template":string }, "instance":{ "json_instance":string, } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Required: Metric spec, defining the metric's behavior. |
| `metric_spec.metric_prompt_template` | Required: A prompt template defining the metric. It is rendered by the key-value pairs in `instance.json_instance`. |
| `instance` | Required: Evaluation input, consisting of `json_instance`. |
| `instance.json_instance` | Optional: The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`. |
PairwiseMetricResult
{ "pairwise_metric_result":{ "score":float, "explanation":string, } }
| Output | Description |
|---|---|
| `score` | `float`. Score for the pairwise metric. |
| `explanation` | `string`. Justification for the score. |
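Similarly, a `pairwise_metric_input` payload might look like the following sketch; the template wording and the `baseline_response` and `candidate_response` keys are illustrative assumptions rather than fixed field names.

```python
import json

# Illustrative payload; key names inside json_instance are arbitrary as long as they
# match the placeholders used in metric_prompt_template.
data = {
    "pairwise_metric_input": {
        "metric_spec": {
            "metric_prompt_template": (
                "Compare the two responses below and state which one answers the "
                "question more clearly.\n\nBaseline: {baseline_response}\n"
                "Candidate: {candidate_response}"
            )
        },
        "instance": {
            "json_instance": json.dumps(
                {
                    "baseline_response": "Restart the router to fix the issue.",
                    "candidate_response": (
                        "Unplug the router for 30 seconds, plug it back in, "
                        "and wait for the lights to turn green."
                    ),
                }
            )
        },
    }
}
# Send `data` to the :evaluateInstances endpoint as in the example syntax section.
```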
ToolCallValidInput
{ "tool_call_valid_input":{ "metric_spec":{}, "instance":{ "prediction":string, "reference":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instance.prediction` | Optional: Candidate model LLM response, which is a JSON serialized string that contains the `content` and `tool_calls` keys. For example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
| `instance.reference` | Optional: Golden model output in the same format as prediction. |
ToolCallValidResults
{ "tool_call_valid_results":{ "tool_call_valid_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `tool_call_valid_metric_values` | Repeated. Evaluation results per instance input. |
| `tool_call_valid_metric_values.score` | `float`. `1` indicates a valid tool call; `0` indicates an invalid tool call. |
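Because `prediction` and `reference` are JSON-serialized strings rather than nested objects, it can be easier to build them with `json.dumps`, as in this sketch; the tool name and arguments reuse the `book_tickets` example from the table above.

```python
import json

# Both prediction and reference are JSON strings that contain "content" and "tool_calls".
tool_call = {
    "content": "",
    "tool_calls": [
        {
            "name": "book_tickets",
            "arguments": {
                "movie": "Mission Impossible Dead Reckoning Part 1",
                "theater": "Regal Edwards 14",
                "location": "Mountain View CA",
                "showtime": "7:30",
                "date": "2024-03-30",
                "num_tix": "2",
            },
        }
    ],
}

data = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {
            "prediction": json.dumps(tool_call),
            "reference": json.dumps(tool_call),
        },
    }
}
# Send `data` to the :evaluateInstances endpoint as in the example syntax section.
```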
ToolNameMatchInput
{ "tool_name_match_input":{ "metric_spec":{}, "instance":{ "prediction":string, "reference":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instance.prediction` | Optional: Candidate model LLM response, which is a JSON serialized string that contains the `content` and `tool_calls` keys. |
| `instance.reference` | Optional: Golden model output in the same format as prediction. |
ToolNameMatchResults
{ "tool_name_match_results":{ "tool_name_match_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `tool_name_match_metric_values` | Repeated. Evaluation results per instance input. |
| `tool_name_match_metric_values.score` | `float`. `1` indicates the predicted tool call name matches the reference; `0` indicates it doesn't. |
ToolParameterKeyMatchInput
{ "tool_parameter_key_match_input":{ "metric_spec":{}, "instance":{ "prediction":string, "reference":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instance.prediction` | Optional: Candidate model LLM response, which is a JSON serialized string that contains the `content` and `tool_calls` keys. |
| `instance.reference` | Optional: Golden model output in the same format as prediction. |
ToolParameterKeyMatchResults
{ "tool_parameter_key_match_results":{ "tool_parameter_key_match_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `tool_parameter_key_match_metric_values` | Repeated. Evaluation results per instance input. |
| `tool_parameter_key_match_metric_values.score` | `float`. Score in the range `[0, 1]`, where higher scores indicate more parameter names match between the predicted and reference tool calls. |
ToolParameterKVMatchInput
{ "tool_parameter_kv_match_input":{ "metric_spec":{}, "instance":{ "prediction":string, "reference":string } } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. |
| `instance.prediction` | Optional: Candidate model LLM response, which is a JSON serialized string that contains the `content` and `tool_calls` keys. |
| `instance.reference` | Optional: Golden model output in the same format as prediction. |
ToolParameterKVMatchResults
{ "tool_parameter_kv_match_results":{ "tool_parameter_kv_match_metric_values":[ { "score":float } ] } }
| Output | Description |
|---|---|
| `tool_parameter_kv_match_metric_values` | Repeated. Evaluation results per instance input. |
| `tool_parameter_kv_match_metric_values.score` | `float`. Score in the range `[0, 1]`, where higher scores indicate more parameter names and values match between the predicted and reference tool calls. |
CometInput
{ "comet_input":{ "metric_spec":{ "version":string }, "instance":{ "prediction":string, "source":string, "reference":string, }, } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `metric_spec.version` | Optional: The COMET version to use for evaluation. |
| `metric_spec.source_language` | Optional: Source language in BCP-47 format. For example, "es". |
| `metric_spec.target_language` | Optional: Target language in BCP-47 format. For example, "es". |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation are dependent on the COMET version. |
| `instance.prediction` | Optional: Candidate model LLM response. This is the output of the LLM which is being evaluated. |
| `instance.source` | Optional: Source text. This is in the original language that the prediction was translated from. |
| `instance.reference` | Optional: Ground truth used to compare against the prediction. This is in the same language as the prediction. |
CometResult
{ "comet_result":{ "score":float } }
| Output | Description |
|---|---|
| `score` | `float`. COMET score in the range `[0, 1]`, where higher scores indicate a better translation. |
MetricxInput
{ "metricx_input":{ "metric_spec":{ "version":string }, "instance":{ "prediction":string, "source":string, "reference":string, }, } }
| Parameter | Description |
|---|---|
| `metric_spec` | Optional: Metric spec, defining the metric's behavior. |
| `metric_spec.version` | Optional: The MetricX version to use for evaluation. |
| `metric_spec.source_language` | Optional: Source language in BCP-47 format. For example, "es". |
| `metric_spec.target_language` | Optional: Target language in BCP-47 format. For example, "es". |
| `instance` | Optional: Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation are dependent on the MetricX version. |
| `instance.prediction` | Optional: Candidate model LLM response. This is the output of the LLM which is being evaluated. |
| `instance.source` | Optional: Source text, which is in the original language that the prediction was translated from. |
| `instance.reference` | Optional: Ground truth used to compare against the prediction. It is in the same language as the prediction. |
MetricxResult
{ "metricx_result":{ "score":float } }
| Output | Description |
|---|---|
| `score` | `float`. MetricX score in the range `[0, 25]`, where lower scores indicate a better translation. |
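As an illustration of the translation metrics, the following sketch builds a `comet_input` payload; the language codes and texts are examples, and the `version` value is left as a placeholder to fill in from the `metric_spec.version` documentation.

```python
# Illustrative payload for a COMET evaluation; replace COMET_VERSION with a supported
# value from the metric_spec.version documentation.
data = {
    "comet_input": {
        "metric_spec": {
            "version": "COMET_VERSION",
            "source_language": "es",
            "target_language": "en",
        },
        "instance": {
            "source": "¿Dónde está la estación de tren más cercana?",
            "prediction": "Where is the nearest train station?",
            "reference": "Where is the closest train station?",
        },
    }
}
# Send `data` to the :evaluateInstances endpoint as in the example syntax section;
# the response contains comet_result.score.
```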
Examples
Evaluate an output
The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following:
- summarization_quality
- groundedness
- fulfillment
- summarization_helpfulness
- summarization_verbosity
Python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples
# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
eval_dataset = pd.DataFrame(
{
"instruction": [
"Summarize the text in one sentence.",
"Summarize the text such that a five-year-old can understand.",
],
"context": [
"""As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.""",
"""A team of archaeologists has unearthed ancient artifacts shedding light on a
previously unknown civilization. The findings challenge existing historical
narratives and provide valuable insights into human history.""",
],
"response": [
"A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
"Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
],
}
)
eval_task = EvalTask(
dataset=eval_dataset,
metrics=[
MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
MetricPromptTemplateExamples.Pointwise.VERBOSITY,
MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
],
)
prompt_template = (
"Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)
print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
print(f"{key}: \t{value}")
print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count: 2
# summarization_quality/mean: 3.5
# summarization_quality/std: 2.1213203435596424
# ...
Go
import (
    context_pkg "context"
    "fmt"
    "io"

    aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
    aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
    "google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
    // location = "us-central1"
    ctx := context_pkg.Background()
    apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
    client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
    if err != nil {
        return fmt.Errorf("unable to create aiplatform client: %w", err)
    }
    defer client.Close()

    // evaluate the pre-generated model response against the reference (ground truth)
    responseToEvaluate := `
    The city is undertaking a major project to revamp its public transportation system.
    This initiative is designed to improve efficiency, reduce carbon emissions, and promote
    eco-friendly commuting. The city expects that this investment will enhance accessibility
    and usher in a new era of sustainable urban transportation.
    `
    reference := `
    As part of a comprehensive initiative to tackle urban congestion and foster
    sustainable urban living, a major city has revealed ambitious plans for an
    extensive overhaul of its public transportation system. The project aims not
    only to improve the efficiency and reliability of public transit but also to
    reduce the city's carbon footprint and promote eco-friendly commuting options.
    City officials anticipate that this strategic investment will enhance
    accessibility for residents and visitors alike, ushering in a new era of
    efficient, environmentally conscious urban transportation.
    `

    req := aiplatformpb.EvaluateInstancesRequest{
        Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
        // Check the API reference for a full list of supported metric inputs:
        // https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
        MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
            GroundednessInput: &aiplatformpb.GroundednessInput{
                MetricSpec: &aiplatformpb.GroundednessSpec{},
                Instance: &aiplatformpb.GroundednessInstance{
                    Context:    &reference,
                    Prediction: &responseToEvaluate,
                },
            },
        },
    }

    resp, err := client.EvaluateInstances(ctx, &req)
    if err != nil {
        return fmt.Errorf("evaluateInstances failed: %v", err)
    }
    results := resp.GetGroundednessResult()
    fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
    fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
    fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
    // Example response:
    // score: 1.00
    // confidence: 1.00
    // explanation:
    // STEP 1: All aspects of the response are found in the context.
    // The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
    // STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.
    return nil
}
Evaluate an output: pairwise summarization quality
The following example demonstrates how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request.
- PREDICTION: LLM response.
- BASELINE_PREDICTION: Baseline model LLM response.
- INSTRUCTION: The instruction used at inference time.
- CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
Request JSON body:
{
"pairwise_summarization_quality_input": {
"metric_spec": {},
"instance": {
"prediction": "PREDICTION",
"baseline_prediction": "BASELINE_PREDICTION",
"instruction": "INSTRUCTION",
"context": "CONTEXT",
}
}
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \"
PowerShell
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \" | Select-Object -Expand Content
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)
# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
prompt = """
Summarize the text such that a five-year-old can understand.
# Text
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""
eval_dataset = pd.DataFrame({"prompt": [prompt]})
# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")
# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
"gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)
prompt_template = MetricPromptTemplateExamples.get_prompt_template(
"pairwise_summarization_quality"
)
summarization_quality_metric = PairwiseMetric(
metric="pairwise_summarization_quality",
metric_prompt_template=prompt_template,
baseline_model=baseline_model,
)
eval_task = EvalTask(
dataset=eval_dataset,
metrics=[summarization_quality_metric],
experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)
baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
"pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
"pairwise_summarization_quality/explanation"
].iloc[0]
print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
    context_pkg "context"
    "fmt"
    "io"

    aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
    aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
    "google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
    // location = "us-central1"
    ctx := context_pkg.Background()
    apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
    client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
    if err != nil {
        return fmt.Errorf("unable to create aiplatform client: %w", err)
    }
    defer client.Close()

    context := `
    As part of a comprehensive initiative to tackle urban congestion and foster
    sustainable urban living, a major city has revealed ambitious plans for an
    extensive overhaul of its public transportation system. The project aims not
    only to improve the efficiency and reliability of public transit but also to
    reduce the city's carbon footprint and promote eco-friendly commuting options.
    City officials anticipate that this strategic investment will enhance
    accessibility for residents and visitors alike, ushering in a new era of
    efficient, environmentally conscious urban transportation.
    `
    instruction := "Summarize the text such that a five-year-old can understand."
    baselineResponse := `
    The city wants to make it easier for people to get around without using cars.
    They're going to make the buses and trains better and faster, so people will want to
    use them more. This will help the air be cleaner and make the city a better place to live.
    `
    candidateResponse := `
    The city is making big changes to how people get around. They want to make the buses and
    trains work better and be easier for everyone to use. This will also help the environment
    by getting people to use less gas. The city thinks these changes will make it easier for
    everyone to get where they need to go.
    `

    req := aiplatformpb.EvaluateInstancesRequest{
        Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
        MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
            PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
                MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
                Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
                    Context:            &context,
                    Instruction:        &instruction,
                    Prediction:         &candidateResponse,
                    BaselinePrediction: &baselineResponse,
                },
            },
        },
    }

    resp, err := client.EvaluateInstances(ctx, &req)
    if err != nil {
        return fmt.Errorf("evaluateInstances failed: %v", err)
    }
    results := resp.GetPairwiseSummarizationQualityResult()
    fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
    fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
    fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
    // Example response:
    // choice: BASELINE
    // confidence: 0.50
    // explanation:
    // BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...
    return nil
}
Get ROUGE score
The following example calls the Gen AI evaluation service API to get the ROUGE score
of a prediction, generated by a number of inputs. The ROUGE inputs use
metric_spec, which determines the metric's behavior.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request.
- PREDICTION: LLM response.
- REFERENCE: Golden LLM response for reference.
- ROUGE_TYPE: The calculation used to determine the ROUGE score. See `metric_spec.rouge_type` for acceptable values.
- USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see `metric_spec.use_stemmer`.
- SPLIT_SUMMARIES: Determines whether new lines are added between `rougeLsum` sentences. For acceptable values, see `metric_spec.split_summaries`.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
Request JSON body:
{
"rouge_input": {
"instances": {
"prediction": "PREDICTION",
"reference": "REFERENCE.",
},
"metric_spec": {
"rouge_type": "ROUGE_TYPE",
"use_stemmer": USE_STEMMER,
"split_summaries": SPLIT_SUMMARIES,
}
}
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \"
PowerShell
Save the request body in a file named request.json,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \" | Select-Object -Expand Content
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask
# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")
reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""
# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
{
"response": [
"""The Great Barrier Reef, the world's largest coral reef system located
in Australia, is a vast and diverse ecosystem. However, it faces serious
threats from climate change, ocean acidification, and coral bleaching,
endangering its rich marine life.""",
"""The Great Barrier Reef, a vast coral reef system off the coast of
Queensland, Australia, is the world's largest. It's a complex ecosystem
supporting diverse marine life, including endangered species. However,
climate change, ocean acidification, and coral bleaching are serious
threats to its survival.""",
"""The Great Barrier Reef, the world's largest coral reef system off the
coast of Australia, is a vast and diverse ecosystem with thousands of
reefs and islands. It is home to a multitude of marine life, including
endangered species, but faces serious threats from climate change, ocean
acidification, and coral bleaching.""",
],
"reference": [reference_summarization] * 3,
}
)
eval_task = EvalTask(
dataset=eval_dataset,
metrics=[
"rouge_1",
"rouge_2",
"rouge_l",
"rouge_l_sum",
],
)
result = eval_task.evaluate()
print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
print(f"{key}: \t{value}")
print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count: 3
# rouge_1/mean: 0.7191161666666667
# rouge_1/std: 0.06765143922270488
# rouge_2/mean: 0.5441118566666666
# ...
# Metrics Table:
#
# response reference ... rouge_l/score rouge_l_sum/score
# 0 The Great Barrier Reef, the world's ... \n The Great Barrier Reef, the ... ... 0.577320 0.639175
# 1 The Great Barrier Reef, a vast coral... \n The Great Barrier Reef, the ... ... 0.552381 0.666667
# 2 The Great Barrier Reef, the world's ... \n The Great Barrier Reef, the ... ... 0.774775 0.774775
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import (
    "context"
    "fmt"
    "io"

    aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
    aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
    "google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
    // location = "us-central1"
    ctx := context.Background()
    apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
    client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
    if err != nil {
        return fmt.Errorf("unable to create aiplatform client: %w", err)
    }
    defer client.Close()

    modelResponse := `
    The Great Barrier Reef, the world's largest coral reef system located in Australia,
    is a vast and diverse ecosystem. However, it faces serious threats from climate change,
    ocean acidification, and coral bleaching, endangering its rich marine life.
    `
    reference := `
    The Great Barrier Reef, the world's largest coral reef system, is
    located off the coast of Queensland, Australia. It's a vast
    ecosystem spanning over 2,300 kilometers with thousands of reefs
    and islands. While it harbors an incredible diversity of marine
    life, including endangered species, it faces serious threats from
    climate change, ocean acidification, and coral bleaching.
    `

    req := aiplatformpb.EvaluateInstancesRequest{
        Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
        MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
            RougeInput: &aiplatformpb.RougeInput{
                // Check the API reference for the list of supported ROUGE metric types:
                // https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
                MetricSpec: &aiplatformpb.RougeSpec{
                    RougeType: "rouge1",
                },
                Instances: []*aiplatformpb.RougeInstance{
                    {
                        Prediction: &modelResponse,
                        Reference:  &reference,
                    },
                },
            },
        },
    }

    resp, err := client.EvaluateInstances(ctx, &req)
    if err != nil {
        return fmt.Errorf("evaluateInstances failed: %v", err)
    }
    fmt.Fprintln(w, "evaluation results:")
    fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
    // Example response:
    // [score:0.6597938]
    return nil
}
What's next
- For detailed documentation, see Run an evaluation.