Class EvalTask (1.68.0)

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

A class representing an EvalTask.

An evaluation task measures a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
 * prompt_column_name: "prompt"
 * reference_column_name: "reference"
 * response_column_name: "response"
 * baseline_model_response_column_name: "baseline_model_response"

Requirements for different use cases:
 * Bring-your-own-response: A `response` column is required. The response
   column name can be customized with the `response_column_name`
   parameter. If a pairwise metric is used and no baseline model is
   provided, a `baseline_model_response` column is also required. The
   baseline model response column name can be customized with the
   `baseline_model_response_column_name` parameter. If the `response`
   column or the `baseline_model_response` column is present while the
   corresponding model is also specified, an error is raised.
 * Perform model inference without a prompt template: The evaluation
   dataset must contain a `prompt` column, which is used directly as
   the input to the model.
 * Perform model inference with a prompt template: The evaluation dataset
   must contain columns whose names match the variable names in the
   prompt template. For example, if the prompt template is
   "Instruction: {instruction}, context: {context}", the dataset must
   contain `instruction` and `context` columns.
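
The column requirements above can be sketched as a small pre-flight check. This is only an illustration of the stated rules, not the SDK's own validation: `check_columns` and all of its parameters are hypothetical names, and it returns a list of problems rather than raising an error as `EvalTask` is described to do.

```python
from typing import Iterable, List


def check_columns(
    columns: Iterable[str],
    *,
    model_given: bool,
    prompt_template_given: bool = False,
    uses_pairwise_metric: bool = False,
    baseline_model_given: bool = False,
    response_column_name: str = "response",
    baseline_model_response_column_name: str = "baseline_model_response",
) -> List[str]:
    """Hypothetical helper: list violations of the documented column rules."""
    cols = set(columns)
    problems = []

    # Candidate response: supplied either by a column or by a model, not both.
    if model_given and response_column_name in cols:
        problems.append(f"'{response_column_name}' column present while a model is specified")
    if not model_given and response_column_name not in cols:
        problems.append(f"'{response_column_name}' column required when no model is specified")

    # Baseline response: only relevant when a pairwise metric is used.
    if uses_pairwise_metric:
        has_baseline_col = baseline_model_response_column_name in cols
        if baseline_model_given and has_baseline_col:
            problems.append(
                f"'{baseline_model_response_column_name}' column present while a baseline model is specified"
            )
        if not baseline_model_given and not has_baseline_col:
            problems.append(
                f"'{baseline_model_response_column_name}' column required when no baseline model is specified"
            )

    # Inference without a prompt template reads the `prompt` column directly.
    runs_inference = model_given or (uses_pairwise_metric and baseline_model_given)
    if runs_inference and not prompt_template_given and "prompt" not in cols:
        problems.append("'prompt' column required for model inference without a prompt template")

    return problems
```

An empty list means the columns are consistent with the rules as written here; template variable names would still need to be checked separately against the prompt template.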

Metrics Details:

Descriptions of the supported metrics, their rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column of the dataset. If a pairwise metric is
used for BYOR evaluation, also provide the baseline model responses in the
`baseline_model_response` column.
 ```
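 import pandas as pd
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

 # The dataset already contains the candidate and baseline responses (BYOR).
 eval_dataset = pd.DataFrame({
     "prompt": [...],
     "reference": [...],
     "response": [...],
     "baseline_model_response": [...],
 })
 eval_task = EvalTask(
     dataset=eval_dataset,
     metrics=[
         "bleu",
         "rouge_l_sum",
         MetricPromptTemplateExamples.Pointwise.FLUENCY,
         MetricPromptTemplateExamples.Pairwise.SAFETY,
     ],
     experiment="my-experiment",
 )
 # No `model` is passed, so the `response` column is evaluated directly.
 eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")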
 ```
2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance. The input column name to the
model is `prompt` and must be present in the dataset.
 ```
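 import pandas as pd
 from vertexai.evaluation import EvalTask
 from vertexai.generative_models import GenerativeModel

 eval_dataset = pd.DataFrame({
     "reference": [...],
     "prompt": [...],
 })
 # The model generates a response for each value in the `prompt` column.
 result = EvalTask(
     dataset=eval_dataset,
     metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
     experiment="my-experiment",
 ).evaluate(
     model=GenerativeModel("gemini-1.5-pro"),
     experiment_run_name="gemini-eval-run",
 )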
 ```
3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.
 ```
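 import pandas as pd
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
 from vertexai.generative_models import GenerativeModel

 # Column names match the variables used in `prompt_template` below.
 eval_dataset = pd.DataFrame({
     "context": [...],
     "instruction": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
 ).evaluate(
     model=GenerativeModel("gemini-1.5-pro"),
     prompt_template="{instruction}. Article: {context}. Summary:",
 )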
 ```
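
To make the relationship between template variables and dataset columns concrete, the sketch below assembles prompts from rows using Python's built-in `str.format` placeholders. It is an assumption that this resembles what the SDK does internally; `template_variables` and `assemble_prompts` are hypothetical helpers, and rows are plain dictionaries rather than a DataFrame.

```python
from string import Formatter
from typing import Dict, List


def template_variables(template: str) -> List[str]:
    """Return the distinct placeholder names of a str.format-style template, in order."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def assemble_prompts(template: str, rows: List[Dict[str, str]]) -> List[str]:
    """Fill the template once per row; fail if a row lacks a template variable."""
    variables = template_variables(template)
    prompts = []
    for index, row in enumerate(rows):
        missing = [name for name in variables if name not in row]
        if missing:
            raise KeyError(f"row {index} is missing template variables: {missing}")
        # Only the variables named in the template are passed to format().
        prompts.append(template.format(**{name: row[name] for name in variables}))
    return prompts
```

With the template from the example above, a row needs both an `instruction` and a `context` value; extra columns in a row are ignored.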
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.
 ```
 import pandas as pd
 from openai import OpenAI
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

 client = OpenAI()

 def custom_model_fn(prompt: str) -> str:
     # Send one prompt to the external model and return its text response.
     response = client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=[
             {"role": "user", "content": prompt}
         ],
     )
     return response.choices[0].message.content

 eval_dataset = pd.DataFrame({
     "prompt": [...],
     "reference": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
     experiment="my-experiment",
 ).evaluate(
     model=custom_model_fn,
     experiment_run_name="gpt-eval-run",
 )
 ```
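
According to the `evaluate` signature, a custom inference function is typed as `typing.Callable[[str], str]`. As a minimal sketch of that contract, the following builds an offline stand-in that returns canned answers, which may be handy for exercising the inference step without calling an external model API. `make_canned_model` and its lookup table are hypothetical and not part of the SDK.

```python
from typing import Callable, Dict


def make_canned_model(answers: Dict[str, str], default: str = "") -> Callable[[str], str]:
    """Hypothetical helper: build a str -> str callable backed by a lookup table."""

    def canned_model(prompt: str) -> str:
        # Return the stored answer for a known prompt, else the default text.
        return answers.get(prompt, default)

    return canned_model


# Example stand-in that could be passed where a custom inference function is expected.
offline_model = make_canned_model({"What is 2 + 2?": "4"}, default="unknown")
```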
5. To perform pairwise metric evaluation with a model inference step, pass
the `baseline_model` to a `PairwiseMetric` instance and the candidate
`model` to the `EvalTask.evaluate()` function. The input column name
for both models is `prompt` and must be present in the dataset.
 ```
 import pandas as pd
 from vertexai.evaluation import (
     EvalTask,
     MetricPromptTemplateExamples,
     PairwiseMetric,
 )
 from vertexai.generative_models import GenerativeModel

 baseline_model = GenerativeModel("gemini-1.0-pro")
 candidate_model = GenerativeModel("gemini-1.5-pro")
 # The baseline model is attached to the metric, not to evaluate().
 pairwise_groundedness = PairwiseMetric(
     metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
         "pairwise_groundedness"
     ),
     baseline_model=baseline_model,
 )
 eval_dataset = pd.DataFrame({
     "prompt": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[pairwise_groundedness],
     experiment="my-pairwise-experiment",
 ).evaluate(
     model=candidate_model,
     experiment_run_name="gemini-pairwise-eval-run",
 )
 ```

Properties

dataset

Returns the evaluation dataset.

experiment

Returns the experiment name.

metrics

Returns the metrics.

Methods

EvalTask

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

Initializes an EvalTask.

display_runs

display_runs()

Displays experiment runs associated with this EvalTask.

evaluate

evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    baseline_model_response_column_name: typing.Optional[str] = None,
    evaluation_service_qps: typing.Optional[float] = None,
    retry_timeout: float = 600.0,
    output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult

Runs an evaluation for the EvalTask.

Last updated 2025-10-30 UTC.