Class EvalTask (1.68.0)

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

A class representing an EvalTask.

An evaluation task measures a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
 * prompt_column_name: "prompt"
 * reference_column_name: "reference"
 * response_column_name: "response"
 * baseline_model_response_column_name: "baseline_model_response"
Requirements for different use cases:
 * Bring-your-own-response: A `response` column is required. The response
 column name can be customized with the `response_column_name`
 parameter. If a pairwise metric is used and a baseline model is
 not provided, a `baseline_model_response` column is also required.
 The baseline model response column name can be customized with the
 `baseline_model_response_column_name` parameter. If the `response`
 column or the `baseline_model_response` column is present while the
 corresponding model is also specified, an error is raised.
 * Perform model inference without a prompt template: The evaluation
 dataset must contain a `prompt` column, which is used directly as
 the input to the model.
 * Perform model inference with a prompt template: The evaluation dataset
 must contain a column for each variable name in the prompt
 template. For example, if the prompt template is
 "Instruction: {instruction}, context: {context}", the dataset must
 contain `instruction` and `context` columns.
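
The column requirements above can be restated as a small pre-flight check. The following is only a sketch under stated assumptions: `check_dataset_columns` and its flags are hypothetical names introduced for illustration, it is not the SDK's own validation logic, and it uses only the Python standard library.

```python
import string
from typing import Iterable, List, Optional


def check_dataset_columns(
    columns: Iterable[str],
    *,
    model_given: bool = False,
    uses_pairwise_metric: bool = False,
    baseline_model_given: bool = False,
    prompt_template: Optional[str] = None,
    response_column_name: str = "response",
    baseline_model_response_column_name: str = "baseline_model_response",
) -> List[str]:
    """Hypothetical helper: returns a list of problems (empty if none)."""
    cols = set(columns)
    problems: List[str] = []

    if model_given:
        # A model and a pre-computed response column must not both be present.
        if response_column_name in cols:
            problems.append(f"'{response_column_name}' column conflicts with model")
        if prompt_template is None:
            # Without a template, the `prompt` column is fed to the model as-is.
            if "prompt" not in cols:
                problems.append("'prompt' column is required")
        else:
            # With a template, every template variable needs a matching column.
            needed = {
                name
                for _, name, _, _ in string.Formatter().parse(prompt_template)
                if name
            }
            missing = sorted(needed - cols)
            if missing:
                problems.append(f"missing template columns: {missing}")
    elif response_column_name not in cols:
        # Bring-your-own-response: responses must already be in the dataset.
        problems.append(f"'{response_column_name}' column is required")

    if uses_pairwise_metric:
        has_baseline_col = baseline_model_response_column_name in cols
        if baseline_model_given and has_baseline_col:
            problems.append(
                f"'{baseline_model_response_column_name}' column conflicts "
                "with baseline model"
            )
        elif not baseline_model_given and not has_baseline_col:
            problems.append(
                f"'{baseline_model_response_column_name}' column is required"
            )
    return problems
```

Calling the helper with `list(eval_dataset.columns)` before constructing an `EvalTask` would surface a missing or conflicting column early, although the SDK may report such problems differently.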

Metrics Details:

Descriptions of the supported metrics, their rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, also provide the baseline model responses in the
`baseline_model_response` column.
 ```
 import pandas as pd
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

 # BYOR: the dataset already contains the responses to evaluate.
 eval_dataset = pd.DataFrame({
     "prompt": [...],
     "reference": [...],
     "response": [...],
     "baseline_model_response": [...],
 })
 eval_task = EvalTask(
     dataset=eval_dataset,
     metrics=[
         "bleu",
         "rouge_l_sum",
         MetricPromptTemplateExamples.Pointwise.FLUENCY,
         MetricPromptTemplateExamples.Pairwise.SAFETY,
     ],
     experiment="my-experiment",
 )
 eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
 ```
2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance. The input column name to the
model is `prompt` and must be present in the dataset.
 ```
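 import pandas as pd
 from vertexai.evaluation import EvalTask
 from vertexai.generative_models import GenerativeModel

 # The `prompt` column is sent to the model as-is.
 eval_dataset = pd.DataFrame({
     "reference": [...],
     "prompt": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
     experiment="my-experiment",
 ).evaluate(
     model=GenerativeModel("gemini-1.5-pro"),
     experiment_run_name="gemini-eval-run",
 )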
 ```
3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts are assembled from the evaluation dataset, so every prompt
template variable name must be present in the dataset columns.
 ```
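 import pandas as pd
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
 from vertexai.generative_models import GenerativeModel

 # Column names match the variables used in `prompt_template`.
 eval_dataset = pd.DataFrame({
     "context": [...],
     "instruction": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
 ).evaluate(
     model=GenerativeModel("gemini-1.5-pro"),
     prompt_template="{instruction}. Article: {context}. Summary:",
 )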
 ```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.
 ```
 import pandas as pd
 from openai import OpenAI
 from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

 client = OpenAI()

 def custom_model_fn(prompt: str) -> str:
     """Sends one prompt to the external model and returns its text reply."""
     response = client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=[
             {"role": "user", "content": prompt}
         ],
     )
     return response.choices[0].message.content

 eval_dataset = pd.DataFrame({
     "prompt": [...],
     "reference": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
     experiment="my-experiment",
 ).evaluate(
     model=custom_model_fn,
     experiment_run_name="gpt-eval-run",
 )
 ```
5. To perform pairwise metric evaluation with a model inference step, pass
the `baseline_model` to a `PairwiseMetric` instance and the candidate
`model` to the `EvalTask.evaluate()` function. The input column name
for both models is `prompt` and must be present in the dataset.
 ```
 import pandas as pd
 from vertexai.evaluation import (
     EvalTask,
     MetricPromptTemplateExamples,
     PairwiseMetric,
 )
 from vertexai.generative_models import GenerativeModel

 baseline_model = GenerativeModel("gemini-1.0-pro")
 candidate_model = GenerativeModel("gemini-1.5-pro")
 # The baseline model is attached to the metric, not to evaluate().
 pairwise_groundedness = PairwiseMetric(
     metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
         "pairwise_groundedness"
     ),
     baseline_model=baseline_model,
 )
 eval_dataset = pd.DataFrame({
     "prompt": [...],
 })
 result = EvalTask(
     dataset=eval_dataset,
     metrics=[pairwise_groundedness],
     experiment="my-pairwise-experiment",
 ).evaluate(
     model=candidate_model,
     experiment_run_name="gemini-pairwise-eval-run",
 )
 ```
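
To make the inference-related examples above easier to reason about offline, the sketch below mimics, in plain Python, how a prompt template and a custom `Callable[[str], str]` model might fit together: each dataset row fills the template, and the resulting prompt is passed to the callable. This is an illustration only and not the SDK's internal implementation; `assemble_prompts`, `run_inference`, and `stub_model_fn` are hypothetical names, and the sample row is made up.

```python
from typing import Callable, Dict, List


def assemble_prompts(rows: List[Dict[str, str]], prompt_template: str) -> List[str]:
    """Fills the template once per row; row keys play the role of columns."""
    return [prompt_template.format(**row) for row in rows]


def run_inference(prompts: List[str], model_fn: Callable[[str], str]) -> List[str]:
    """Calls the model callable once per prompt and collects the responses."""
    return [model_fn(prompt) for prompt in prompts]


def stub_model_fn(prompt: str) -> str:
    """Stand-in for a real model call; handy for dry runs without network access."""
    return f"[stub response to a prompt of {len(prompt)} characters]"


# Made-up sample row with one entry per template variable.
rows = [
    {
        "instruction": "Summarize the article",
        "context": "Cats sleep most of the day",
    }
]
prompts = assemble_prompts(rows, "{instruction}. Article: {context}. Summary:")
responses = run_inference(prompts, stub_model_fn)
```

Because the stub has the same `Callable[[str], str]` shape that `evaluate()` accepts for `model`, a function like it could be swapped for a real inference function once the prompt assembly looks right.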

Properties

dataset

Returns evaluation dataset.

experiment

Returns experiment name.

metrics

Returns metrics.

Methods

EvalTask

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

Initializes an EvalTask.

display_runs

display_runs()

Displays experiment runs associated with this EvalTask.

evaluate

evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    baseline_model_response_column_name: typing.Optional[str] = None,
    evaluation_service_qps: typing.Optional[float] = None,
    retry_timeout: float = 600.0,
    output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult

Runs an evaluation for the EvalTask.
