New evals + eval tagging #1015
Walkthrough: Adds pytest markers.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml (2)

18-18: Use neutral, unique pod names in tests
The manifest defines domain-specific pods (`orders-app`, `invoices-app`, `finance-app`, `accounting-app`) that hint at the business logic and risk name collisions across tests. Per LLM test guidelines, rename these (and any other pods/services) to neutral, unique identifiers (e.g. `app-a`, `app-b`, `app-c`, etc.) in tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/app/kafka-manifest.yaml.
14-14: Use `app-55` as the Kubernetes namespace.
- In test_case.yaml (line 14), change `-n ask-holmes-namespace-55` to `-n app-55`.
- In kafka-manifest.yaml, update all `namespace: ask-holmes-namespace-55` entries to `namespace: app-55`.

tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml (1)
1-1: Replace `namespace-66` with `app-66` in the fixture.
In tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml (line 1 and the corresponding kubectl commands), update all occurrences of `namespace-66` to `app-66` to adhere to the `app-<testid>` namespace convention for LLM tests.

tests/llm/fixtures/test_ask_holmes/120_new_relic_traces2/test_case.yaml (1)
16-22: Critical: Namespace does not follow required pattern.
The namespace "nr" does not follow the required pattern `app-<testid>`. Based on the test ID (120), this should be `app-120`. As per coding guidelines.
Apply this diff to fix the namespace:

```diff
 before_test: |
   [ -n "${NEW_RELIC_ACCOUNT_ID:-}" ] && [ -n "${NEW_RELIC_API_KEY:-}" ] && [ -n "${NEW_RELIC_LICENSE_KEY:-}" ] || { for v in NEW_RELIC_ACCOUNT_ID NEW_RELIC_API_KEY NEW_RELIC_LICENSE_KEY; do [ -n "${!v:-}" ] || echo "Missing env var: $v"; done; exit 1; }
-  kubectl create namespace nr
-  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}" -n nr
+  kubectl create namespace app-120
+  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}" -n app-120
   kubectl apply -f ./deployment.yaml
 after_test: |
-  kubectl delete secret newrelickey -n nr
+  kubectl delete secret newrelickey -n app-120
   kubectl delete -f ./deployment.yaml
```

tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml (1)
21-30: Missing namespace specification in kubectl commands.
The `before_test` and `after_test` scripts don't specify a namespace, which means resources will be created in the default namespace. Per coding guidelines, LLM tests must use a dedicated namespace of the form `app-<testid>`. This test should use `app-119`.
Apply this diff to add the required namespace:

```diff
 before_test: |
   [ -n "${NEW_RELIC_ACCOUNT_ID:-}" ] && [ -n "${NEW_RELIC_API_KEY:-}" ] && [ -n "${NEW_RELIC_LICENSE_KEY:-}" ] || { for v in NEW_RELIC_ACCOUNT_ID NEW_RELIC_API_KEY NEW_RELIC_LICENSE_KEY; do [ -n "${!v:-}" ] || echo "Missing env var: $v"; done; exit 1; }
-  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}"
+  kubectl create namespace app-119 || true
+  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}" -n app-119
-  kubectl apply -f ./sales.yaml
+  kubectl apply -f ./sales.yaml -n app-119
   sleep 120
 after_test: |
-  kubectl delete secret newrelickey
-  kubectl delete -f ./sales.yaml
+  kubectl delete namespace app-119 --ignore-not-found=true
```

tests/llm/fixtures/test_ask_holmes/91h_datadog_logs_empty_query_with_url/test_case.yaml (1)
6-10: Add missing pytest marker "datadog"
The `datadog` tag in tests/llm/fixtures/test_ask_holmes/91h_datadog_logs_empty_query_with_url/test_case.yaml isn't declared in pyproject.toml; either add it under `tool.pytest.ini_options.markers` or remove it.

tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml (1)
12-18: Add undeclared pytest markers `prometheus` and `embeds` to pyproject.toml.

tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml (1)
10-10: Update namespace to `app-70`.
In tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml at line 10, replace `kubectl create namespace namespace-70 || true` with `kubectl create namespace app-70 || true` to comply with the `app-<testid>` naming convention.

tests/llm/fixtures/test_ask_holmes/74_config_change_impact/test_case.yaml (1)
11-22: Namespace naming violates coding guidelines.
The test uses `namespace-74`, but according to coding guidelines, each LLM test must use a dedicated Kubernetes namespace of the form `app-<testid>`. This test should use `app-74` instead of `namespace-74`.
As per coding guidelines: Each LLM test must use a dedicated Kubernetes namespace of the form app-<testid>.
Apply this diff to fix the namespace naming:

```diff
 before_test: |
   # Create namespace first since the secret depends on it (|| true ignores if it already exists)
-  kubectl create namespace namespace-74 || true
+  kubectl create namespace app-74 || true
   kubectl create secret generic cache-service-logs-script \
     --from-file=generate_logs.py=./generate_logs.py \
-    -n namespace-74 --dry-run=client -o yaml | kubectl apply -f -
+    -n app-74 --dry-run=client -o yaml | kubectl apply -f -
   kubectl apply -f ./manifest.yaml
   sleep 40
 after_test: |
   kubectl delete -f ./manifest.yaml
-  kubectl delete secret cache-service-logs-script -n namespace-74 --ignore-not-found
-  kubectl delete namespace namespace-74 --ignore-not-found
+  kubectl delete secret cache-service-logs-script -n app-74 --ignore-not-found
+  kubectl delete namespace app-74 --ignore-not-found
```

Additionally, verify that `./manifest.yaml` uses the same namespace:

```bash
#!/bin/bash
# Check the namespace used in the manifest file
echo "=== Checking namespace in manifest.yaml for test 74 ==="
fd -e yaml manifest.yaml tests/llm/fixtures/test_ask_holmes/74_config_change_impact/ --exec cat {} | rg 'namespace:' -B 2 -A 1
```

tests/llm/fixtures/test_ask_holmes/118_new_relic_logs/test_case.yaml (1)
17-26: Missing dedicated namespace for test.
According to the coding guidelines, each LLM test must use a dedicated Kubernetes namespace of the form `app-<testid>`. This test (ID: 118) should create and use namespace `app-118`, but the `before_test` and `after_test` scripts do not specify a namespace, defaulting to the `default` namespace.
Apply this diff to use a dedicated namespace:

```diff
 before_test: |
   [ -n "${NEW_RELIC_ACCOUNT_ID:-}" ] && [ -n "${NEW_RELIC_API_KEY:-}" ] && [ -n "${NEW_RELIC_LICENSE_KEY:-}" ] || { for v in NEW_RELIC_ACCOUNT_ID NEW_RELIC_API_KEY NEW_RELIC_LICENSE_KEY; do [ -n "${!v:-}" ] || echo "Missing env var: $v"; done; exit 1; }
-  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}"
-  kubectl apply -f ./nrlogger.yaml
+  kubectl create namespace app-118 || true
+  kubectl create secret generic newrelickey --from-literal=key="${NEW_RELIC_LICENSE_KEY}" -n app-118
+  kubectl apply -f ./nrlogger.yaml -n app-118
   sleep 300
 after_test: |
-  kubectl delete secret newrelickey
-  kubectl delete -f ./nrlogger.yaml
+  kubectl delete secret newrelickey -n app-118 --ignore-not-found
+  kubectl delete -f ./nrlogger.yaml -n app-118 --ignore-not-found
+  kubectl delete namespace app-118 --ignore-not-found
```

As per coding guidelines
tests/llm/fixtures/test_ask_holmes/75_network_flapping/test_case.yaml (1)
11-20: Namespace violates naming convention.
The test uses `namespace-75`, but the coding guideline requires namespaces of the form `app-<testid>` (`app-75` in this case). As per coding guidelines.
Apply this diff to fix the namespace:

```diff
-  kubectl create namespace namespace-75 || true
+  kubectl create namespace app-75 || true
   kubectl create secret generic frontend-logs-script \
     --from-file=generate_logs.py=./generate_logs.py \
-    -n namespace-75 --dry-run=client -o yaml | kubectl apply -f -
+    -n app-75 --dry-run=client -o yaml | kubectl apply -f -
   kubectl apply -f ./manifest.yaml
   sleep 40
 after_test: |
   kubectl delete -f ./manifest.yaml
-  kubectl delete secret frontend-logs-script -n namespace-75 --ignore-not-found
+  kubectl delete secret frontend-logs-script -n app-75 --ignore-not-found
-  kubectl delete namespace namespace-75 --ignore-not-found
+  kubectl delete namespace app-75 --ignore-not-found
```

Note: You'll also need to update the manifest.yaml to reference app-75.
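The namespace findings above (tests 66, 70, 74, 75, 118, 119, 120) all reduce to one rule: a fixture's namespace is determined by the leading test ID of its directory name. A minimal sketch of how that convention could be checked mechanically — the function names and the regex-based scan of `before_test` text are illustrative assumptions, not existing repo tooling:

```python
import re

def expected_namespace(fixture_dir: str) -> str:
    """Derive the required namespace (app-<testid>) from a fixture
    directory name such as '74_config_change_impact'."""
    match = re.match(r"(\d+)", fixture_dir)
    if not match:
        raise ValueError(f"no test id in {fixture_dir!r}")
    return f"app-{match.group(1)}"

def namespace_violations(fixture_dir: str, script: str) -> list:
    """Return namespaces referenced via 'kubectl create namespace <name>'
    or '-n <name>' that do not match the required app-<testid> name."""
    required = expected_namespace(fixture_dir)
    used = re.findall(r"(?:create namespace|-n)\s+([\w-]+)", script)
    return [ns for ns in used if ns != required]

print(expected_namespace("74_config_change_impact"))  # app-74
print(namespace_violations("74_config_change_impact",
                           "kubectl create namespace namespace-74 || true"))
```

Run against each fixture's `before_test` block, such a check would have caught every `namespace-<id>` and `ask-holmes-namespace-<id>` variant flagged in this review.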
♻️ Duplicate comments (3)
tests/llm/fixtures/test_ask_holmes/49_logs_since_last_week/test_case.yaml (1)
7-7: Verify that the "medium" tag is a declared pytest marker.
Similar to the concern in `160b_cpu_per_namespace_graph_with_prom_truncation/test_case.yaml`, ensure that `medium` corresponds to a declared pytest marker in `pyproject.toml`.
Based on learnings
tests/llm/fixtures/test_ask_holmes/98_logs_transparency_default_time/test_case.yaml (1)
22-22: Verify that the "medium" tag is a declared pytest marker.
Same concern as in previous files: ensure that `medium` corresponds to a declared pytest marker in `pyproject.toml`.
Based on learnings
tests/llm/fixtures/test_ask_holmes/118_new_relic_logs/test_case.yaml (1)
12-12: Verify that the "medium" tag is a declared pytest marker.
Same concern as in previous files: ensure that `medium` corresponds to a declared pytest marker in `pyproject.toml`.
Based on learnings
🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml (1)
188-249: Consider adding securityContext for test isolation.
The deployment runs containers as root without `allowPrivilegeEscalation: false` or `runAsNonRoot: true`. While this may be acceptable for test fixtures, adding basic security constraints improves test isolation and follows best practices.
Apply this diff to add security context:

```diff
     spec:
       containers:
         - name: main
           image: python:3.11-slim
           command: ["python", "-u", "/scripts/main.py"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 1000
           volumeMounts:
             - name: scripts
               mountPath: /scripts
           resources:
             requests:
               memory: "64Mi"
               cpu: "10m"
             limits:
               memory: "128Mi"
               cpu: "100m"
         - name: connection-manager
           image: python:3.11-slim
           command: ["python", "-u", "/scripts/sidecar.py"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 1000
           volumeMounts:
             - name: scripts
               mountPath: /scripts
           resources:
             requests:
               memory: "32Mi"
               cpu: "10m"
             limits:
               memory: "64Mi"
               cpu: "50m"
         - name: metrics-monitor
           image: python:3.11-slim
           command: ["python", "-u", "/scripts/monitor.py"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 1000
           volumeMounts:
             - name: scripts
               mountPath: /scripts
```

tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml (1)
155-190: Consider adding a security context for best practices.
The Deployment uses neutral naming that doesn't hint at the problem, which is good. However, the container lacks a security context. While this is a test fixture, adding basic security controls is recommended.
Consider adding a security context to the container:

```diff
         - name: api
           image: python:3.11-slim
           command: ["python", "-u", "/scripts/app.py"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 1000
+            capabilities:
+              drop:
+                - ALL
           volumeMounts:
             - name: scripts
               mountPath: /scripts
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (65)
- pyproject.toml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/100a_historical_logs/test_case.yaml (2 hunks)
- tests/llm/fixtures/test_ask_holmes/100b_historical_logs_nonstandard_label/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/101_historical_logs_pod_deleted/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/102_loki_label_discovery/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/103_logs_transparency_default_limit/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/110_cpu_graph_robusta_runner/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/117b_new_relic_block_embed/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/118_new_relic_logs/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml (2 hunks)
- tests/llm/fixtures/test_ask_holmes/120_new_relic_traces2/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/121_new_relic_checkout_errors_tracing/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/122_new_relic_checkout_latency_tracing_rebuild/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/123_new_relic_checkout_errors_tracing/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/160a_cpu_per_namespace_graph/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/160b_cpu_per_namespace_graph_with_prom_truncation/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/160c_cpu_per_namespace_graph_with_global_truncation/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/30_basic_promql_graph_cluster_memory/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/32_basic_promql_graph_pod_cpu/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/33_cpu_metrics_discovery/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/35_error_log_flood/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_trace_by_id.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_trace_by_id_2.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_traces_by_deployment.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_backend_deployment.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_checkout.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_fraud.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_get_by_kind_in_namespace_depl.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_get_by_kind_in_namespace_namespace.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_lineage_children.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_logs_checkout.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_memory_requests_namespace.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_top_nodes.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_top_pods.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/test_case.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/35_tempo/toolsets.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/45_fetch_deployment_logs_simple/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/kubectl_describe_pod_long-logs-app-84fbcbfb5f-57zrx_ask-holmes-namespace-47.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/kubectl_find_resource_long-logs-app_pod.txt (0 hunks)
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/manifest.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/test_case.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/48_logs_since_thursday/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/49_logs_since_last_week/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/50_logs_since_specific_date/test_case.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/50a_logs_since_last_specific_month/test_case.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/57_cluster_name_confusion/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/57_wrong_namespace/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/67_performance_degradation/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/68_cascading_failures/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/69_rate_limit_exhaustion/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/71_connection_pool_starvation/test_case.yaml (0 hunks)
- tests/llm/fixtures/test_ask_holmes/74_config_change_impact/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/75_network_flapping/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/91h_datadog_logs_empty_query_with_url/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/98_logs_transparency_default_time/test_case.yaml (1 hunks)
💤 Files with no reviewable changes (23)
- tests/llm/fixtures/test_ask_holmes/35_tempo/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_top_nodes.txt
- tests/llm/fixtures/test_ask_holmes/50_logs_since_specific_date/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_trace_by_id.txt
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_get_by_kind_in_namespace_namespace.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_fraud.txt
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/kubectl_describe_pod_long-logs-app-84fbcbfb5f-57zrx_ask-holmes-namespace-47.txt
- tests/llm/fixtures/test_ask_holmes/71_connection_pool_starvation/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/50a_logs_since_last_specific_month/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_checkout.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_logs_checkout.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_memory_requests_namespace.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_describe_backend_deployment.txt
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_traces_by_deployment.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_lineage_children.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/fetch_tempo_trace_by_id_2.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_get_by_kind_in_namespace_depl.txt
- tests/llm/fixtures/test_ask_holmes/47_truncated_logs_context_window/kubectl_find_resource_long-logs-app_pod.txt
- tests/llm/fixtures/test_ask_holmes/35_tempo/kubectl_top_pods.txt
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/*.yaml
: Each LLM test must use a dedicated Kubernetes namespace of the form app-<testid>
All pod names in LLM tests must be unique and use neutral naming that does not hint at the problem
Files:
tests/llm/fixtures/test_ask_holmes/160c_cpu_per_namespace_graph_with_global_truncation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/160b_cpu_per_namespace_graph_with_prom_truncation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/117b_new_relic_block_embed/test_case.yaml
tests/llm/fixtures/test_ask_holmes/100a_historical_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml
tests/llm/fixtures/test_ask_holmes/103_logs_transparency_default_limit/test_case.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/test_case.yaml
tests/llm/fixtures/test_ask_holmes/100b_historical_logs_nonstandard_label/test_case.yaml
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/123_new_relic_checkout_errors_tracing/test_case.yaml
tests/llm/fixtures/test_ask_holmes/118_new_relic_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/121_new_relic_checkout_errors_tracing/test_case.yaml
tests/llm/fixtures/test_ask_holmes/122_new_relic_checkout_latency_tracing_rebuild/test_case.yaml
tests/llm/fixtures/test_ask_holmes/33_cpu_metrics_discovery/test_case.yaml
tests/llm/fixtures/test_ask_holmes/110_cpu_graph_robusta_runner/test_case.yaml
tests/llm/fixtures/test_ask_holmes/101_historical_logs_pod_deleted/test_case.yaml
tests/llm/fixtures/test_ask_holmes/120_new_relic_traces2/test_case.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml
tests/llm/fixtures/test_ask_holmes/98_logs_transparency_default_time/test_case.yaml
tests/llm/fixtures/test_ask_holmes/45_fetch_deployment_logs_simple/test_case.yaml
tests/llm/fixtures/test_ask_holmes/102_loki_label_discovery/test_case.yaml
tests/llm/fixtures/test_ask_holmes/57_wrong_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml
tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml
tests/llm/fixtures/test_ask_holmes/74_config_change_impact/test_case.yaml
tests/llm/fixtures/test_ask_holmes/49_logs_since_last_week/test_case.yaml
tests/llm/fixtures/test_ask_holmes/69_rate_limit_exhaustion/test_case.yaml
tests/llm/fixtures/test_ask_holmes/67_performance_degradation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml
tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
tests/llm/fixtures/test_ask_holmes/57_cluster_name_confusion/test_case.yaml
tests/llm/fixtures/test_ask_holmes/75_network_flapping/test_case.yaml
tests/llm/fixtures/test_ask_holmes/30_basic_promql_graph_cluster_memory/test_case.yaml
tests/llm/fixtures/test_ask_holmes/160a_cpu_per_namespace_graph/test_case.yaml
tests/llm/fixtures/test_ask_holmes/32_basic_promql_graph_pod_cpu/test_case.yaml
tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml
tests/llm/fixtures/test_ask_holmes/91h_datadog_logs_empty_query_with_url/test_case.yaml
tests/llm/fixtures/test_ask_holmes/68_cascading_failures/test_case.yaml
tests/llm/fixtures/test_ask_holmes/48_logs_since_thursday/test_case.yaml
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Test files should mirror the source structure under tests/
Files:
tests/llm/fixtures/test_ask_holmes/160c_cpu_per_namespace_graph_with_global_truncation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/160b_cpu_per_namespace_graph_with_prom_truncation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/117b_new_relic_block_embed/test_case.yaml
tests/llm/fixtures/test_ask_holmes/100a_historical_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml
tests/llm/fixtures/test_ask_holmes/103_logs_transparency_default_limit/test_case.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/test_case.yaml
tests/llm/fixtures/test_ask_holmes/100b_historical_logs_nonstandard_label/test_case.yaml
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/123_new_relic_checkout_errors_tracing/test_case.yaml
tests/llm/fixtures/test_ask_holmes/118_new_relic_logs/test_case.yaml
tests/llm/fixtures/test_ask_holmes/121_new_relic_checkout_errors_tracing/test_case.yaml
tests/llm/fixtures/test_ask_holmes/122_new_relic_checkout_latency_tracing_rebuild/test_case.yaml
tests/llm/fixtures/test_ask_holmes/33_cpu_metrics_discovery/test_case.yaml
tests/llm/fixtures/test_ask_holmes/110_cpu_graph_robusta_runner/test_case.yaml
tests/llm/fixtures/test_ask_holmes/101_historical_logs_pod_deleted/test_case.yaml
tests/llm/fixtures/test_ask_holmes/120_new_relic_traces2/test_case.yaml
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml
tests/llm/fixtures/test_ask_holmes/98_logs_transparency_default_time/test_case.yaml
tests/llm/fixtures/test_ask_holmes/45_fetch_deployment_logs_simple/test_case.yaml
tests/llm/fixtures/test_ask_holmes/102_loki_label_discovery/test_case.yaml
tests/llm/fixtures/test_ask_holmes/57_wrong_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml
tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml
tests/llm/fixtures/test_ask_holmes/74_config_change_impact/test_case.yaml
tests/llm/fixtures/test_ask_holmes/49_logs_since_last_week/test_case.yaml
tests/llm/fixtures/test_ask_holmes/69_rate_limit_exhaustion/test_case.yaml
tests/llm/fixtures/test_ask_holmes/67_performance_degradation/test_case.yaml
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml
tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml
tests/llm/fixtures/test_ask_holmes/159_prometheus_high_cardinality_cpu/test_case.yaml
tests/llm/fixtures/test_ask_holmes/57_cluster_name_confusion/test_case.yaml
tests/llm/fixtures/test_ask_holmes/75_network_flapping/test_case.yaml
tests/llm/fixtures/test_ask_holmes/30_basic_promql_graph_cluster_memory/test_case.yaml
tests/llm/fixtures/test_ask_holmes/160a_cpu_per_namespace_graph/test_case.yaml
tests/llm/fixtures/test_ask_holmes/32_basic_promql_graph_pod_cpu/test_case.yaml
tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml
tests/llm/fixtures/test_ask_holmes/91h_datadog_logs_empty_query_with_url/test_case.yaml
tests/llm/fixtures/test_ask_holmes/68_cascading_failures/test_case.yaml
tests/llm/fixtures/test_ask_holmes/48_logs_since_thursday/test_case.yaml
tests/**/toolsets.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/toolsets.yaml
: In evals, ALL toolset-specific configuration must be under the config field in toolsets.yaml (do not place toolset config in test_case.yaml)
toolsets.yaml may only use the allowed top-level fields: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP only)
Files:
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
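The allowed-top-level-fields rule above also lends itself to a mechanical check. A minimal sketch — the validator is a hypothetical helper, not repo tooling, and it assumes a toolset definition has already been parsed from toolsets.yaml into a dict:

```python
# Allowed top-level fields for a toolset entry in an eval toolsets.yaml,
# per the review guideline quoted above (url is MCP-only).
ALLOWED_FIELDS = {
    "enabled", "name", "description", "additional_instructions",
    "prerequisites", "tools", "docs_url", "icon_url",
    "installation_instructions", "config", "url",
}

def invalid_fields(toolset_definition: dict) -> set:
    """Return top-level fields not permitted in an eval toolsets.yaml entry."""
    return set(toolset_definition) - ALLOWED_FIELDS

# A definition that wrongly puts provider settings at top level
# instead of nesting them under config
bad = {"enabled": True, "api_key": "xxx", "config": {"region": "us"}}
print(invalid_fields(bad))  # {'api_key'}
```

Note that this also catches the related rule about toolset configuration: provider-specific settings like the `api_key` above belong under `config`, so they surface as invalid top-level fields.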
pyproject.toml
📄 CodeRabbit inference engine (CLAUDE.md)
Define and maintain the canonical list of pytest markers in pyproject.toml
Files:
pyproject.toml
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to tests/**/*.py : Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags
📚 Learning: 2025年09月28日T19:51:49.611Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to holmes/plugins/toolsets/**/*.yaml : Toolsets MUST return detailed error messages to enable LLM self-correction
Applied to files:
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
📚 Learning: 2025年09月28日T19:51:49.611Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to holmes/plugins/toolsets/**/*.yaml : Toolsets must be defined as YAML under holmes/plugins/toolsets (either as {name}.yaml or within a {name}/ directory)
Applied to files:
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
📚 Learning: 2025年09月28日T19:51:49.611Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to tests/**/toolsets.yaml : In evals, ALL toolset-specific configuration must be under the config field in toolsets.yaml (do not place toolset config in test_case.yaml)
Applied to files:
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
📚 Learning: 2025年08月05日T00:42:23.792Z
Learnt from: vishiy
PR: robusta-dev/holmesgpt#782
File: config.example.yaml:31-49
Timestamp: 2025年08月05日T00:42:23.792Z
Learning: In robusta-dev/holmesgpt config.example.yaml, the azuremonitorlogs toolset configuration shows "enabled: true" as an example of how to enable the toolset, not as a default setting. The toolset is disabled by default and requires explicit enablement in user configurations.
Applied to files:
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml
📚 Learning: 2025年09月28日T19:51:49.611Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to pyproject.toml : Define and maintain the canonical list of pytest markers in pyproject.toml
Applied to files:
pyproject.toml
📚 Learning: 2025年09月28日T19:51:49.611Z
Learnt from: CR
PR: robusta-dev/holmesgpt#0
File: CLAUDE.md:0-0
Timestamp: 2025年09月28日T19:51:49.611Z
Learning: Applies to tests/**/*.py : Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags
Applied to files:
pyproject.toml
🪛 Checkov (3.2.334)
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml
[medium] 155-190: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 155-190: Minimize the admission of root containers
(CKV_K8S_23)
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml
[medium] 188-249: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 188-249: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Pre-commit checks
- GitHub Check: llm_evals
🔇 Additional comments (60)
tests/llm/fixtures/test_ask_holmes/117b_new_relic_block_embed/test_case.yaml (1)
12-12
: LGTM! The addition of the
embeds
tag is appropriate. Based on the PR changes, this marker has been properly declared inpyproject.toml
.tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/manifest.yaml (5)
1-4
: LGTM!The namespace
app-47
correctly follows the requiredapp-<testid>
naming convention.As per coding guidelines.
13-68
: LGTM! The main application script generates realistic verbose logs with red-herring errors (lines 57-59) to test the LLM's ability to filter noise. The use of
flush=True
ensures logs are immediately visible.
70-126
: LGTM! The sidecar script effectively simulates a connection pool exhaustion scenario. The critical error appears only once, at iteration 61 (lines 112-113), making it a good test of the LLM's ability to identify root causes in verbose multi-container logs.
128-185
: LGTM!The monitoring script generates high-volume verbose logs that add realistic noise to the test scenario, helping validate the LLM's ability to filter relevant information.
191-191
: LGTM!The deployment name
data-processor
uses neutral naming that doesn't hint at the problem being tested.As per coding guidelines.
tests/llm/fixtures/test_ask_holmes/47_multicontainer_logs/test_case.yaml (2)
2-5: All tags are valid pytest markers. The markers `logs`, `context_window`, and `kubernetes` are declared in `pyproject.toml`.
9-10: Confirm or align manifest path resolution in tests. Most fixtures use `kubectl ... -f manifest.yaml` (no `./`), so either remove the `./` prefix here for consistency or ensure the test harness runs in the fixture's directory so that `./manifest.yaml` resolves correctly.
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/toolsets.yaml (1)
1-5: LGTM! The toolsets configuration correctly enables the Kubernetes core and logs toolsets needed for this log analysis test. The structure follows the required format.
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/test_case.yaml (4)
1-1: LGTM! The user prompt correctly references namespace `app-35`, which matches the required `app-<testid>` pattern for test ID 35.
7-8: LGTM! The expected output is clear and specific, correctly identifying the root cause that the LLM should discover (migration failure due to a missing locations table).
10-13: LGTM! The test lifecycle commands correctly apply and clean up the manifest, and the description clearly explains the test objective.
2-5: No action necessary: the tags "logs", "context_window", and "database" are declared in pyproject.toml.
tests/llm/fixtures/test_ask_holmes/35_error_log_flood/manifest.yaml (2)
1-4: LGTM! The namespace correctly follows the required `app-<testid>` naming pattern for test ID 35.
6-153: LGTM! The Secret correctly contains the test scenario:
- the Python app generates a realistic log flood with database errors
- the SQL migration deliberately references a non-existent `locations` table (line 129), creating the root cause that the test expects the LLM to identify
The Secret name uses neutral naming as required.
tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml (1)
9-9: LGTM! Clearer skip reason. The expanded skip_reason provides better context about the prerequisites (port-forward setup) and future improvements needed (Kafka timeout handling). This improves test maintainability.
tests/llm/fixtures/test_ask_holmes/123_new_relic_checkout_errors_tracing/test_case.yaml (2)
20-81: LGTM: Namespace and pod naming comply with guidelines. The test correctly uses the `app-123` namespace format and employs neutral pod names (billing, traffic-generator) that don't hint at the underlying problem.
13-13: The newrelic pytest marker is declared. The "newrelic" tag is already listed under [tool.pytest.ini_options] markers in pyproject.toml (line 122).
tests/llm/fixtures/test_ask_holmes/69_rate_limit_exhaustion/test_case.yaml (2)
1-1: Verify namespace naming convention. The user prompt references "namespace-69", but coding guidelines require the format `app-<testid>`. This test should use `app-69` instead of `namespace-69`. As per coding guidelines.
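As a hedged illustration of the convention being enforced here (the literal values are taken from this comment, not read from the fixture itself, and a real check would parse the fixture's test_case.yaml), the `app-<testid>` pattern check can be sketched in shell:

```shell
# Minimal sketch: flag a fixture namespace that does not match app-<testid>.
testid=69
ns="namespace-69"   # value the review found in the user prompt
if [[ "$ns" =~ ^app-${testid}$ ]]; then
  echo "ok: $ns"
else
  echo "violation: $ns should be app-${testid}"
fi
```

Run as-is this prints `violation: namespace-69 should be app-69`; with `ns="app-69"` the same check prints `ok: app-69`.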
5-5
: Verify that the "medium" tag is a declared pytest marker.Please ensure the "medium" tag is properly declared in pyproject.toml.
Based on learnings.
tests/llm/fixtures/test_ask_holmes/68_cascading_failures/test_case.yaml (2)
1-1: Verify namespace naming convention. The user prompt references "namespace-68", but coding guidelines require the format `app-<testid>`. This test should use `app-68` instead of `namespace-68`. As per coding guidelines.
6-6: Verify that the "medium" tag is a declared pytest marker. Please ensure the "medium" tag is properly declared in pyproject.toml. Based on learnings.
tests/llm/fixtures/test_ask_holmes/66_http_error_needle/test_case.yaml (1)
5-5: No action needed: the 'medium' marker is already listed under tool.pytest.ini_options.markers in pyproject.toml.
tests/llm/fixtures/test_ask_holmes/45_fetch_deployment_logs_simple/test_case.yaml (1)
7-8: pytest markers validated: both "logs" and "kubernetes" are declared under `[tool.pytest.ini_options]` in pyproject.toml.
tests/llm/fixtures/test_ask_holmes/122_new_relic_checkout_latency_tracing_rebuild/test_case.yaml (2)
20-23: LGTM! The namespace `app-122` correctly follows the required pattern `app-<testid>` for LLM tests. As per coding guidelines.
8-13: All tags are valid pytest markers. All five tags (kubernetes, hard, chain-of-causation, traces, and newrelic) are declared in pyproject.toml.
30-38: LGTM! Namespace follows coding guidelines. The namespace `app-159` correctly follows the required pattern `app-<testid>` for LLM tests. As per coding guidelines.
tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml (2)
1-1: LGTM! Description accurately reflects test purpose. The description change from "pulls logs" to "pulls metrics" correctly aligns with the test's actual functionality.
12-15: Declare the missing pytest marker "newrelic". The "newrelic" tag in tests/llm/fixtures/test_ask_holmes/119_new_relic_metrics/test_case.yaml isn't listed under `tool.pytest.ini_options.markers` in pyproject.toml. Add it there (or remove the tag) to comply with the canonical marker list.
⛔ Skipped due to learnings
Learnt from: CR PR: robusta-dev/holmesgpt#0 File: CLAUDE.md:0-0 Timestamp: 2025-09-28T19:51:49.611Z Learning: Applies to pyproject.toml : Define and maintain the canonical list of pytest markers in pyproject.toml
Learnt from: CR PR: robusta-dev/holmesgpt#0 File: CLAUDE.md:0-0 Timestamp: 2025-09-28T19:51:49.611Z Learning: Applies to tests/**/*.py : Only use pytest markers that are declared in pyproject.toml; never introduce undeclared markers/tags
tests/llm/fixtures/test_ask_holmes/160a_cpu_per_namespace_graph/test_case.yaml (1)
12-14: LGTM! Tags added for test categorization. The addition of the `embeds`, `metrics`, and `medium`
tags expands test metadata for better categorization and filtering. Based on the PR summary, these correspond to new pytest markers declared in pyproject.toml.tests/llm/fixtures/test_ask_holmes/70_memory_leak_detection/test_case.yaml (1)
5-5
: LGTM! Tag added for test categorization.The addition of the
medium
tag expands test metadata for better categorization.tests/llm/fixtures/test_ask_holmes/33_cpu_metrics_discovery/test_case.yaml (1)
10-10
: LGTM! Tag added for test categorization.The addition of the
metrics
tag appropriately categorizes this CPU metrics discovery test.tests/llm/fixtures/test_ask_holmes/67_performance_degradation/test_case.yaml (2)
5-5
: LGTM! Tag added for test categorization.The addition of the
medium
tag expands test metadata for better categorization.
10-10
: Verify namespace naming convention.The namespace
namespace-67
does not follow the requiredapp-<testid>
format specified in the coding guidelines for LLM tests. This should likely beapp-67
instead for consistency with the guidelines.As per coding guidelines.
tests/llm/fixtures/test_ask_holmes/160c_cpu_per_namespace_graph_with_global_truncation/test_case.yaml (2)
17-19: LGTM! Tags added for test categorization. The addition of the `embeds`, `metrics`, and `medium` tags expands test metadata consistently with related fixtures (160a, 160b).
30-31: Test skip is properly documented. The skip reason clearly explains that the `TOOL_MAX_ALLOCATED_CONTEXT_WINDOW_PCT` environment variable override doesn't work as expected because it's read once at startup. This is good documentation for future reference.
12-16
: Port forward configuration is now explicit and clear.The expanded port_forwards section with explicit
local_port
andremote_port
improves clarity over the previous implicit configuration.
6-10
: All tags declared in pytest markers Verified thatprometheus
,embeds
,metrics
, andmedium
are all listed undermarkers
inpyproject.toml
.tests/llm/fixtures/test_ask_holmes/74_config_change_impact/test_case.yaml (2)
6-10
: Enhanced expected output improves test specification.The addition of specific expected output items about configuration-change impact and performance patterns makes the test expectations clearer and more verifiable.
3-5
: No action needed: pytest markers are declared
The tagslogs
,context_window
, andmedium
appear undertool.pytest.ini_options.markers
inpyproject.toml
.tests/llm/fixtures/test_ask_holmes/48_logs_since_thursday/test_case.yaml (1)
2-5
: All tags are declared pytest markers Thelogs
,datetime
, andmedium
markers are listed under[tool.pytest.ini_options].markers
in pyproject.toml, so no changes are needed.tests/llm/fixtures/test_ask_holmes/57_cluster_name_confusion/test_case.yaml (2)
3-6
: All pytest markers declared
Bothlogs
andmedium
appear under themarkers
list in pyproject.toml; no changes needed.
7-11
: Verify external manifest’s namespace isolation
The external manifest athttps://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/image_pull_backoff/no_such_image.yaml
contains nometadata.namespace
fields. Per LLM test guidelines, resources must be scoped to the dedicated namespaceapp-57
. Confirm that these resources will indeed be applied intoapp-57
—for example, by addingmetadata.namespace: app-57
to the manifest or by invokingkubectl -n app-57 apply ...
.tests/llm/fixtures/test_ask_holmes/121_new_relic_checkout_errors_tracing/test_case.yaml (2)
13-13
: Verify that the "newrelic" tag is a declared pytest marker.Ensure that
newrelic
corresponds to a declared pytest marker inpyproject.toml
. This tag is not mentioned in the PR summary as one of the newly added markers.Based on learnings
1-2
: Consider more neutral pod naming.The pod name
checkout
(lines 26, and referenced throughout the setup) directly identifies the service type and is explicitly mentioned in the user prompt at line 2. According to the coding guidelines, "pod names must be unique and use neutral naming that does not hint at the problem."Since the test investigates errors in the checkout service, using the name
checkout
may hint at which service is problematic. Consider using a more neutral name likeservice-a
,web-app
, orapi-service
to better align with the guideline.As per coding guidelines
Also applies to: 20-26, 50-51
tests/llm/fixtures/test_ask_holmes/98_logs_transparency_default_time/test_case.yaml (1)
19-19: LGTM: loki tag addition. The `loki` tag is mentioned in the PR summary as one of the newly introduced pytest markers, so this addition is appropriate.
15-17
: All added tags are declared pytest markers. No changes needed:embeds
,metrics
, andmedium
all appear under[tool.pytest.ini_options.markers]
in pyproject.toml.tests/llm/fixtures/test_ask_holmes/32_basic_promql_graph_pod_cpu/test_case.yaml (3)
3-4
: LGTM! Clear validation criteria for PromQL execution. The expected_output explicitly validates that the tool_name and random_key are present in the response, which provides concrete verification criteria for the test.
12-16
: LGTM! Port forward configuration is now explicit. Adding explicit local_port and remote_port values improves clarity and makes the port forwarding configuration more maintainable.
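For illustration, an explicit entry of this shape might look as follows (the namespace value is an assumption following the `app-<testid>` convention, not this fixture's actual config):

```yaml
port_forwards:
  - namespace: app-32        # dedicated test namespace, app-<testid> convention
    service: robusta-kube-prometheus-st-prometheus
    local_port: 9090
    remote_port: 9090
```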
6-10
: All pytest markers declared. Verified that "embeds", "metrics", and "medium" are present in pyproject.toml.
pyproject.toml (1)
121-124
: LGTM! New pytest markers properly declared. The three new markers (loki, embeds, no-cicd) are well-documented and follow the established format. This maintains the canonical marker list as required.
Based on learnings.
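As a sketch of what such declarations look like (the marker names come from this PR, but the descriptions below are placeholders, not the repository's actual text):

```toml
[tool.pytest.ini_options]
markers = [
    "loki: evals that exercise the Loki logs toolset",
    "embeds: evals that validate embedded graphs in responses",
    "no-cicd: evals excluded from CI/CD runs",
]
```

Tagged subsets can then be selected at run time, e.g. `pytest -m "loki and not no-cicd" tests/llm/`.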
tests/llm/fixtures/test_ask_holmes/103_logs_transparency_default_limit/test_case.yaml (1)
18-22
: LGTM! Loki tag addition. The "loki" tag is properly added and matches the new marker declared in pyproject.toml. The test also correctly uses the "app-103" namespace following the required naming convention.
tests/llm/fixtures/test_ask_holmes/57_wrong_namespace/test_case.yaml (2)
7-9
: Resolved: 'logs' and 'medium' markers are declared in pyproject.toml
1-5
: Remove namespace naming guideline check for this fixture. This test deliberately uses `test-57` to validate the "wrong namespace" suggestion logic; no changes needed to `manifest.yaml`.
Likely an incorrect or invalid review comment.
tests/llm/fixtures/test_ask_holmes/75_network_flapping/test_case.yaml (1)
2-6
: 'medium' tag already declared. The `medium` marker exists in pyproject.toml (line 108).
14-14
: LGTM! Tag addition aligns with PR objectives.The "loki" tag correctly categorizes this test fixture as Loki-related, consistent with the metadata expansion across test fixtures.
tests/llm/fixtures/test_ask_holmes/100a_historical_logs/test_case.yaml (2)
1-1
: Minor: Empty line at file start.This is a minor formatting change with no functional impact.
15-15
: LGTM! Tag addition aligns with PR objectives.The "loki" tag correctly categorizes this test fixture as Loki-related.
tests/llm/fixtures/test_ask_holmes/102_loki_label_discovery/test_case.yaml (1)
27-27
: LGTM! Tag addition aligns with PR objectives.The "loki" tag correctly categorizes this Loki label discovery test fixture.
tests/llm/fixtures/test_ask_holmes/101_historical_logs_pod_deleted/test_case.yaml (1)
18-18
: Approve test marker addition
The "loki" marker is declared in pyproject.toml under [tool.pytest.ini_options] (e.g., line 121), so the new tag aligns with the existing pytest configuration.
🧩 Analysis chain
Verify all tags are declared pytest markers.
The test adds four new tags: `datadog`, `metrics`, `medium`, and `embeds`. While `embeds` is confirmed as a new marker in this PR, ensure all other tags (`datadog`, `metrics`, `medium`) are properly declared in `pyproject.toml`.
Based on learnings: Only use pytest markers that are declared in pyproject.toml.
🏁 Script executed:
#!/bin/bash
# Verify that all tags used in test fixtures are declared as pytest markers in pyproject.toml
# Extract markers from pyproject.toml
echo "=== Declared pytest markers in pyproject.toml ==="
rg -A 50 '^\[tool\.pytest\.ini_options\]' pyproject.toml | rg 'markers\s*=' -A 20
echo ""
echo "=== Tags used in this test case ==="
echo "datadog, metrics, medium, embeds"
Length of output: 1836
Add missing pytest markers `datadog` and `embeds`. Add both `datadog` and `embeds` to the `markers` list under `[tool.pytest.ini_options]` in pyproject.toml so all tags used in tests are declared.
🤖 Prompt for AI Agents
In tests/llm/fixtures/test_ask_holmes/110_cpu_graph_robusta_runner/test_case.yaml
around lines 28 to 32, the test tags include "datadog" and "embeds" but those
markers are not declared; open pyproject.toml and under
[tool.pytest.ini_options] add "datadog" and "embeds" to the markers list so
every tag used in tests is declared (ensure syntax matches existing marker
entries and save).
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify that "medium" tag is a declared pytest marker.
The "medium" tag is being added. Per the learnings, only pytest markers declared in pyproject.toml should be used.
Based on learnings.
Run the following script to verify this tag is declared:
🏁 Script executed:
#!/bin/bash # Description: Verify "medium" is a declared pytest marker rg -A 1 'markers\s*=' pyproject.toml | grep -i medium || echo "⚠️ 'medium' marker not found in pyproject.toml"
Length of output: 46
Declare the "medium" pytest marker or remove its usage
The "medium" tag isn’t listed under [tool.pytest.ini_options] markers
in pyproject.toml. Add it there or eliminate the tag from test_case.yaml
to comply with pytest marker declarations.
🤖 Prompt for AI Agents
In tests/llm/fixtures/test_ask_holmes/120_new_relic_traces2/test_case.yaml
around lines 9 to 12 the test uses the "medium" tag but that marker is not
declared in [tool.pytest.ini_options] markers in pyproject.toml; either add a
"medium" marker entry to pyproject.toml under markers (e.g. 'medium:
description') or remove the "medium" tag from this test_case.yaml so pytest
marker declarations remain consistent.
🧩 Analysis chain
Verify that all tags are declared pytest markers.
The tags `prometheus`, `embeds`, `metrics`, and `medium` are being added to this test fixture. According to project learnings, only pytest markers declared in `pyproject.toml` should be used.
Based on learnings.
Run the following script to verify these markers are declared:
🏁 Script executed:
#!/bin/bash
# Description: Check if the tags used in this test fixture are declared as pytest markers in pyproject.toml
# Search for pytest marker declarations
rg -A 20 '^\[tool\.pytest\.ini_options\]|^markers\s*=' pyproject.toml
Length of output: 1742
🏁 Script executed:
#!/bin/bash
# Show lines around port_forwards in the YAML file
rg -n 'port_forwards' -C3 tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml
Length of output: 152
Fix undeclared pytest markers and namespace
- Remove or declare the `prometheus` and `embeds` tags in pyproject.toml; only `metrics` and `medium` are valid markers.
- Change `port_forwards.namespace` from `default` to the required dedicated namespace `app-34_memory_graph` (must use `app-<testid>`).
In tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml around
lines 5 to 10, the YAML uses undeclared pytest markers and the wrong namespace:
remove or declare the prometheus and embeds tags (only metrics and medium are
valid markers per pyproject.toml) — either delete the prometheus and embeds
lines or add them to pyproject.toml under [tool.pytest.ini_options].markers;
also change port_forwards.namespace from default to the required dedicated
namespace app-34_memory_graph (must use app-<testid>) so the test uses the
correct namespace.
Use a dedicated namespace following the required naming convention.
The namespace `default` violates the coding guideline requiring each LLM test to use a dedicated Kubernetes namespace of the form `app-<testid>`. For this test, the namespace should be `app-34_memory_graph`.
As per coding guidelines.
Apply this diff to fix the namespace:
- - namespace: default
+ - namespace: app-34_memory_graph
    service: robusta-kube-prometheus-st-prometheus
    local_port: 9090
    remote_port: 9090
📝 Committable suggestion
⚠️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In tests/llm/fixtures/test_ask_holmes/34_memory_graph/test_case.yaml around line
12, the namespace is set to "default" which violates the guideline requiring a
dedicated namespace per test; update the namespace value to
"app-34_memory_graph" (i.e., replace the existing namespace entry with
namespace: app-34_memory_graph) so the test uses the correct dedicated
Kubernetes namespace following the naming convention.
🧩 Analysis chain
Verify that the "medium" tag is declared in pyproject.toml.
Per the learnings, only pytest markers declared in pyproject.toml should be used. Confirm that "medium" is a declared marker.
Based on learnings.
🏁 Script executed:
#!/bin/bash # Description: Check if "medium" is declared as a pytest marker in pyproject.toml # Search for pytest marker declarations including "medium" rg -A 20 '^\[tool\.pytest\.ini_options\]' pyproject.toml | rg -i 'markers.*=|medium'
Length of output: 104
Next: Verify namespace and pod naming in kafka-manifest.yaml
🏁 Script executed:
#!/bin/bash
# Locate the manifest file and inspect namespace and pod names
manifest=$(fd --extension yaml tests/llm/fixtures/test_ask_holmes/55_kafka_runbook | grep -v "test_case.yaml")
echo "Inspecting manifest: $manifest"
grep -n "namespace:" "$manifest"
grep -nP "metadata:\s*name:" "$manifest"
Length of output: 599
Next: Inspect namespace and pod names in all YAML manifests
🏁 Script executed:
#!/bin/bash
# List all YAML files in the kafka runbook fixture directory
files=$(fd . tests/llm/fixtures/test_ask_holmes/55_kafka_runbook -e yaml)
echo "Found YAML files:"
echo "$files"
# Inspect each for namespace and pod naming
for f in $files; do
  echo "---- $f ----"
  grep -n "namespace:" "$f" || echo "No namespace found"
  grep -nP "metadata:\s*$" -A2 "$f" | grep -n "name:" || echo "No metadata.name found"
done
Length of output: 1107
Fix namespace and duplicate pod names in kafka-manifest.yaml
- Namespace must follow `app-<testid>` (e.g. app-55) instead of ask-holmes-namespace-55 (in tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/app/kafka-manifest.yaml).
- Duplicate metadata.name "zookeeper" found; ensure each pod/service name is unique.
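The duplicate-name finding above can be reproduced mechanically. A minimal sketch, using a throwaway directory in place of the real fixture path (an assumption here, not the PR's layout), that surfaces duplicated `metadata.name` values:

```shell
# Create a throwaway directory standing in for the fixture's manifest folder
dir=$(mktemp -d)
printf 'metadata:\n name: zookeeper\n' > "$dir/a.yaml"
printf 'metadata:\n name: zookeeper\n' > "$dir/b.yaml"
printf 'metadata:\n name: kafka\n' > "$dir/c.yaml"
# Any name emitted here is declared more than once across the manifests
grep -h ' name:' "$dir"/*.yaml | sort | uniq -d
```

Here the duplicated `zookeeper` line is printed once, while the unique `kafka` entry is not.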
🤖 Prompt for AI Agents
In tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/test_case.yaml around
lines 6-6, the test references a manifest that contains incorrect namespace and
duplicate pod/service names; open
tests/llm/fixtures/test_ask_holmes/55_kafka_runbook/app/kafka-manifest.yaml and
change the namespace to the required pattern app-55 (replace
ask-holmes-namespace-55), then ensure every metadata.name in pods/services is
unique (rename duplicate "zookeeper" resources to distinct names like
zookeeper-0 / zookeeper-service or similar) and update any matching selectors,
service names, env vars, and references accordingly so names remain consistent
across manifests.
No description provided.