Adding AEF Evaluation #523

Open

aminikha wants to merge 5 commits into awslabs:main from aminikha:main

aminikha commented Oct 20, 2025

Amazon Bedrock AgentCore Samples Pull Request

Important

  1. We strictly follow an issue-first approach. Please open an issue relating to this Pull Request first.
  2. Once this Pull Request is ready for review, please attach the review ready label to it. Only PRs with the review ready label will be reviewed.

Issue number:

Concise description of the PR

Added an Evaluation folder as part of 03-integrations. It contains a sample notebook to calculate metrics and test agents deployed in AgentCore.

User experience

Please share what the user experience looks like before and after this change

Checklist

If your change doesn't seem to apply, please leave them unchecked.

  • [x] I have reviewed the contributing guidelines
  • [x] Add your name to CONTRIBUTORS.md
  • [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Are you uploading a dataset? No
  • [x] Have you documented Introduction, Architecture Diagram, Prerequisites, Usage, Sample Prompts, and Clean Up steps in your example README?
  • [x] I agree to resolve any issues created for this example in the future.
  • [x] I have performed a self-review of this change
  • [x] Changes have been tested
  • [x] Changes are documented

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

github-actions bot added the 03-integrations label Oct 20, 2025

github-actions bot commented Oct 20, 2025 (edited)

Latest scan for commit: 576a689 | Updated: 2025-10-24 05:30:25 UTC

Security Scan Results

Scan Metadata

  • Project: ASH
  • Scan executed: 2025-10-24T05:30:07+00:00
  • ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

  • Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
  • Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
  • High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
  • Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
  • Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
  • Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

  • Time: Duration taken by each scanner to complete its analysis
  • Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

  • PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
  • FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
  • MISSING: Scanner could not run because required dependencies/tools are not installed or available
  • SKIPPED: Scanner was intentionally disabled or excluded from this scan
  • ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

  • CRITICAL: Only Critical severity findings cause scanner to fail
  • HIGH: High and Critical severity findings cause scanner to fail
  • MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
  • LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
  • ALL: Any finding of any severity level causes scanner to fail

Threshold Source: Values in parentheses indicate where the threshold is configured:

  • (g) = global: Set in the global_settings section of ASH configuration
  • (c) = config: Set in the individual scanner configuration section
  • (s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

  • All statistics are calculated from the final aggregated SARIF report
  • Suppressed findings are counted separately and do not contribute to actionable findings
  • Scanner status is determined by comparing actionable findings to the threshold

Scanner          S  C  H  M  L    I  Time    Action  Result   Thresh
bandit           0  7  0  0  287  0  4.7s    7       FAILED   MED (g)
cdk-nag          0  0  0  0  0    0  32.9s   0       PASSED   MED (g)
cfn-nag          0  0  0  0  0    0  29ms    0       PASSED   MED (g)
checkov          0  2  0  0  0    0  7.5s    2       FAILED   MED (g)
detect-secre...  0  0  0  0  0    0  3.5s    0       PASSED   MED (g)
grype            0  0  0  0  0    0  32.9s   0       PASSED   MED (g)
npm-audit        0  0  0  0  0    0  969ms   0       PASSED   MED (g)
opengrep         0  0  0  0  0    0  <1ms    0       SKIPPED  MED (g)
semgrep          0  1  0  0  0    0  17.5s   1       FAILED   MED (g)
syft             0  0  0  0  0    0  2.1s    0       PASSED   MED (g)
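
The threshold and statistics rules in the legend can be sketched in a few lines of Python. This is only an illustration of the rules as the report describes them, not ASH's actual implementation; `RANK` and `scanner_result` are hypothetical names, and the counts are copied from the bandit row of the table.

```python
# Severity ranks, lowest to highest, following the report legend.
RANK = {"INFO": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def scanner_result(counts, threshold="MEDIUM"):
    """Return (actionable, result) for one scanner.

    `counts` maps a severity name to the number of unsuppressed findings.
    Suppressed findings are assumed to be excluded before this call,
    since the legend says they do not count as actionable.
    """
    # "ALL" means any finding, whatever its severity, is actionable.
    floor = 0 if threshold == "ALL" else RANK[threshold]
    actionable = sum(n for sev, n in counts.items() if RANK[sev] >= floor)
    return actionable, ("FAILED" if actionable > 0 else "PASSED")


# bandit row from the table: 7 in column C, 287 in column L, threshold MED.
bandit = {"CRITICAL": 7, "LOW": 287}
print(scanner_result(bandit))  # (7, 'FAILED')
```

With the MED threshold the 287 low-severity findings are not actionable, which matches the Action value of 7 in the bandit row.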

Detailed Findings

10 actionable findings

Finding 1: B615

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B615
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/embeddings/base.py:234-236

Description:
Unsafe Hugging Face Hub download without revision pinning in from_pretrained()

Code Snippet:

) from exc
 config = AutoConfig.from_pretrained(self.model_name)
 self.is_cross_encoder = bool(

Finding 2: B615

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B615
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/metrics/_faithfulness.py:279-283

Description:
Unsafe Hugging Face Hub download without revision pinning in from_pretrained()

Code Snippet:

)
self.nli_classifier = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
self.nli_classifier.to(self.device)
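
One common way to address B615 is to pass a `revision` argument, ideally a full commit SHA, to `from_pretrained()` or `load_dataset()`. Pinning matters most where `trust_remote_code=True` is set, because the downloaded repository's code is executed. The sketch below is an assumption about how such a fix could look, not the change made in this PR; `require_pinned_revision` is a hypothetical helper and `PINNED_SHA` stands in for a commit hash you have reviewed.

```python
import re

# A Hub commit hash is a 40-character lowercase hex git SHA.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def require_pinned_revision(revision: str) -> str:
    """Return `revision` if it is a full commit SHA, else raise ValueError.

    Branch names such as "main" and tags are rejected because they can
    later be moved to point at different content.
    """
    if not _COMMIT_SHA.fullmatch(revision):
        raise ValueError(f"revision {revision!r} is not a full commit SHA")
    return revision


# Usage sketch (requires `transformers`; PINNED_SHA is a placeholder that
# you would replace with a commit hash you have audited):
#
#   from transformers import AutoModelForSequenceClassification
#   model = AutoModelForSequenceClassification.from_pretrained(
#       "vectara/hallucination_evaluation_model",
#       revision=require_pinned_revision(PINNED_SHA),
#       trust_remote_code=True,
#   )
```

The same `revision` keyword is accepted by `datasets.load_dataset()`, which is the call flagged in the test files.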

Finding 3: B310

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B310
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/prompt/multi_modal_prompt.py:201-203

Description:
Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.

Code Snippet:

def download_and_encode_image(self, url):
    with urllib.request.urlopen(url) as response:
        return base64.b64encode(response.read()).decode("utf-8")
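
A minimal sketch of one possible mitigation is to check the URL scheme against an allowlist before calling `urlopen`. The names `ALLOWED_SCHEMES` and `validate_url`, and the 10-second timeout, are illustrative assumptions rather than part of ragas. The check only blocks `file:` and other unexpected schemes; it does not restrict which hosts can be reached.

```python
import base64
import urllib.request
from urllib.parse import urlparse

# Only plain web URLs are accepted; file:, ftp:, data:, etc. are refused.
ALLOWED_SCHEMES = {"http", "https"}


def validate_url(url: str) -> str:
    """Return `url` unchanged if it is an http(s) URL with a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        raise ValueError(f"refusing to open URL with scheme {parsed.scheme!r}")
    return url


def download_and_encode_image(url: str, timeout: float = 10.0) -> str:
    """Download an image and return it base64-encoded as text."""
    # The scheme is validated before the request is made.
    with urllib.request.urlopen(validate_url(url), timeout=timeout) as response:
        return base64.b64encode(response.read()).decode("utf-8")
```

Static scanners may still flag the `urlopen` call after this change, so the finding might also need an explicit, reviewed suppression.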

Finding 4: B113

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B113
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/sdk.py:28-39

Description:
Call to requests without timeout

Code Snippet:

app_token = get_app_token()
response = requests.post(
    f"{base_url}/api/v1{path}",
    data=data_json_string,
    headers={
        "Content-Type": "application/json",
        "x-app-token": app_token,
        "x-source": RAGAS_API_SOURCE,
        "x-app-version": __version__,
    },
)
if response.status_code == 403:
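
B113 is usually resolved by passing `timeout=` to the `requests` call, since `requests` waits indefinitely by default. The wrapper below is a hedged sketch: `post_json`, `DEFAULT_TIMEOUT`, and the injectable `post` parameter are hypothetical and exist only to make the example self-contained. The timeout values are illustrative.

```python
import json

# (connect, read) timeouts in seconds; values chosen for illustration only.
DEFAULT_TIMEOUT = (3.05, 30)


def post_json(url, payload, headers=None, timeout=DEFAULT_TIMEOUT, post=None):
    """POST `payload` as JSON, always with an explicit timeout.

    `post` defaults to `requests.post`; it is a parameter so the function
    can be exercised without network access.
    """
    if post is None:
        import requests  # imported lazily so the sketch loads without it

        post = requests.post
    merged = {"Content-Type": "application/json", **(headers or {})}
    return post(url, data=json.dumps(payload), headers=merged, timeout=timeout)
```

When the timeout expires, `requests` raises `requests.exceptions.Timeout`, which the caller can catch and retry or report.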

Finding 5: B615

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B615
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/benchmark_eval.py:18-20

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

# data
ds = load_dataset("explodinggradients/amnesty_qa", "english_v2")
assert isinstance(ds, DatasetDict)

Finding 6: B615

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B615
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/e2e/test_amnesty_in_ci.py:17-19

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v3")["eval"] # type: ignore

Finding 7: B615

  • Severity: HIGH
  • Scanner: bandit
  • Rule ID: B615
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/e2e/test_fullflow.py:13-15

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

def test_evaluate_e2e():
    ds = load_dataset("explodinggradients/amnesty_qa", "english_v3")["eval"]  # type: ignore
    result = evaluate(

Finding 8: CKV_DOCKER_2

  • Severity: HIGH
  • Scanner: checkov
  • Rule ID: CKV_DOCKER_2
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/Dockerfile:1-8

Description:
Ensure that HEALTHCHECK instructions have been added to container images

Code Snippet:

FROM python:3.9-slim
RUN apt-get update && apt-get install -y git make
COPY . /app
WORKDIR /app
RUN pip install -e /app/
ARG OPENAI_API_KEY
ENV OPENAI_API_KEY=$OPENAI_API_KEY
RUN make run-benchmarks

Finding 9: CKV_DOCKER_3

  • Severity: HIGH
  • Scanner: checkov
  • Rule ID: CKV_DOCKER_3
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/Dockerfile:1-8

Description:
Ensure that a user for the container has been created

Code Snippet:

FROM python:3.9-slim
RUN apt-get update && apt-get install -y git make
COPY . /app
WORKDIR /app
RUN pip install -e /app/
ARG OPENAI_API_KEY
ENV OPENAI_API_KEY=$OPENAI_API_KEY
RUN make run-benchmarks

Finding 10: python.lang.security.audit.dynamic-urllib-use-detected.dynamic-urllib-use-detected

  • Severity: HIGH
  • Scanner: semgrep
  • Rule ID: python.lang.security.audit.dynamic-urllib-use-detected.dynamic-urllib-use-detected
  • Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/prompt/multi_modal_prompt.py:202

Description:
Detected a dynamic value being used with urllib. urllib supports 'file://' schemes, so a dynamic value controlled by a malicious actor may allow them to read arbitrary files. Audit uses of urllib calls to ensure user data cannot control the URLs, or consider using the 'requests' library instead.

Code Snippet:

with urllib.request.urlopen(url) as response:

Report generated by Automated Security Helper (ASH) at 2025-10-24T05:30:01+00:00

aminikha marked this pull request as draft October 20, 2025 21:17
aminikha marked this pull request as ready for review October 20, 2025 21:17
aminikha changed the title from "Adding AEF Evalu" to "Adding AEF Evaluation" Oct 20, 2025
akshseh assigned akshseh and mvangara10 and unassigned akshseh Oct 21, 2025
Reviewers

No reviews

Labels

03-integrations
