Adding AEF Evaluation #523

Open

aminikha wants to merge 5 commits into awslabs:main

from aminikha:main

@aminikha aminikha commented Oct 20, 2025

Amazon Bedrock AgentCore Samples Pull Request

Important

We strictly follow an issue-first approach. Please open an issue relating to this Pull Request first.
Once this Pull Request is ready for review, attach the review ready label to it. Only PRs labeled review ready will be reviewed.

Issue number:

Concise description of the PR

Added an evaluation folder as part of 03-integrations. It contains a sample notebook to calculate metrics and test agents deployed in AgentCore.

User experience

Please share what the user experience looks like before and after this change

Checklist

If your change doesn't seem to apply, please leave them unchecked.

[x] I have reviewed the contributing guidelines
[x] Add your name to CONTRIBUTORS.md
[x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
Are you uploading a dataset? No
[x] Have you documented Introduction, Architecture Diagram, Prerequisites, Usage, Sample Prompts, and Clean Up steps in your example README?
[x] I agree to resolve any issues created for this example in the future.
[x] I have performed a self-review of this change
[x] Changes have been tested
[x] Changes are documented

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

aminikha added 2 commits

October 16, 2025 14:27

@aminikha


 add aef evaluation notebook

3e8f471

@aminikha


 fix the readme

397446c

review-notebook-app bot commented Oct 20, 2025

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

@github-actions github-actions bot added the 03-integrations label

Oct 20, 2025

github-actions bot commented Oct 20, 2025 (edited)

Latest scan for commit: 576a689 | Updated: 2025-10-24 05:30:25 UTC

Security Scan Results

Scan Metadata

Project: ASH
Scan executed: 2025-10-24T05:30:07+00:00
ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

Time: Duration taken by each scanner to complete its analysis
Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
MISSING: Scanner could not run because required dependencies/tools are not installed or available
SKIPPED: Scanner was intentionally disabled or excluded from this scan
ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

CRITICAL: Only Critical severity findings cause scanner to fail
HIGH: High and Critical severity findings cause scanner to fail
MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
ALL: Any finding of any severity level causes scanner to fail
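The threshold rules above can be read as a simple ordering check. The following is a minimal sketch of that reading, not ASH's actual implementation: the `Severity` enum, the `THRESHOLD_FLOOR` table, and both function names are hypothetical and exist only to illustrate how a threshold turns a list of findings into PASSED or FAILED.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the legend, ordered from least to most severe."""
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical lookup: the lowest severity that counts as actionable
# for each threshold name listed above.
THRESHOLD_FLOOR = {
    "CRITICAL": Severity.CRITICAL,
    "HIGH": Severity.HIGH,
    "MEDIUM": Severity.MEDIUM,
    "LOW": Severity.LOW,
    "ALL": Severity.INFO,
}


def actionable(findings, threshold):
    """Return the findings at or above the threshold's floor."""
    floor = THRESHOLD_FLOOR[threshold]
    return [severity for severity in findings if severity >= floor]


def scanner_result(findings, threshold):
    """A scanner fails as soon as one actionable finding exists."""
    return "FAILED" if actionable(findings, threshold) else "PASSED"
```

Under this reading, a scanner with a MEDIUM threshold that reports only Low findings still passes, while a single High finding fails it. Suppressed findings are assumed to be filtered out before `findings` is built.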

Threshold Source: Values in parentheses indicate where the threshold is configured:

(g) = global: Set in the global_settings section of ASH configuration
(c) = config: Set in the individual scanner configuration section
(s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

All statistics are calculated from the final aggregated SARIF report
Suppressed findings are counted separately and do not contribute to actionable findings
Scanner status is determined by comparing actionable findings to the threshold

Scanner	C	L	Time	Action	Result	Thresh
bandit	7	287	4.7s	7	FAILED	MED (g)
cdk-nag	0	0	32.9s	0	PASSED	MED (g)
cfn-nag	0	0	29ms	0	PASSED	MED (g)
checkov	2	0	7.5s	2	FAILED	MED (g)
detect-secre...	0	0	3.5s	0	PASSED	MED (g)
grype	0	0	32.9s	0	PASSED	MED (g)
npm-audit	0	0	969ms	0	PASSED	MED (g)
opengrep	0	0	<1ms	0	SKIPPED	MED (g)
semgrep	1	0	17.5s	1	FAILED	MED (g)
syft	0	0	2.1s	0	PASSED	MED (g)

Detailed Findings

Show 10 actionable findings

Finding 1: B615

Severity: HIGH
Scanner: bandit
Rule ID: B615
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/embeddings/base.py:234-236

Description:
Unsafe Hugging Face Hub download without revision pinning in from_pretrained()

Code Snippet:

) from exc
 config = AutoConfig.from_pretrained(self.model_name)
 self.is_cross_encoder = bool(
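One possible way to address B615 is to pass a `revision` argument, which both `from_pretrained()` and `load_dataset()` accept, so that the download resolves to a fixed commit rather than a moving branch. The sketch below is an assumption-laden illustration, not the fix applied in this PR: `pinned_revision` is a hypothetical helper, and insisting on a full 40-character commit hash is a design choice made here for strictness.

```python
import re

# A full Git commit hash: 40 lowercase hexadecimal characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_revision(revision: str) -> dict:
    """Return keyword arguments that pin a Hub download to one commit.

    Hypothetical helper: branch names such as "main" are rejected because
    they can move to a different commit after the code has been reviewed.
    """
    if not _COMMIT_SHA.fullmatch(revision):
        raise ValueError(f"expected a 40-character commit hash, got {revision!r}")
    return {"revision": revision}


# Placeholder hash for illustration only; replace it with a reviewed commit.
EXAMPLE_SHA = "0" * 40
hub_kwargs = pinned_revision(EXAMPLE_SHA)

# Intended use (not executed here, requires the transformers package):
#   config = AutoConfig.from_pretrained(model_name, **hub_kwargs)
```

The same keyword arguments could be passed to the `load_dataset()` calls flagged in the test files.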

Finding 2: B615

Severity: HIGH
Scanner: bandit
Rule ID: B615
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/metrics/_faithfulness.py:279-283

Description:
Unsafe Hugging Face Hub download without revision pinning in from_pretrained()

Code Snippet:

)
self.nli_classifier = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
self.nli_classifier.to(self.device)

Finding 3: B310

Severity: HIGH
Scanner: bandit
Rule ID: B310
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/prompt/multi_modal_prompt.py:201-203

Description:
Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.

Code Snippet:

def download_and_encode_image(self, url):
    with urllib.request.urlopen(url) as response:
        return base64.b64encode(response.read()).decode("utf-8")
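A common mitigation for this class of finding is to check the URL scheme before opening it. The sketch below shows one way that might look, assuming only `http` and `https` are legitimate sources for images. `require_http_url`, `ALLOWED_SCHEMES`, and the 10-second default timeout are hypothetical additions, and the function is written standalone rather than as the original method.

```python
import base64
import urllib.request
from urllib.parse import urlsplit

# Assumption: only web URLs are expected, so file:, ftp:, and custom
# schemes are rejected before any request is made.
ALLOWED_SCHEMES = {"http", "https"}


def require_http_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError(f"refusing to open URL with scheme {parts.scheme!r}")
    return url


def download_and_encode_image(url: str, timeout: float = 10.0) -> str:
    """Fetch an image over http(s) and return it base64-encoded."""
    checked = require_http_url(url)
    with urllib.request.urlopen(checked, timeout=timeout) as response:
        return base64.b64encode(response.read()).decode("utf-8")
```

A static scanner may still flag the `urlopen` call because it cannot see the runtime check, so an explicit suppression with a justification might be needed alongside the validation.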

Finding 4: B113

Severity: HIGH
Scanner: bandit
Rule ID: B113
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/sdk.py:28-39

Description:
Call to requests without timeout

Code Snippet:

app_token = get_app_token()
response = requests.post(
    f"{base_url}/api/v1{path}",
    data=data_json_string,
    headers={
        "Content-Type": "application/json",
        "x-app-token": app_token,
        "x-source": RAGAS_API_SOURCE,
        "x-app-version": __version__,
    },
)
if response.status_code == 403:
```
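The most direct fix for B113 is to add a `timeout=` argument to the `requests.post(...)` call itself, since `requests` otherwise waits indefinitely. Where many call sites are involved, a small wrapper that supplies a default is another option. The sketch below illustrates that wrapper under stated assumptions: `call_with_timeout` and `DEFAULT_TIMEOUT` are hypothetical names, and the timeout values are placeholders rather than recommendations from the scan.

```python
# (connect, read) timeouts in seconds; requests accepts a tuple in this form.
# The values are assumptions chosen for illustration.
DEFAULT_TIMEOUT = (5.0, 30.0)


def call_with_timeout(send, url, **kwargs):
    """Call an HTTP function, supplying a timeout when the caller gave none.

    `send` is any callable with the shape of requests.post or requests.get.
    An explicit timeout passed by the caller is left untouched.
    """
    kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
    return send(url, **kwargs)


# Intended use (not executed here, requires the requests package):
#   response = call_with_timeout(
#       requests.post,
#       f"{base_url}/api/v1{path}",
#       data=data_json_string,
#       headers=headers,
#   )
```

Passing the HTTP function as an argument keeps the sketch free of third-party imports and makes the default easy to check with a stub.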

Finding 5: B615

Severity: HIGH
Scanner: bandit
Rule ID: B615
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/benchmark_eval.py:18-20

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

# data
ds = load_dataset("explodinggradients/amnesty_qa", "english_v2")
assert isinstance(ds, DatasetDict)

Finding 6: B615

Severity: HIGH
Scanner: bandit
Rule ID: B615
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/e2e/test_amnesty_in_ci.py:17-19

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v3")["eval"] # type: ignore

Finding 7: B615

Severity: HIGH
Scanner: bandit
Rule ID: B615
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/e2e/test_fullflow.py:13-15

Description:
Unsafe Hugging Face Hub download without revision pinning in load_dataset()

Code Snippet:

def test_evaluate_e2e():
    ds = load_dataset("explodinggradients/amnesty_qa", "english_v3")["eval"]  # type: ignore
    result = evaluate(

Finding 8: CKV_DOCKER_2

Severity: HIGH
Scanner: checkov
Rule ID: CKV_DOCKER_2
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/Dockerfile:1-8

Description:
Ensure that HEALTHCHECK instructions have been added to container images

Code Snippet:

FROM python:3.9-slim
RUN apt-get update && apt-get install -y git make
COPY . /app
WORKDIR /app
RUN pip install -e /app/
ARG OPENAI_API_KEY
ENV OPENAI_API_KEY=$OPENAI_API_KEY
RUN make run-benchmarks

Finding 9: CKV_DOCKER_3

Severity: HIGH
Scanner: checkov
Rule ID: CKV_DOCKER_3
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/tests/benchmarks/Dockerfile:1-8

Description:
Ensure that a user for the container has been created

Code Snippet:

FROM python:3.9-slim
RUN apt-get update && apt-get install -y git make
COPY . /app
WORKDIR /app
RUN pip install -e /app/
ARG OPENAI_API_KEY
ENV OPENAI_API_KEY=$OPENAI_API_KEY
RUN make run-benchmarks

Finding 10: python.lang.security.audit.dynamic-urllib-use-detected.dynamic-urllib-use-detected

Severity: HIGH
Scanner: semgrep
Rule ID: python.lang.security.audit.dynamic-urllib-use-detected.dynamic-urllib-use-detected
Location: 03-integrations/Evaluation/AEF/ragas-evaluation/src/ragas/prompt/multi_modal_prompt.py:202

Description:
Detected a dynamic value being used with urllib. urllib supports 'file://' schemes, so a dynamic value controlled by a malicious actor may allow them to read arbitrary files. Audit uses of urllib calls to ensure user data cannot control the URLs, or consider using the 'requests' library instead.

Code Snippet:

with urllib.request.urlopen(url) as response:

Report generated by Automated Security Helper (ASH) at 2025-10-24T05:30:01+00:00

@aminikha aminikha marked this pull request as draft

October 20, 2025 21:17

@aminikha aminikha marked this pull request as ready for review

October 20, 2025 21:17

@aminikha aminikha changed the title from "Adding AEF Evalu" to "Adding AEF Evaluation"

Oct 20, 2025

@akshseh akshseh assigned akshseh and mvangara10 and unassigned akshseh

Oct 21, 2025

aminikha added 3 commits

October 23, 2025 17:02

@aminikha


 fix security

afe405e

@aminikha


 Merge branch 'main' into main

2e72c07

Signed-off-by: aminikha <aminikha@amazon.com>

@aminikha


 fixing requirements

576a689

Labels

03-integrations

3 participants

@aminikha @akshseh @mvangara10
