fix(metrics): route calculate_factuality parse through JSONHandler (ACC no longer collapses to 0.25·cosine under gpt-4o-mini) #56

Open
henrik-dahlberg wants to merge 1 commit into GraphRAG-Bench:main from henrik-dahlberg:fix/factuality-json-parse
henrik-dahlberg commented Jun 14, 2026

Bug

Evaluation/metrics/answer_accuracy.py::calculate_factuality parses the classifier reply with bare json.loads(response.content) inside try/except: return 0.0. The default judge (gpt-4o-mini, temperature=0, no JSON mode — as generation_eval.py builds it) returns the JSON wrapped as Output: {...}\n\nReasoning: ..., so json.loads raises JSONDecodeError and factuality silently returns 0.0 for every sample. ACC = 0.75·factuality + 0.25·cosine therefore degenerates to 0.25·cosine and cannot separate correct from incorrect answers.
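The failure mode described above can be reproduced without calling a model. The following is a minimal sketch, not the repo's code: `parse_or_zero` is a hypothetical stand-in for the swallow-and-return-0.0 pattern, and the payload `{"label": "example"}` is invented for illustration (the real classifier schema is not shown in this PR).

```python
import json


def parse_or_zero(reply: str) -> float:
    """Mimic the pattern from the bug report: any parse error is scored 0.0."""
    try:
        json.loads(reply)
    except Exception:
        # The JSONDecodeError is swallowed here, so the caller never sees it.
        return 0.0
    # Placeholder: a real implementation would compute a score from the parsed data.
    return 1.0


# Hypothetical replies: one wrapped the way the judge is reported to answer,
# one containing only the JSON object.
wrapped = 'Output: {"label": "example"}\n\nReasoning: the statements agree.'
bare = '{"label": "example"}'

wrapped_score = parse_or_zero(wrapped)  # json.loads fails on the "Output: " prefix
bare_score = parse_or_zero(bare)        # parses cleanly
print(wrapped_score, bare_score)
```

Because the exception is caught broadly, the wrapped reply is indistinguishable from a genuinely wrong answer at the call site.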

This is the only LLM-parse site in the metrics that does not use JSONHandler.parse_with_fallbacks: generate_statements (same file), coverage.py, and faithfulness.py all route through it and are unaffected.

Fix

Route the factuality parse through the same handler (already imported at the top of the file):

parsed = await JSONHandler().parse_with_fallbacks(response.content)
classification = ClassificationWithReason(**parsed)
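For readers unfamiliar with fallback parsing, here is a rough sketch of the general idea. It is not the implementation of JSONHandler.parse_with_fallbacks, whose internals are not shown in this PR; `extract_first_json_object` is a hypothetical, synchronous helper built only on the standard library.

```python
import json
from typing import Any


def extract_first_json_object(text: str) -> dict[str, Any]:
    """Return the first JSON object found in text, tolerating surrounding prose."""
    # Fast path: the reply is already pure JSON.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Fallback: try to decode an object starting at each "{" in turn.
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)

    raise ValueError("no JSON object found in reply")


# Hypothetical wrapped reply; the payload is illustrative only.
reply = 'Output: {"label": "example"}\n\nReasoning: the statements agree.'
recovered = extract_first_json_object(reply)
print(recovered)
```

Raising on total failure, rather than returning a default score, keeps a parsing problem visible instead of folding it into the metric.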

Verification

compute_answer_correctness on a perfect-match pair, with the repo's own config (gpt-4o-mini + BAAI/bge-large-en-v1.5, CPU):

compute_answer_correctness(
    "What did the cat do?",
    "The cat sat on the mat.",
    "The cat sat on the mat.",  # answer == ground_truth
)
| | ACC (perfect match, expected ~1.0) |
| --- | --- |
| before (json.loads) | 0.25 |
| after (parse_with_fallbacks) | 1.00 |
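The two reported values are consistent with the formula ACC = 0.75·factuality + 0.25·cosine. The sketch below works through the arithmetic under two assumptions that the numbers imply but the PR does not state directly: identical strings give a cosine similarity of 1.0, and the fixed parse yields a factuality of 1.0 for this pair. The function name `acc` is illustrative.

```python
def acc(factuality: float, cosine: float) -> float:
    """Weighted answer-correctness score, using the weights quoted in this PR."""
    return 0.75 * factuality + 0.25 * cosine


# Assumption: answer == ground_truth, so the embedding cosine is taken as 1.0.
cosine = 1.0

before = acc(factuality=0.0, cosine=cosine)  # parse failed, factuality forced to 0.0
after = acc(factuality=1.0, cosine=cosine)   # parse succeeded (assumed factuality 1.0)

# With factuality pinned at 0.0, the score can never exceed 0.25.
ceiling_when_broken = acc(factuality=0.0, cosine=1.0)

print(f"before={before:.2f} after={after:.2f} ceiling={ceiling_when_broken:.2f}")
```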

calculate_factuality parsed the classifier reply with bare
json.loads(response.content) inside `except: return 0.0`. Under the default
judge (gpt-4o-mini, temperature 0, no JSON mode) the reply is wrapped as
"Output: {...}\n\nReasoning: ...", so json.loads raises and factuality
silently returns 0.0 for every sample -> ACC collapses to 0.25*cosine and
cannot separate correct from incorrect answers.

It was the only LLM-parse site in the metrics not using
JSONHandler.parse_with_fallbacks (generate_statements in the same file,
coverage.py, and faithfulness.py all use it). Route it through the same handler.

Repro (repo config, perfect-match pair):

    compute_answer_correctness(
        "What did the cat do?",
        "The cat sat on the mat.",
        "The cat sat on the mat.")

    before: 0.25    after: 1.00