fix(metrics): route calculate_factuality parse through JSONHandler (ACC no longer collapses to 0.25·cosine under gpt-4o-mini) #56

Open

henrik-dahlberg wants to merge 1 commit into GraphRAG-Bench:main from henrik-dahlberg:fix/factuality-json-parse

henrik-dahlberg commented Jun 14, 2026

Bug

`Evaluation/metrics/answer_accuracy.py::calculate_factuality` parses the classifier reply with a bare `json.loads(response.content)` inside a `try`/`except: return 0.0`. The default judge (gpt-4o-mini, temperature=0, no JSON mode, as `generation_eval.py` builds it) returns the JSON wrapped as `Output: {...}\n\nReasoning: ...`, so `json.loads` raises `JSONDecodeError` and factuality silently returns 0.0 for every sample. `ACC = 0.75·factuality + 0.25·cosine` therefore degenerates to `0.25·cosine` and cannot separate correct from incorrect answers.

This is the only LLM-parse site in the metrics not using `JSONHandler.parse_with_fallbacks`; `generate_statements` (same file), `coverage.py`, and `faithfulness.py` all route through it and are unaffected.
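The failure mode above can be reproduced without calling the judge. The snippet below is a minimal sketch: the `reply` string and its `"verdict"` field are hypothetical stand-ins for the wrapped output, and `factuality_with_bare_loads` only imitates the swallow-and-return-0.0 pattern rather than copying the repo's function.

```python
import json

# Hypothetical reply in the wrapped shape described above. The payload fields
# are placeholders, not the repo's actual classification schema.
reply = 'Output: {"verdict": "placeholder"}\n\nReasoning: the answer matches.'


def factuality_with_bare_loads(content: str) -> float:
    """Imitates the pre-fix pattern: a parse error becomes a silent 0.0."""
    try:
        json.loads(content)  # fails: the text starts with "Output:", not JSON
    except json.JSONDecodeError:
        return 0.0
    return 1.0  # placeholder score for a reply that parsed cleanly


print(factuality_with_bare_loads(reply))                        # wrapped reply
print(factuality_with_bare_loads('{"verdict": "placeholder"}'))  # clean JSON
```

Because the exception is swallowed, nothing in the logs distinguishes "the judge said every statement was wrong" from "the reply never parsed".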

Fix

Route the factuality parse through the same handler (already imported at the top of the file):

parsed = await JSONHandler().parse_with_fallbacks(response.content)
classification = ClassificationWithReason(**parsed)
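For readers unfamiliar with fallback parsing, here is one way a tolerant parser can recover the object from a wrapped reply. This is an illustrative sketch only: `extract_first_json_object` is a hypothetical helper, it is synchronous, and it is not the implementation of `JSONHandler.parse_with_fallbacks`, which may use different strategies.

```python
import json
from typing import Any


def extract_first_json_object(text: str) -> dict[str, Any]:
    """Return the first decodable JSON object embedded in free-form text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not begin valid JSON; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


# Hypothetical wrapped reply; the field name is a placeholder.
wrapped = 'Output: {"verdict": "placeholder"}\n\nReasoning: it matches.'
print(extract_first_json_object(wrapped))
```

Raising on total failure, instead of returning a default score, keeps an unparseable reply visible to the caller.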

Verification

compute_answer_correctness on a perfect-match pair, with the repo's own config (gpt-4o-mini + BAAI/bge-large-en-v1.5, CPU):

compute_answer_correctness(
 "What did the cat do?",
 "The cat sat on the mat.",
 "The cat sat on the mat.", # answer == ground_truth
)

| | ACC (perfect match, expected ~1.0) |
|---|---|
| before (`json.loads`) | 0.25 |
| after (`parse_with_fallbacks`) | 1.00 |
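The two reported values are consistent with the ACC formula from the Bug section. The check below is a sketch under two assumptions: identical answer and ground-truth strings give a cosine similarity of 1.0, and a perfect match yields a factuality of 1.0 once the reply parses. `answer_correctness` is a hypothetical helper name, not a function in the repo.

```python
def answer_correctness(factuality: float, cosine: float) -> float:
    """ACC as stated in the PR: 0.75 * factuality + 0.25 * cosine."""
    return 0.75 * factuality + 0.25 * cosine


# Assumption: identical strings embed to the same vector, so cosine = 1.0.
cosine_identical = 1.0

before = answer_correctness(0.0, cosine_identical)  # parse failed, factuality = 0.0
after = answer_correctness(1.0, cosine_identical)   # parse succeeded, perfect match

print(f"before: {before:.2f}  after: {after:.2f}")  # before: 0.25  after: 1.00
```

With factuality pinned at 0.0, ACC can never exceed 0.25, which is why a perfect match scored no higher than the cosine term alone.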

fix(metrics): route calculate_factuality parse through JSONHandler

91ccd13

calculate_factuality parsed the classifier reply with bare
json.loads(response.content) inside `except: return 0.0`. Under the default
judge (gpt-4o-mini, temperature 0, no JSON mode) the reply is wrapped as
"Output: {...}\n\nReasoning: ...", so json.loads raises and factuality
silently returns 0.0 for every sample -> ACC collapses to 0.25*cosine and
cannot separate correct from incorrect answers.

It was the only LLM-parse site in the metrics not using
JSONHandler.parse_with_fallbacks (generate_statements in the same file,
coverage.py, and faithfulness.py all use it). Route it through the same handler.

Repro (repo config, perfect-match pair):
  compute_answer_correctness(
    "What did the cat do?", "The cat sat on the mat.", "The cat sat on the mat.")
  before: 0.25  after: 1.00
