Metric	Baseline (Same Prompt)	Adversarial (Attack)	Impact
Phi Correlation	+0.42	-0.80	Moderate positive → Strong negative
Agreement	70%	10%	Models almost never overlap now
Disagreement	30%	90%	Massive increase in unique signal
Beyond-chance co‐failure	+10%	-20%	Fail together less than random chance
Effective sample size (n_eff)	1.7	4.4	Normalized independence gain

LLM Independence Experiment – Groq Llama 3.1 vs OpenRouter Gemma 4

Different roles give some independence, but not real independence.
— Marco Somma

We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?

Then, based on a brilliant suggestion by Nazar Boyko, we ran a second experiment with adversarial framing: instead of both models answering the same question, the second model was tasked with attacking the first model's verdict.

The results are dramatic.

📊 Baseline Results (Both models answer the same prompt)

Metric	Value
Phi correlation	0.417
Cohen's kappa	0.400
Agreement	70%
Disagreement	30%
Effective sample size (n_eff)	35.3 (from 50 tests)
Beyond‐chance co‐failure	+10%

Vulnerability rates

Groq (Llama 3.1 8B) : 50% vulnerable
OpenRouter (Gemma 4 31B) : 36% vulnerable

Contingency table (Baseline)

Model B

...

Top comments (1)

ggle_in profile image

HARD IN SOFT OUT

I believe good technical writing should either teach you something or make you feel less alone in the dystopia. Preferably both.

Location

Indonesia
Education

still learning, havent found the end of education yet...
Pronouns

human! hey look... thats human too!
Work

gardener who love to code and write things. please let me know if i can fit the slot on your team.
Joined

Apr 17, 2026

• Jun 13

I was genuinely shocked that changing the prompt structure did more for independence than changing the entire model vendor. Has anyone else experimented with 'Red Teaming' their own validation pipelines? I'm curious if this holds up at much larger scale (e.g., Llama 405B).

And this script is part of my current project LLM Security Audit (will publish soon.)

DEV Community

I Made Two AI Models Fight Each Other. They Agreed Way Too Much.

🎯 The Short Version (for busy seniors)

🧪 The Experiment (What I Actually Did)

📊 The Numbers That Made Me Re-evaluate Everything

🔍 What negative phi actually means

😂 The Dark Joke (Because We All Need It)

🧠 What I Learned (And What You Should Steal)

1. Same Models + Different Tasks = True Independence

2. Correlation is about Alignment Lineage

3. The Real Signal is Disagreement

📦 Open Source Reference

GitHub logo setuju / LLM-Independence-Experiment