The Premise
What happens when you deploy seven specialized AI agents on a long-term research project — and one of them is specifically designed to tear the others’ work apart?
From October 2025 through January 2026, we ran exactly that experiment. The subject matter is biblical chronology — synchronizing ancient timelines across civilizations using astronomical records as anchor points. But the real story isn’t the chronology. It’s what we learned about how AI agents fail, how they compensate for each other’s blind spots, and why a single adversarial reviewer can be worth more than four agreeable ones.
The Research Question
The project digitizes Martin Anstey’s The Romance of Bible Chronology (1913) — a comprehensive but controversial work arguing that conventional ancient chronology contains a ~109-year error in the Persian period. Anstey claims that Ptolemy’s astronomical canon, universally accepted as the backbone of ancient dating, fabricated or duplicated roughly a century’s worth of Persian reigns.
The research question: Can independent astronomical evidence — eclipse records, orbital mechanics, ancient Babylonian observation tablets — validate or refute this claim?
This requires deep expertise across multiple domains simultaneously: ancient languages, orbital mechanics, biblical textual criticism, archaeological dating methods, and statistical analysis. No single researcher — human or AI — commands all of these. That’s what made it a perfect test case for multi-agent architecture.
Agent Architecture
We deployed seven Claude instances, each in its own Git repository with distinct rules files, access permissions, and behavioral constraints. The architecture is hub-and-spoke, with a human project lead as final authority and a Strategic Claude coordinating day-to-day work.
| Agent | Role |
|---|---|
| Strategic Claude (SC) | Methodology design, quality review, research coordination |
| Code Claude (CC) | Database extraction, SQL schemas, validation scripts |
| Textual Analysis Claude (TAC) | Extract arguments and reasoning from source texts |
| Critique Claude (CQC) | Adversarial reviewer — actively tries to falsify findings |
| Astronomical Claude (AC) | Eclipse calculations, orbital mechanics, ancient record analysis |
| UI Claude (UIC) | Flask web interface for human data validation |
| Validation Claude (VC) | Row-by-row data audit against source images |
Each agent operates under a .clinerules file — a behavioral contract ranging from 400 to 1,300 lines — specifying what the agent can and cannot do, how it must communicate, what quality standards it must meet, and when it must escalate to the human lead.
```
              Chris (Final Authority)
                        |
   Strategic Claude (SC) — Reviews, Approves, Coordinates
      /            |            |             \
 Code Claude      TAC          AC          UI Claude
 (Executor)    (Scholar)  (Astronomer)   (Interface)
      \            |            |             /
  Critique Claude (CQC) — Adversarial Review of ALL
```
Communication happens through git commits. Each agent pushes findings to their repository. SC reviews and approves before work proceeds to the next phase. CQC reviews everything before any finding is considered publication-ready.
The critical design decision: CQC is not a collaborator. Its success metric is finding flaws the other agents missed. It operates under explicit instructions to “distrust by default” and “assume findings are wrong until proven right.”
AI Failure Modes: What the Adversarial Agent Uncovered
We cataloged five distinct failure modes during the project — several of which were only caught because the adversarial reviewer existed.
Failure Mode 1: The Echo Chamber Effect
Before this project’s adversarial architecture was established, an earlier phase produced a paper claiming “99.9999% confidence” that Ptolemy’s ancient chronology was calculated rather than observed. Four Claude instances reviewed it:
- An Opus instance created the astronomical calculations
- A second Opus instance validated the code
- A Sonnet instance reviewed the argument
- A second Sonnet instance assessed the strategy
All four approved the paper. Final grade: 97/100.
The paper was built on a calendar conversion bug. Julian dates had been fed to a function expecting Gregorian dates. The varying offset (8 days at 721 BC, shrinking to 1 day at AD 134) created an artificial “systematic pattern” with R² = 0.82 — which was the paper’s entire thesis.
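The mechanics of the bug can be sketched in a few lines. This is not the project's actual conversion code; it is the standard century-rule formula for the Julian/Gregorian offset, and it assumes astronomical year numbering (721 BC = year -720), which the article does not specify:

```python
def julian_minus_gregorian_days(year: int) -> int:
    # Signed days to add to a proleptic Julian calendar label to get
    # the Gregorian label for the same physical day (valid March onward).
    # Standard century-rule formula; astronomical year numbering is an
    # assumption here, so 721 BC is year -720.
    return year // 100 - year // 400 - 2

offset_721_bc = julian_minus_gregorian_days(-720)  # -8: labels differ by 8 days
offset_ad_134 = julian_minus_gregorian_days(134)   # -1: down to a single day
```

Because this offset drifts smoothly across the studied span, feeding Julian labels into a routine expecting Gregorian ones injects a slowly varying residual, and a regression over those residuals reports a strong but entirely artificial trend.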
When the bug was corrected:
| Metric | Buggy | Corrected |
|---|---|---|
| Mean error | 3.82 days | -0.50 days |
| R² | 0.82 | 0.11 |
| P-value | < 0.000001 | ~0.20 |
The finding evaporated completely. An external reviewer (Grok-3) caught it by cross-checking against Fred Espenak’s NASA Five Millennium Catalog of Lunar Eclipses — something none of the four Claude instances had done.
The lesson: Multiple agents from the same model family share the same blind spots. Each assumed the others had verified the foundations. Nobody checked. This incident directly motivated the creation of CQC.
Failure Mode 2: Distraction by Sophistication
Later, the team developed a calendar-independent orbital mechanics methodology — genuinely innovative work using Saros cycle counting instead of calendar dates. Using this method, they analyzed LBAT *1419 (a Babylonian lunar eclipse tablet) and reported stunning results: “10 out of 10 observations show zero matches at conventional dates while showing multiple matches at +109 year offset.”
CQC initially rated this claim at 72% confidence, calling it “technically valid.”
It wasn’t. A fact-checking agent later discovered that the search algorithm was centered on the target date (+109 years). The search window extended ±60 orbits from that center — but the conventional zone ended 19 orbits before the search window began. The conventional chronology was never actually searched.
The gap was identical across all 10 observations:
| Observation | Conventional Zone End | Search Start | Gap |
|---|---|---|---|
| #1 | -2606 orbits | -2587 orbits | 19 orbits |
| #8 | -2549 orbits | -2530 orbits | 19 orbits |
| #10 | -2531 orbits | -2512 orbits | 19 orbits |
When corrected, the conventional zone contained comparable numbers of eclipse matches — and for one observation, more matches than the alternative.
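The window arithmetic behind the error is simple enough to reproduce. The helper below is hypothetical (the names and the window center are ours; the center is inferred from the table, since a window starting at -2587 with half-width 60 implies a center of -2527), but the orbit numbers come straight from observation #1:

```python
def conventional_zone_searched(zone_end_orbit, center_orbit, half_width=60):
    # A search window spans center_orbit +/- half_width orbits. Return
    # (searched, gap): whether the window reaches a conventional zone
    # ending at zone_end_orbit, and the shortfall in orbits if not.
    search_start = center_orbit - half_width
    gap = search_start - zone_end_orbit
    return gap <= 0, max(gap, 0)

# Observation #1: the conventional zone ends at -2606 orbits; the search
# started at -2587, implying a center of -2527 (inferred, not stated).
searched, gap = conventional_zone_searched(-2606, -2527)
# searched is False and gap is 19: the conventional zone was never reached.
```

Three lines of this kind of arithmetic, run once per observation, would have exposed the identical 19-orbit shortfall immediately.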
The lesson: Innovative methodology can mask basic arithmetic errors. The orbital mechanics were genuinely clever, which made reviewers less likely to check the simple math underneath. CQC’s own post-mortem acknowledged this: “The error was discoverable with basic arithmetic. I became one of those collaborative reviewers who approved flawed work.”
Failure Mode 3: Circular Validation
CQC identified five distinct circular reasoning loops. The most illustrative:
- Take eclipses believed to be 18 years apart (one Saros cycle)
- Check if they’re 18 years apart
- They are! (Because you selected them that way)
- Conclude: “Saros cycle validates our chronology!”
But Saros cycles work in both chronologies. The cycle is a physical constant — it confirms that eclipses repeat, not that any particular dating is correct.
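The circularity takes only a few lines to demonstrate. The dates below are illustrative, not the project's data; the point is that a check applied to pairs selected by interval passes under any chronology:

```python
SAROS_YEARS = 18.03  # one Saros cycle; a physical constant

def saros_check(eclipse_pairs, tol=0.1):
    # "Validate" that each pair of eclipse dates is one Saros apart.
    return all(abs((b - a) - SAROS_YEARS) < tol for a, b in eclipse_pairs)

# Illustrative dates (decimal years), chosen to be one Saros apart --
# which is exactly how the original pairs were selected.
pairs = [(-720.00, -701.97), (-620.50, -602.47)]

# The check passes, and keeps passing after shifting every date by
# +109 years: it cannot distinguish the two chronologies.
both_pass = saros_check(pairs) and saros_check(
    [(a + 109, b + 109) for a, b in pairs])
```

A test that cannot fail under the rival hypothesis carries no evidential weight, however precise its inputs look.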
Similarly, the team reported “85% confidence” in their hypothesis. CQC’s review found no statistical test, no null hypothesis, no p-value, no power analysis. The number was a subjective estimate dressed in the language of statistical rigor.
Failure Mode 4: Assumption Inheritance
Each agent tended to assume that earlier agents had verified foundational claims. When SC approved AC’s eclipse calculations, CC treated them as validated facts. When TAC extracted Anstey’s arguments, other agents cited them without independently checking the source material.
This created dependency chains where errors in early work propagated undetected through multiple downstream analyses — exactly analogous to unverified dependencies in a software supply chain.
Failure Mode 5: Confidence Without Evidence
Multiple agents produced high-confidence claims without corresponding evidentiary support:
- “99.9999% confidence” — no external validation performed
- “85% confidence” — no statistical test conducted
- “10/10 observations match” — search excluded the alternative
The pattern: AI agents readily generate precise-sounding confidence numbers that are actually subjective assessments. These numbers carry rhetorical weight far beyond their evidentiary basis.
Where Human Judgment Proved Essential
The “Truth Trumps Everything” Directive
When the 19-orbit gap error was discovered, the team faced a choice: quietly fix the methodology and hope the results still held, or publicly acknowledge that the previous claims were invalid.
The human lead chose transparency. The research direction was reframed from “proving alternative chronology” to understanding the evidentiary limitations of the source material — a more modest but intellectually honest outcome.
This required a value judgment that AI agents are structurally unable to make. The agents could analyze evidence and produce arguments, but deciding to sacrifice a desired conclusion in favor of methodological integrity was a human call.
Anti-Hallucination Protocol Design
After observing AI agents fabricate plausible-sounding data that passed basic validation, the project lead designed specific counter-protocols: mandatory sample extraction before full runs, mathematical validation of totals, visual spot-checks against source images, and an explicit instruction that “missing data with clear documentation is infinitely more valuable than complete data with hallucinated elements.”
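The "mathematical validation of totals" check can be sketched as a simple gate. Everything here is hypothetical (function name, error format, and the sample reign lengths); the idea is only that a batch must reconcile with the source's own stated totals before it is accepted:

```python
def validate_batch(rows, stated_total, stated_count):
    # Gate an extraction batch on arithmetic reconciliation with totals
    # stated in the source document itself. A flagged gap beats a gap
    # silently filled with plausible-looking values.
    errors = []
    if len(rows) != stated_count:
        errors.append(f"row count {len(rows)} != stated {stated_count}")
    if sum(rows) != stated_total:
        errors.append(f"sum {sum(rows)} != stated {stated_total}")
    return errors  # an empty list means the batch reconciles

# Hypothetical reign lengths against a stated 84-year span of 3 reigns:
ok = validate_batch([21, 8, 55], 84, 3)   # reconciles: []
bad = validate_batch([21, 8, 55], 85, 3)  # rejected with an explicit reason
```

The rejection message doubles as the "clear documentation" the protocol demands: a human can trace exactly which arithmetic failed and where.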
These protocols required understanding both AI vulnerability patterns and domain-specific plausibility — knowing which values are reasonable for ancient chronological data and which should trigger suspicion.
Model Selection as a Meta-Skill
The project lead developed criteria for recognizing when a model was mismatched to a task: three or more iterations without progress, overthinking simple visual extraction, or expressing uncertainty about straightforward decisions. Sonnet for systematic extraction of proven patterns. Opus for novel structure analysis and ambiguity resolution. This meta-cognitive assessment of AI capability is something the AI agents themselves couldn’t reliably perform.
What Worked Well
Independent Convergence
The most compelling result emerged when two completely independent analytical threads — astronomical eclipse analysis and philological textual analysis — arrived at the same approximate error magnitude (~100 years) through entirely different methods.
The textual agent extracted E.W. Bullinger’s “appellative method” from The Companion Bible (1922), identifying ~109 years of potentially duplicated Persian kings through throne-title analysis. Separately, the astronomical agent’s eclipse calculations showed that Ptolemy’s conventional dates produce barely visible partial eclipses (10-20% magnitude) while dates shifted by ~109 years produce total eclipses. Why would Babylonian astronomers record barely noticeable partials but ignore spectacular totals? That question emerged organically from multi-agent convergence, not from any single agent’s analysis.
Error Discovery Through Role Separation
The most critical errors were caught not by the agents who made them, but by agents with different mandates:
- A fact-checking agent caught the 19-orbit gap that the adversarial agent initially missed
- The adversarial agent caught circular reasoning that the strategic agent had approved
- An external model (Grok-3) caught the calendar bug that four Claude instances missed
Error detection scales with cognitive diversity, not with the number of agreeing reviewers.
The Adversarial Pivot
The project’s most valuable outcome may be the reframing itself. Adversarial review forced the team to abandon their original thesis and confront the actual evidentiary limits of the source material. The conclusion — that the Babylonian tablet data lacks the precision to validate any specific chronology — is less exciting than the original claim but far more defensible.
Without the adversarial agent, the team would likely have continued building on flawed analysis. With it, they course-corrected toward intellectual honesty.
What We’d Do Differently
More diverse models earlier. The calendar bug proved that multiple instances of the same model family share blind spots. Cross-model review would have caught the error faster.
Adversarial review from day one. CQC was created in response to the calendar bug. If it had existed from the start, the flawed paper would never have reached draft stage.
Formal statistical protocols. Subjective confidence estimates dressed in statistical language are worse than no estimates at all. Future work should either use proper Bayesian analysis or explicitly label assessments as qualitative.
Clearer separation between exploration and validation. Agents doing exploratory analysis sometimes self-validated their own findings before the adversarial reviewer could assess them, creating assumption inheritance chains.
The Takeaway
Multi-agent AI architecture works — not because more agents produce better answers, but because role separation creates cognitive diversity that catches errors collaborative agents share.
The single most valuable agent in this project was the one explicitly designed to disagree. Four collaborative reviewers gave a fatally flawed paper a 97/100. One adversarial reviewer would have killed it in minutes.
The second most valuable participant was the human who decided that truth matters more than confirming a desired conclusion. AI agents can analyze evidence and produce arguments. They cannot decide that intellectual honesty is worth more than a compelling result.
Multi-agent architecture is not a replacement for human judgment. It’s an amplifier — one that works best when at least one agent is actively trying to prove the others wrong.
This research was conducted using Claude (Anthropic) instances deployed across specialized roles with behavioral contracts, version-controlled repositories, and formal review protocols. The project continues.
Have a complex problem that needs more than a single perspective? We build AI-augmented solutions that deliver — and that know when to question their own answers. Let’s talk.
Published by Veritas Supera IT Solutions — veritassuperaitsolutions.com