We Asked Four AI Systems to Benchmark a CEO's Pay. Only One Was Both Right and Done.

What a head-to-head on a real compensation study taught us about why the model isn't the moat.

Executive compensation benchmarking looks like the perfect job for AI: pull a CEO's pay from the proxy, assemble the disclosed peer group, gather each peer's pay, compute the distribution, and write it up for the compensation committee. So, we ran an experiment. We gave four AI systems the same assignment — a CEO pay benchmark for PTC Inc., built only from SEC filings, with an explicit methodology for which peers to include or exclude — and compared what came back.

The results sorted into four distinct behaviors. Only one of them is something you could put in front of a board.

The four behaviors

1. The free-form model fabricated. A fast, ungrounded chatbot produced a beautiful seven-section memo — confident prose, clean tables, a textbook outlier test. The problem: the peer data was invented. Every peer's pay was a suspiciously round number, contradicted by the actual filings, and the model declared the set "identical to the company's disclosures." It then rated its own work 9.5 out of 10. It looked usable and was, in fact, poison.

2. A careful general model refused. Asked to honor the same rigor, a leading general-purpose model did the honest thing: it verified the one figure it could source, recognized it lacked filing-level access to the rest, and declined to proceed, rather than estimate. Intellectually honest — and commercially useless. A committee needs an answer, not an abstention. Its refusal is, in effect, the problem statement.

3. A frontier deep-research agent got close. Running a frontier model at a high reasoning level, this system did real retrieval, computed every statistic correctly, and — crucially — disclosed that its peer figures came from secondary mirrors of the filings rather than the filings themselves, rating its own confidence "moderate-high" for exactly that reason. Genuinely strong work. But it still slipped: it justified including one Canadian peer by citing a US proxy filing that does not exist. Powerful reasoning, secondary sourcing, and one confident factual invention buried in an otherwise excellent memo.

4. The workflow agent was both right and done. Our AnyQuest workflow agent — running on Sonnet with no extended reasoning — produced a committee-ready memo grounded in primary EDGAR filings, with code-verified statistics and correct provenance judgments on the hard edge cases. It was the only one of the four that was both trustworthy and complete.

The headline is the one that should give every AI buyer pause: the system that won was not running the most powerful model. A mid-tier model inside the right workflow beat a frontier model reasoning at full tilt — because the difference that mattered wasn't intelligence. It was infrastructure.

Why the workflow agent won

It is fully auditable. Every step — which filing was pulled, which figure was extracted, how each statistic was computed — is inspectable. When a number is questioned, you can trace it to its source. The deep-research agent, by contrast, produced an answer you have to trust; the workflow agent produced one you can verify.

It encodes the firm's methodology. The rules for including or excluding a peer — stale fiscal years, partial-year CEOs, one-time mega-grants, founder pay, foreign filers — aren't reinvented per run. They're encoded once, as the consulting firm's own house methodology, and applied the same way every time.

It is repeatable. This is the quiet differentiator. A deep-research agent invents a fresh plan on every run, which introduces randomness: run it twice and you may get two different peer sets and two different conclusions. The workflow agent runs the same methodology across every proxy benchmark. For a consulting firm whose credibility rests on defensible, consistent process, repeatability isn't a nicety — it's the product.

It is cheaper and faster. And it isn't close. Because the workflow agent runs on a mid-tier model with no extended reasoning, it produced the winning memo at 50% of the cost. That's the result worth sitting with: the least expensive system also produced the only board-ready deliverable. The heavy lifting was done by the workflow — grounded data access, deterministic computation, encoded methodology — not by burning tokens on a bigger model. Better outcome, lower cost.

The edge case that proved the point

One peer, Open Text, is a Canadian company. It files annual reports with the SEC in US dollars — but discloses executive pay in a Canadian management information circular, not a US proxy statement. The frontier deep-research agent "resolved" this by inventing a US proxy filing. The workflow agent correctly identified where the pay actually lives and handled it accordingly.

That is the whole thesis in miniature. The hard part of this work isn't writing fluent prose about compensation — every modern model can do that. The hard part is knowing where a number truly comes from, and being right about it every time. Fluency is abundant; grounded, repeatable judgment is not.

The takeaway

A generic chatbot will either confidently make things up or honestly give up. Neither is a deliverable. The value is in the system around the model — grounded data access, deterministic computation, encoded methodology, and end-to-end auditability — that turns a capable model into a dependable analyst. The moat isn't the model. It's the workflow.

Click here to view the report produced by the workflow agent.

We Asked Four AI Systems to Benchmark a CEO's Pay. Only One Was Both Right and Done.

One-Pizza Teams Change Everything

AI-Assisted Consulting Discovery

From Chatbots to Competitive Advantage: The Four Stages of Enterprise AI Maturity