Research Results: March 2026

Structured Specs Beat Prose
for AI Agent Execution

We tested whether JSON workflow specifications produce more consistent AI agent execution than traditional prose instructions. We ran the same workflow through three leading models in both formats and scored the results. The difference was decisive.

+37% average improvement
with structured specs

Prose instructions leave room for interpretation

Most AI agent workflows are defined as markdown documents: step-by-step prose that agents read, interpret, and execute. This works, but it introduces variability. Different models read the same instructions differently. Steps get skipped, reordered, or improvised. There's no way to validate whether the agent followed the workflow correctly.

📝 Prose Instructions

Natural language workflow documents. Human-readable and flexible, but agents must interpret them. Each model brings its own biases, assumptions, and tendencies. Verification steps are suggested, not enforced. The agent decides what "done" looks like.

⚙️ Structured JSON Specs

Machine-readable workflow specifications with explicit steps, tools, parameters, and mandatory validation blocks. Agents execute them mechanically. Every action has a verification check. The spec defines what "done" looks like, not the agent.


How we tested

We took a complex, real-world workflow (new project creation: 29 action steps, 3 human decision gates, multiple tools) and ran it as a dry-run planning exercise across three models in two formats. Each agent received the same scenario and was asked to walk through every step without executing.

1. Define Scenario: same fictional project across all 6 runs
2. Run 6 Agents: 3 models × 2 formats (prose & JSON)
3. Score Results: 6 criteria, 0-5 each, max 30 points
4. Compare: consistency, accuracy, improvisation

Models tested: Claude Opus 4.6 (Anthropic's flagship reasoning model), Claude Sonnet 4.0 (mid-tier workhorse), and Gemini 3 Flash Preview (#1 on PinchBench at 95.1%). Each model ran the workflow twice: once from prose docs, once from the JSON spec.


The numbers

JSON specs outperformed prose across every model and every scoring criterion. The improvement ranged from 20% (Opus, already strong with prose) to 50% (Gemini 3, which benefited the most from structured guidance).

29.7 - JSON average score (out of 30)
21.7 - Prose average score (out of 30)
1 pt - Cross-model variance with JSON (vs 5 pts for prose)
[Chart: Overall Scores by Model - prose (red) vs JSON (green), out of 30 points]
Score Breakdown
Each criterion scored 0-5. Every point matters.
Criterion          Opus Prose  Opus JSON  Sonnet Prose  Sonnet JSON  Gemini3 Prose  Gemini3 JSON
Step Coverage      5           5          4             5             4              5
Step Order         5           5          4             5             4              5
Gate Compliance    5           5          4             5             4              5
Validation         4           5          2             4             2              5
No Improvisation   2           5          3             5             3              5
Tool Accuracy      4           5          3             5             3              5
TOTAL              25          30         20            29            20             30
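The arithmetic behind the headline numbers is easy to verify. A quick sketch, with the per-criterion scores transcribed from the breakdown table:

```python
# Per-criterion scores (0-5) transcribed from the breakdown table:
# coverage, order, gates, validation, no-improvisation, tool accuracy.
scores = {
    "Opus":    {"prose": [5, 5, 5, 4, 2, 4], "json": [5, 5, 5, 5, 5, 5]},
    "Sonnet":  {"prose": [4, 4, 4, 2, 3, 3], "json": [5, 5, 5, 4, 5, 5]},
    "Gemini3": {"prose": [4, 4, 4, 2, 3, 3], "json": [5, 5, 5, 5, 5, 5]},
}

# Totals per model and format, then the cross-model averages.
totals = {m: {fmt: sum(v) for fmt, v in runs.items()} for m, runs in scores.items()}
prose_avg = sum(t["prose"] for t in totals.values()) / 3
json_avg = sum(t["json"] for t in totals.values()) / 3
improvement = (json_avg - prose_avg) / prose_avg

print(totals["Opus"])                                      # {'prose': 25, 'json': 30}
print(round(prose_avg, 1), round(json_avg, 1))             # 21.7 29.7
print(f"{improvement:.0%}")                                # 37%
```

The +37% headline falls straight out of the two averages.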
[Chart: Prose Performance - each model's profile across the 6 criteria]
[Chart: JSON Performance - models converge to near-perfect scores]

What the data tells us

Three patterns emerged that explain why structured specs win.

🎯

Zero improvisation with JSON specs

Every prose run invented extra steps. Opus added 10 steps that weren't in the workflow. Sonnet added 5. Gemini 3 added 4. Each model improvised differently, making outputs unpredictable. With JSON specs: all three models produced zero invented steps. The spec is the spec. Nothing more, nothing less.

✅

Validation compliance jumped from 41% to 93%

Prose runs averaged 12 out of 29 verification checks. Agents naturally skip verification when it's suggested in prose. The JSON schema makes validation mandatory on every action step. Result: JSON runs averaged 27/29 verifications. When verification is structural, not optional, agents do it.
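The compliance rates follow directly from those check counts:

```python
total_checks = 29                  # verification checks defined in the workflow
prose_done, json_done = 12, 27     # average checks actually performed per run

prose_rate = prose_done / total_checks
json_rate = json_done / total_checks
print(f"prose {prose_rate:.0%}, json {json_rate:.0%}")   # prose 41%, json 93%
```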

🔄

Different models produce near-identical outputs

With prose, model scores ranged from 20 to 25 (5-point spread). Each model interpreted the same instructions differently. With JSON, scores ranged from 29 to 30 (1-point spread). The structured format acts as an equalizer: it doesn't matter which model you use, the output is consistent.

[Chart: Improvement by Model - which models benefit most from structured specs?]

Insight: The models that benefit most from JSON specs are the ones that are least familiar with your specific tooling. Gemini 3 Flash improved by 50%. This suggests structured specs are especially valuable when you're using models across different providers, or when you can't guarantee which model will execute a workflow.


What a structured spec looks like

Instead of prose paragraphs, each workflow step becomes a structured object with explicit actions, tool calls, and mandatory validation. Here's a simplified example:

```json
{
  "id": "create-github-repo",
  "name": "Create private GitHub repo",
  "type": "action",
  "action": {
    "tool": "exec",
    "command": "gh repo create org/project-name --private"
  },
  "validation": {                  // ← mandatory, schema-enforced
    "check": "GitHub repo exists and is accessible",
    "command": "gh repo view org/project-name --json name",
    "expected": "Repo name returned"
  },
  "on_failure": {
    "action": "alert_human",
    "message": "Failed to create repo. Check auth."
  }
}
```

Five step types cover all workflow patterns: action (do something), gate (wait for human input), condition (branch on state), loop (iterate), and group (organize sub-steps). The JSON Schema enforces that every action step includes a validation block. If it's missing, the spec fails validation before it can be used.
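That "fails validation before it can be used" check is simple to mirror in code. A minimal sketch in pure Python, using the simplified step shape from the example above (a real implementation would be a JSON Schema with a conditional required clause):

```python
def missing_validation(steps):
    """Return ids of action steps that lack a validation block, recursing
    into group sub-steps. Gates, conditions, and loops are exempt here."""
    bad = []
    for step in steps:
        if step.get("type") == "action" and "validation" not in step:
            bad.append(step.get("id", "<unnamed>"))
        bad.extend(missing_validation(step.get("steps", [])))  # group children
    return bad

# Illustrative spec fragment: one valid action, one action missing its
# validation block inside a group, and a human gate.
spec = [
    {"id": "create-github-repo", "type": "action",
     "validation": {"check": "repo exists"}},
    {"id": "setup", "type": "group", "steps": [
        {"id": "init-ci", "type": "action"},     # no validation -> flagged
    ]},
    {"id": "approve-launch", "type": "gate"},    # gates carry no validation
]

print(missing_validation(spec))   # ['init-ci']
```

A spec that returns a non-empty list here is rejected before any agent sees it.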


What you give up

Structured specs aren't free. Here's what to consider before adopting them.

📦 Larger context

JSON specs are roughly 60% larger than equivalent prose (structural overhead from keys, braces, and quotes). For our test workflow: 26KB JSON vs 16KB prose. This uses more tokens per request. The tradeoff is worth it if consistency matters more than token cost.

🔧 Maintenance

You now have two formats to maintain: prose docs for humans, JSON specs for agents. When a workflow changes, both need updating. We mitigate this by treating prose as the source of truth and deriving JSON from it, but it's still extra work.

🎨 Flexibility

Prose handles edge cases gracefully. "Use your judgment" works in prose but not in JSON. We handle this with an agent tool type and notes fields for prose context within structured steps. Not every workflow should be fully structured.
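One possible shape for that escape hatch, shown as a Python dict for brevity; the tool name and field names here are illustrative assumptions, not the exact schema:

```python
# Hypothetical step shape: the structure around the step stays fixed, but
# the action delegates judgment to the model. Field names are illustrative.
judgment_step = {
    "id": "triage-flaky-tests",
    "type": "action",
    "action": {
        "tool": "agent",   # free-form reasoning instead of a fixed command
        "prompt": "Decide which failing tests are flaky vs real regressions.",
    },
    "notes": "Prefer quarantining flaky tests over deleting them.",
    "validation": {
        "check": "Every failing test is classified as flaky or regression",
    },
}

print(judgment_step["action"]["tool"])   # agent
```

The step still carries a mandatory validation block, so even judgment calls get checked.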

โฑ๏ธ Authoring time

Writing a JSON spec takes longer than writing a markdown doc. The pilot took about 30 minutes to convert. But it's a one-time cost per workflow, and the consistency returns are immediate and compounding.

What does this extra structure cost?

JSON specs are structurally larger than prose. More keys, more braces, more quotes. That means more input tokens per workflow execution. Here's the honest breakdown.

~3,900 - Prose tokens (input)
~8,200 - JSON tokens (input)
+110% - Token increase
Cost per Workflow Execution
Extra input token cost by model tier. Output tokens are excluded (roughly equal for both formats).
Model              Input Price       Prose Cost  JSON Cost  Extra per Run  Extra per 100 Runs
Claude Opus 4.6    $15 / M tokens    $0.059      $0.123     +$0.065        $6.45
Claude Sonnet 4.0  $3 / M tokens     $0.012      $0.025     +$0.013        $1.29
Gemini 3 Flash     $0.10 / M tokens  $0.0004     $0.0008    +$0.0004       $0.04
[Chart: Cost vs Quality Tradeoff - extra cost per run vs score improvement]

The math: On Opus (worst case), you pay an extra 6.5 cents per workflow run for a 20% quality improvement. On Gemini 3 Flash, the extra cost is effectively zero ($0.0004) for a 50% quality improvement. For most teams, the cost of an agent improvising and doing the wrong thing far exceeds the cost of a few thousand extra input tokens.
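The per-run figures reduce to one multiplication. A quick check, using the token counts above and the listed prices in dollars per million input tokens:

```python
prose_tokens, json_tokens = 3_900, 8_200
extra_tokens = json_tokens - prose_tokens       # 4,300 extra input tokens

# Input prices in $ per million tokens, as listed in the table above.
price_per_m = {"opus-4.6": 15.00, "sonnet-4.0": 3.00, "gemini-3-flash": 0.10}
extra_per_run = {m: extra_tokens / 1_000_000 * p for m, p in price_per_m.items()}

print({m: round(c, 4) for m, c in extra_per_run.items()})
# {'opus-4.6': 0.0645, 'sonnet-4.0': 0.0129, 'gemini-3-flash': 0.0004}
```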


Structure reduces ambiguity. Ambiguity is the enemy of consistency.

If you're building AI agent workflows that need to produce reliable, repeatable results across different models and sessions, structured JSON specs are worth the investment. The data is clear: agents follow structured instructions more faithfully than prose, verify their work more often, and improvise less. The format is the guardrail.