Research Results: March 2026

Structured Specs Beat Prose
for AI Agent Execution

We tested whether JSON workflow specifications produce more consistent AI agent execution than traditional prose instructions. We ran the same workflow through three leading models in both formats and scored the results. The difference was decisive.

+37% average improvement
with structured specs

Prose instructions leave room for interpretation

Most AI agent workflows are defined as markdown documents: step-by-step prose that agents read, interpret, and execute. This works, but it introduces variability. Different models read the same instructions differently. Steps get skipped, reordered, or improvised. There's no way to validate whether the agent followed the workflow correctly.

📝 Prose Instructions

Natural language workflow documents. Human-readable and flexible, but agents must interpret them. Each model brings its own biases, assumptions, and tendencies. Verification steps are suggested, not enforced. The agent decides what "done" looks like.

⚙️ Structured JSON Specs

Machine-readable workflow specifications with explicit steps, tools, parameters, and mandatory validation blocks. Agents execute them mechanically. Every action has a verification check. The spec defines what "done" looks like, not the agent.


How we tested

We took a complex, real-world workflow (new project creation: 29 action steps, 3 human decision gates, multiple tools) and ran it as a dry-run planning exercise across three models in two formats. Each agent received the same scenario and was asked to walk through every step without executing.

1. Define Scenario: same fictional project across all 6 runs
2. Run 6 Agents: 3 models × 2 formats (prose & JSON)
3. Score Results: 6 criteria, 0-5 each, max 30 points
4. Compare: consistency, accuracy, improvisation

Models tested: Claude Opus 4.6 (Anthropic's flagship reasoning model), Claude Sonnet 4.0 (mid-tier workhorse), and Gemini 3 Flash Preview (#1 on PinchBench at 95.1%). Each model ran the workflow twice: once from prose docs, once from the JSON spec.


The numbers

JSON specs outperformed prose across every model and every scoring criterion. The improvement ranged from 20% (Opus, already strong with prose) to 50% (Gemini 3, which benefited the most from structured guidance).

29.7 - JSON average score (out of 30)
21.7 - Prose average score (out of 30)
1 pt - Cross-model variance with JSON (vs 5 pts for prose)
[Chart: Overall Scores by Model - prose (red) vs JSON (green), out of 30 points]
Score Breakdown
Each criterion scored 0-5. Every point matters.
Criterion          Opus Prose  Opus JSON  Sonnet Prose  Sonnet JSON  Gemini3 Prose  Gemini3 JSON
Step Coverage      5           5          4             5             4              5
Step Order         5           5          4             5             4              5
Gate Compliance    5           5          4             5             4              5
Validation         4           5          2             4             2              5
No Improvisation   2           5          3             5             3              5
Tool Accuracy      4           5          3             5             3              5
TOTAL              25          30         20            29            20             30
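The arithmetic behind the headline numbers is easy to verify. A quick sketch, with the per-criterion scores transcribed from the breakdown table:

```python
# Per-criterion scores (0-5) transcribed from the breakdown table:
# coverage, order, gates, validation, no-improvisation, tool accuracy.
scores = {
    "Opus":    {"prose": [5, 5, 5, 4, 2, 4], "json": [5, 5, 5, 5, 5, 5]},
    "Sonnet":  {"prose": [4, 4, 4, 2, 3, 3], "json": [5, 5, 5, 4, 5, 5]},
    "Gemini3": {"prose": [4, 4, 4, 2, 3, 3], "json": [5, 5, 5, 5, 5, 5]},
}

# Totals per model and format, then the cross-model averages.
totals = {m: {fmt: sum(v) for fmt, v in runs.items()} for m, runs in scores.items()}
prose_avg = sum(t["prose"] for t in totals.values()) / 3
json_avg = sum(t["json"] for t in totals.values()) / 3
improvement = (json_avg - prose_avg) / prose_avg

print(totals["Opus"])                                      # {'prose': 25, 'json': 30}
print(round(prose_avg, 1), round(json_avg, 1))             # 21.7 29.7
print(f"{improvement:.0%}")                                # 37%
```

The +37% headline falls straight out of the two averages.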
[Chart: Prose Performance - each model's profile across the 6 criteria]
[Chart: JSON Performance - models converge to near-perfect scores]

What the data tells us

Three patterns emerged that explain why structured specs win.

🎯

Zero improvisation with JSON specs

Every prose run invented extra steps. Opus added 10 steps that weren't in the workflow. Sonnet added 5. Gemini 3 added 4. Each model improvised differently, making outputs unpredictable. With JSON specs: all three models produced zero invented steps. The spec is the spec. Nothing more, nothing less.

✅

Validation compliance jumped from 41% to 93%

Prose runs averaged 12 out of 29 verification checks. Agents naturally skip verification when it's suggested in prose. The JSON schema makes validation mandatory on every action step. Result: JSON runs averaged 27/29 verifications. When verification is structural, not optional, agents do it.
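The compliance rates follow directly from those check counts:

```python
total_checks = 29                  # verification checks defined in the workflow
prose_done, json_done = 12, 27     # average checks actually performed per run

prose_rate = prose_done / total_checks
json_rate = json_done / total_checks
print(f"prose {prose_rate:.0%}, json {json_rate:.0%}")   # prose 41%, json 93%
```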

🔄

Different models produce near-identical outputs

With prose, model scores ranged from 20 to 25 (5-point spread). Each model interpreted the same instructions differently. With JSON, scores ranged from 29 to 30 (1-point spread). The structured format acts as an equalizer: it doesn't matter which model you use, the output is consistent.

[Chart: Improvement by Model - which models benefit most from structured specs?]

Insight: The models that benefit most from JSON specs are the ones that are least familiar with your specific tooling. Gemini 3 Flash improved by 50%. This suggests structured specs are especially valuable when you're using models across different providers, or when you can't guarantee which model will execute a workflow.


What a structured spec looks like

Instead of prose paragraphs, each workflow step becomes a structured object with explicit actions, tool calls, and mandatory validation. Here's a simplified example:

```json
{
  "id": "create-github-repo",
  "name": "Create private GitHub repo",
  "type": "action",
  "action": {
    "tool": "exec",
    "command": "gh repo create org/project-name --private"
  },
  "validation": {                  // ← mandatory, schema-enforced
    "check": "GitHub repo exists and is accessible",
    "command": "gh repo view org/project-name --json name",
    "expected": "Repo name returned"
  },
  "on_failure": {
    "action": "alert_human",
    "message": "Failed to create repo. Check auth."
  }
}
```

Five step types cover all workflow patterns: action (do something), gate (wait for human input), condition (branch on state), loop (iterate), and group (organize sub-steps). The JSON Schema enforces that every action step includes a validation block. If it's missing, the spec fails validation before it can be used.
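That "fails validation before it can be used" check is simple to mirror in code. A minimal sketch in pure Python, using the simplified step shape from the example above (a real implementation would be a JSON Schema with a conditional required clause):

```python
def missing_validation(steps):
    """Return ids of action steps that lack a validation block, recursing
    into group sub-steps. Gates, conditions, and loops are exempt here."""
    bad = []
    for step in steps:
        if step.get("type") == "action" and "validation" not in step:
            bad.append(step.get("id", "<unnamed>"))
        bad.extend(missing_validation(step.get("steps", [])))  # group children
    return bad

# Illustrative spec fragment: one valid action, one action missing its
# validation block inside a group, and a human gate.
spec = [
    {"id": "create-github-repo", "type": "action",
     "validation": {"check": "repo exists"}},
    {"id": "setup", "type": "group", "steps": [
        {"id": "init-ci", "type": "action"},     # no validation -> flagged
    ]},
    {"id": "approve-launch", "type": "gate"},    # gates carry no validation
]

print(missing_validation(spec))   # ['init-ci']
```

A spec that returns a non-empty list here is rejected before any agent sees it.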


What you give up

Structured specs aren't free. Here's what to consider before adopting them.

📦 Larger context

JSON specs are roughly 60% larger than equivalent prose (structural overhead from keys, braces, and quotes). For our test workflow: 26KB JSON vs 16KB prose. This uses more tokens per request. The tradeoff is worth it if consistency matters more than token cost.

🔧 Maintenance

You now have two formats to maintain: prose docs for humans, JSON specs for agents. When a workflow changes, both need updating. We mitigate this by treating prose as the source of truth and deriving JSON from it, but it's still extra work.

🎨 Flexibility

Prose handles edge cases gracefully. "Use your judgment" works in prose but not in JSON. We handle this with an agent tool type and notes fields for prose context within structured steps. Not every workflow should be fully structured.
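One possible shape for that escape hatch, shown as a Python dict for brevity; the tool name and field names here are illustrative assumptions, not the exact schema:

```python
# Hypothetical step shape: the structure around the step stays fixed, but
# the action delegates judgment to the model. Field names are illustrative.
judgment_step = {
    "id": "triage-flaky-tests",
    "type": "action",
    "action": {
        "tool": "agent",   # free-form reasoning instead of a fixed command
        "prompt": "Decide which failing tests are flaky vs real regressions.",
    },
    "notes": "Prefer quarantining flaky tests over deleting them.",
    "validation": {
        "check": "Every failing test is classified as flaky or regression",
    },
}

print(judgment_step["action"]["tool"])   # agent
```

The step still carries a mandatory validation block, so even judgment calls get checked.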

โฑ๏ธ Authoring time

Writing a JSON spec takes longer than writing a markdown doc. The pilot took about 30 minutes to convert. But it's a one-time cost per workflow, and the consistency returns are immediate and compounding.

What does this extra structure cost?

JSON specs are structurally larger than prose. More keys, more braces, more quotes. That means more input tokens per workflow execution. Here's the honest breakdown.

~3,900 - Prose tokens (input)
~8,200 - JSON tokens (input)
+110% - Token increase
Cost per Workflow Execution
Extra input token cost by model tier. Output tokens are excluded (roughly equal for both formats).
Model              Input Price       Prose Cost  JSON Cost  Extra per Run  Extra per 100 Runs
Claude Opus 4.6    $15 / M tokens    $0.059      $0.123     +$0.065        $6.45
Claude Sonnet 4.0  $3 / M tokens     $0.012      $0.025     +$0.013        $1.29
Gemini 3 Flash     $0.10 / M tokens  $0.0004     $0.0008    +$0.0004       $0.04
[Chart: Cost vs Quality Tradeoff - extra cost per run vs score improvement]

The math: On Opus (worst case), you pay an extra 6.5 cents per workflow run for a 20% quality improvement. On Gemini 3 Flash, the extra cost is effectively zero ($0.0004) for a 50% quality improvement. For most teams, the cost of an agent improvising and doing the wrong thing far exceeds the cost of a few thousand extra input tokens.
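The per-run figures reduce to one multiplication. A quick check, using the token counts above and the listed prices in dollars per million input tokens:

```python
prose_tokens, json_tokens = 3_900, 8_200
extra_tokens = json_tokens - prose_tokens       # 4,300 extra input tokens

# Input prices in $ per million tokens, as listed in the table above.
price_per_m = {"opus-4.6": 15.00, "sonnet-4.0": 3.00, "gemini-3-flash": 0.10}
extra_per_run = {m: extra_tokens / 1_000_000 * p for m, p in price_per_m.items()}

print({m: round(c, 4) for m, c in extra_per_run.items()})
# {'opus-4.6': 0.0645, 'sonnet-4.0': 0.0129, 'gemini-3-flash': 0.0004}
```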


Structure reduces ambiguity. Ambiguity is the enemy of consistency.

If you're building AI agent workflows that need to produce reliable, repeatable results across different models and sessions, structured JSON specs are worth the investment. The data is clear: agents follow structured instructions more faithfully than prose, verify their work more often, and improvise less. The format is the guardrail.