Prose instructions leave room for interpretation
Most AI agent workflows are defined as markdown documents: step-by-step prose that agents read, interpret, and execute. This works, but it introduces variability. Different models read the same instructions differently. Steps get skipped, reordered, or improvised upon. There's no way to validate whether the agent followed the workflow correctly.
📝 Prose Instructions
Natural language workflow documents. Human-readable and flexible, but agents must interpret them. Each model brings its own biases, assumptions, and tendencies. Verification steps are suggested, not enforced. The agent decides what "done" looks like.
⚙️ Structured JSON Specs
Machine-readable workflow specifications with explicit steps, tools, parameters, and mandatory validation blocks. Agents execute them mechanically. Every action has a verification check. The spec defines what "done" looks like, not the agent.
How we tested
We took a complex, real-world workflow (new project creation: 29 action steps, 3 human decision gates, multiple tools) and ran it as a dry-run planning exercise across three models in two formats. Each agent received the same scenario and was asked to walk through every step without executing.
Define Scenario
Same fictional project across all 6 runs
Run 6 Agents
3 models × 2 formats (prose & JSON)
Score Results
6 criteria, 0-5 each, max 30 points
Compare
Consistency, accuracy, improvisation
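The 0-5 rubric above can be sketched in code. This is an illustrative stand-in, not the actual scoring harness; the criterion names follow the results table below, and the helper function is a hypothetical example.

```python
# Hypothetical sketch of the scoring rubric: six criteria, each scored 0-5,
# summed to a maximum of 30 points. Names follow the results table.
CRITERIA = [
    "step_coverage", "step_order", "gate_compliance",
    "validation", "no_improvisation", "tool_accuracy",
]

def score_run(scores: dict[str, int]) -> int:
    """Sum per-criterion scores (0-5 each) into a 0-30 total."""
    for name in CRITERIA:
        value = scores[name]
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be 0-5, got {value}")
    return sum(scores[name] for name in CRITERIA)

# The Opus prose run from the results table totals 25/30.
opus_prose = {
    "step_coverage": 5, "step_order": 5, "gate_compliance": 5,
    "validation": 4, "no_improvisation": 2, "tool_accuracy": 4,
}
print(score_run(opus_prose))  # 25
```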
Models tested: Claude Opus 4.6 (Anthropic's flagship reasoning model), Claude Sonnet 4.0 (mid-tier workhorse), and Gemini 3 Flash Preview (#1 on PinchBench at 95.1%). Each model ran the workflow twice: once from prose docs, once from the JSON spec.
The numbers
JSON specs outperformed prose across every model and every scoring criterion. The improvement ranged from 20% (Opus, already strong with prose) to 50% (Gemini 3, which benefited the most from structured guidance).
| Criterion | Opus Prose | Opus JSON | Sonnet Prose | Sonnet JSON | Gemini3 Prose | Gemini3 JSON |
|---|---|---|---|---|---|---|
| Step Coverage | 5 | 5 | 4 | 5 | 4 | 5 |
| Step Order | 5 | 5 | 4 | 5 | 4 | 5 |
| Gate Compliance | 5 | 5 | 4 | 5 | 4 | 5 |
| Validation | 4 | 5 | 2 | 4 | 2 | 5 |
| No Improvisation | 2 | 5 | 3 | 5 | 3 | 5 |
| Tool Accuracy | 4 | 5 | 3 | 5 | 3 | 5 |
| TOTAL | 25 | 30 | 20 | 29 | 20 | 30 |
What the data tells us
Three patterns emerged that explain why structured specs win.
Zero improvisation with JSON specs
Every prose run invented extra steps. Opus added 10 steps that weren't in the workflow. Sonnet added 5. Gemini 3 added 4. Each model improvised differently, making outputs unpredictable. With JSON specs: all three models produced zero invented steps. The spec is the spec. Nothing more, nothing less.
Validation compliance jumped from 41% to 93%
Prose runs averaged 12 out of 29 verification checks. Agents naturally skip verification when it's suggested in prose. The JSON schema makes validation mandatory on every action step. Result: JSON runs averaged 27/29 verifications. When verification is structural, not optional, agents do it.
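Expressed as rates, the averages above work out as follows (a quick arithmetic check, nothing more):

```python
# Verification compliance rates from the averages quoted above.
prose_rate = 12 / 29  # prose runs: 12 of 29 checks performed
json_rate = 27 / 29   # JSON runs: 27 of 29 checks performed

print(f"prose: {prose_rate:.0%}, json: {json_rate:.0%}")  # prose: 41%, json: 93%
```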
Different models produce near-identical outputs
With prose, model scores ranged from 20 to 25 (5-point spread). Each model interpreted the same instructions differently. With JSON, scores ranged from 29 to 30 (1-point spread). The structured format acts as an equalizer: it doesn't matter which model you use, the output is consistent.
Insight: The models that benefit most from JSON specs are the ones that are least familiar with your specific tooling. Gemini 3 Flash improved by 50%. This suggests structured specs are especially valuable when you're using models across different providers, or when you can't guarantee which model will execute a workflow.
What a structured spec looks like
Instead of prose paragraphs, each workflow step becomes a structured object with explicit actions, tool calls, and mandatory validation. Here's a simplified example:
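A sketch of what such a step object might look like. The field names, tool identifiers, and template syntax here are illustrative assumptions, not the exact schema from our spec:

```json
{
  "id": "create-repo",
  "type": "action",
  "description": "Create the project repository",
  "tool": "github.create_repository",
  "params": { "name": "{{project_name}}", "visibility": "private" },
  "validation": {
    "check": "github.get_repository",
    "expect": { "name": "{{project_name}}", "exists": true }
  }
}
```

The "validation" block is the load-bearing part: the agent cannot mark the step done without running the check.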
Five step types cover all workflow patterns: action (do something), gate (wait for human input), condition (branch on state), loop (iterate), and group (organize sub-steps). The JSON Schema enforces that every action step includes a validation block. If it's missing, the spec fails validation before it can be used.
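The "validation is mandatory on action steps" rule can be sketched as a plain check. This is a minimal stand-in for the real JSON Schema, with assumed field names:

```python
# Minimal stand-in for the schema rule described above: every "action" step
# must carry a "validation" block, or the spec is rejected before use.
# Field names ("type", "validation", "steps") are assumptions for illustration.

def find_missing_validation(steps: list[dict]) -> list[str]:
    """Return ids of action steps that lack a validation block."""
    missing = []
    for step in steps:
        if step.get("type") == "action" and "validation" not in step:
            missing.append(step.get("id", "<unnamed>"))
        # "group" steps organize sub-steps; recurse into them.
        missing += find_missing_validation(step.get("steps", []))
    return missing

spec = [
    {"id": "create-repo", "type": "action", "validation": {"check": "repo_exists"}},
    {"id": "approve", "type": "gate"},
    {"id": "push-code", "type": "action"},  # no validation: should be flagged
]
print(find_missing_validation(spec))  # ['push-code']
```

In practice the same constraint lives in the JSON Schema itself, so a spec fails before any agent ever sees it.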
What you give up
Structured specs aren't free. Here's what to consider before adopting them.
📦 Larger context
JSON specs are roughly 60% larger than equivalent prose (structural overhead from keys, braces, and quotes). For our test workflow: 26KB JSON vs 16KB prose. This uses more tokens per request. The tradeoff is worth it if consistency matters more than token cost.
🔧 Maintenance
You now have two formats to maintain: prose docs for humans, JSON specs for agents. When a workflow changes, both need updating. We mitigate this by treating prose as the source of truth and deriving JSON from it, but it's still extra work.
🎨 Flexibility
Prose handles edge cases gracefully. "Use your judgment" works in prose but not in JSON. We handle this with an agent tool type and notes fields for prose context within structured steps. Not every workflow should be fully structured.
⏱️ Authoring time
Writing a JSON spec takes longer than writing a markdown doc. The pilot took about 30 minutes to convert. But it's a one-time cost per workflow, and the consistency returns are immediate and compounding.
What does this extra structure cost?
JSON specs are structurally larger than prose. More keys, more braces, more quotes. That means more input tokens per workflow execution. Here's the honest breakdown.
| Model | Input Price | Prose Cost | JSON Cost | Extra per Run | Extra per 100 Runs |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $15 / M tokens | $0.059 | $0.123 | +$0.065 | $6.46 |
| Claude Sonnet 4.0 | $3 / M tokens | $0.012 | $0.025 | +$0.013 | $1.29 |
| Gemini 3 Flash | $0.10 / M tokens | $0.0004 | $0.0008 | +$0.0004 | $0.04 |
The math: On Opus (worst case), you pay an extra 6.5 cents per workflow run for a 20% quality improvement. On Gemini 3 Flash, the extra cost is effectively zero ($0.0004) for a 50% quality improvement. For most teams, the cost of an agent improvising and doing the wrong thing far exceeds the cost of a few thousand extra input tokens.
Structure reduces ambiguity. Ambiguity is the enemy of consistency.
If you're building AI agent workflows that need to produce reliable, repeatable results across different models and sessions, structured JSON specs are worth the investment. The data is clear: agents follow structured instructions more faithfully than prose, verify their work more often, and improvise less. The format is the guardrail.