
Contract Driven Development

15 min read · Feb 9, 2026

Code Review Is Dead

We've been staring at the wrong thing.

For decades, software development has been about code. Write it. Review it. Test it. Refactor it. Debate it in pull requests. Argue about tabs vs spaces. Hire senior engineers specifically because they're good at reading other people's code.

Meanwhile, the actual question—does this thing deliver value?—gets answered months later. Maybe never.

Here's the thing nobody wants to say out loud: developers guiding LLMs can produce a million lines of meaningful code per day. That number is going up, not down. Your senior engineer cannot review a million lines of code. Your team of senior engineers cannot review a million lines of code.

Code review is dead. It just doesn't know it yet.

So what replaces it?

We trust code. We should trust outcomes.

The outcomes are contracts.


Vibes vs Correctness

LLMs got good enough at code that everyone started vibing. Cursor, Copilot, "just let it write the whole file."

And most of it — let's be honest — is slop.

Developers prompting LLMs in tight loops, accepting whatever comes back, shipping it into repos with no structure and no verification. The code works today. It rots tomorrow.

The problem isn't the LLMs. It's the loop.

Without guardrails, every iteration where an LLM touches your codebase makes it slightly worse. Hallucinations, version mismatches, subtle logic bugs — death by a thousand cuts. The code decays. Eventually it collapses.

But there's a different path. Teams on the frontier — building structured agentic loops with specs, harnesses, and validation — are seeing the opposite: iterations that compound correctness instead of compounding error.

The math is brutal in both directions:

  • Get 1% better every day for a year: 1.01^365 = 37.78x
  • Get 1% worse every day: 0.99^365 = 0.03

Same starting point. Opposite trajectories.

This isn't theoretical.

StrongDM's AI team built a Software Factory founded on two rules:

  • Code must not be written by humans.
  • Code must not be reviewed by humans.

They ship production security software — the last thing you'd expect to build without code review — and they do it by validating contracts, not reading diffs.

We're still on the frontier. Most teams haven't figured this out yet. But the ones that do will need something other than code review to keep the loop honest.

You need contracts.


A Contract Is Not a Test

A contract is not a test.

A test asks: "Does this code work?"

A contract asks: "What does 'work' mean at this level of abstraction?"

Tests verify implementation. Contracts define reality.

Your server is running. Your API returns the right shape. Your user can complete checkout. Your revenue exceeds your costs.

These aren't test assertions — they're promises that each layer of your system makes to the layer above it.

When all the promises hold, the system delivers value. When one breaks, you know exactly which layer lied.


Three Layers

Think of your system as three layers.

The CDD Hierarchy — three layers: Implementation, Capability, Value

At the bottom: Implementation. Is the system/codebase verifiable? Not "is the code good" — verifiable. Can you prove it's correct, secure, tested, performant, and consistent without reading it? This is where the 7 Contracts live (more on those shortly).

In the middle: Capability. Can the user do the thing? Can they pay? Export their data? Complete the workflow they came for?

At the top: Value. Is it profitable? Scalable? Does it matter?

Each layer depends on the one below it. Each layer makes promises to the one above it.

Those promises are contracts. And they cascade.


Trust Flows Up

Trust flows upward through the three layers

Trust flows upward.

If Implementation contracts pass → trust the codebase.
If Capability contracts pass → trust the product.
If Value contracts pass → trust the business.

If all layers pass, you've proven the system delivers value.

Without reading a single line of code.

Implementation becomes irrelevant.

  • Ralph writes it? Fine.
  • A human writes it? Fine.
  • GPT-7 writes it? Fine.

Same contracts. Different shadows on the cave wall.

You don't review implementation. You validate contracts.


Debug the Contract, Not the Code

Traditional testing proves the code works. It doesn't prove the system delivers value. You can have 100% test coverage and still ship something nobody uses.

CDD changes where you look when something goes wrong.

When an Implementation contract fails — say, mutation testing catches a hollow assertion — you don't read the code. You regenerate it. The contract told you what "correct" looks like. The implementation didn't meet it. Throw it away, try again.

When a Capability contract fails — say, your E2E test for "user can export invoices as CSV" starts failing — you know the problem is somewhere between "the code works" and "the user can do the thing." Maybe a UI change broke the export button. Maybe the API response shape changed. You don't grep through the codebase. You look at which Capability contract failed and let the agent fix it against that contract.
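
Concretely, a Capability contract is usually just an end-to-end test pinned to a user outcome. Here's a sketch in Playwright; the route, button label, and filename pattern are placeholders, not this site's real suite:

// e2e/export-invoices.spec.ts: a Capability contract as a Playwright test.
// The route, button label, and filename pattern are hypothetical placeholders.
import { test, expect } from '@playwright/test';

test('user can export invoices as CSV', async ({ page }) => {
  await page.goto('/invoices');

  // Start waiting for the download before clicking so we don't race it.
  const downloadPromise = page.waitForEvent('download');
  await page.getByRole('button', { name: 'Export CSV' }).click();
  const download = await downloadPromise;

  // The contract is the user outcome, not the implementation:
  // a CSV file actually lands in the user's hands.
  expect(download.suggestedFilename()).toMatch(/\.csv$/);
});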

When a Value contract fails — say, average time from invoice-sent to payment-received creeps above 14 days — the code is fine. The tests pass. The user can do the thing. But the business outcome is degrading. Maybe the invoice email copy isn't compelling enough. Maybe the payment link is buried. This is the kind of failure traditional testing never catches. In CDD, it's a contract violation like any other.
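
Value contracts look less like tests and more like scheduled checks against business data. A minimal sketch, assuming some nightly job exports invoice events to a JSON file; the path and record shape here are invented:

// value-contracts/payment-lag.test.ts: a Value contract as a scheduled check.
// Assumes a nightly export writes invoice events to data/invoices.json;
// both the path and the record shape are hypothetical.
import { it, expect } from 'vitest';
import { readFileSync } from 'node:fs';

type Invoice = { sentAt: string; paidAt: string | null };

it('average time from invoice-sent to payment-received stays under 14 days', () => {
  const invoices: Invoice[] = JSON.parse(readFileSync('data/invoices.json', 'utf8'));

  const lagsInDays = invoices
    .filter((inv) => inv.paidAt !== null)
    .map((inv) => (Date.parse(inv.paidAt!) - Date.parse(inv.sentAt)) / 86_400_000);

  const average = lagsInDays.reduce((sum, days) => sum + days, 0) / lagsInDays.length;
  expect(average).toBeLessThan(14);
});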

In each case, you debug at the contract level, not the code level. Which contract is missing? Which contract is wrong? Which layer lied?

The implementation is disposable. Generate it. Throw it away. Generate again.

The contracts are the product.


The Bet

CDD doesn't guarantee correctness. Contracts can be poorly specified. Checks can have bugs. Judges can be wrong.

What CDD guarantees is explicitness. When something fails, you know exactly where. Which layer. Which contract. Which field.

So we bet that:

  1. Outcomes can be contracted.
  2. Contracts can be validated.
  3. Validation can be trusted.
  4. Implementation can be ignored.

If true, we've just made 90% of software development—the implementation part—a commodity.

The remaining 10%? Defining what "value" means. Writing good contracts. That's the hard part. That's the human part.

Let's make it concrete.


The Bridge

Everything above is philosophy. Useful philosophy — but if you closed this tab right now, you'd have a framework and zero idea how to apply it.

The gap between "contracts define reality" and "ok but what do I actually check" is where most manifestos die. This one doesn't get to die there.

So here's the question: when a piece of code ships — written by a human, an agent, or some hybrid that doesn't have a name yet — what must be true about it?

Not "what's nice to have." What must be true?


Seven Starting Points

If you asked a great code reviewer what they're actually checking, most of it boils down to something like this:

  1. It works correctly, including edge cases.
  2. It's readable by someone who didn't write it.
  3. It's secure.
  4. It's simple enough to change later.
  5. It fits the existing codebase patterns.
  6. It has no obvious performance or cost problems.
  7. The critical paths are tested.

Your list might be different. Maybe you'd add one, drop one, split one into two. That's fine. The point isn't that these seven are sacred — it's that the checklist used to be implicit, living in the head of a senior engineer who's been around long enough to know what "good" looks like.

Now it has to be explicit. Because the reviewer is a machine.


Old vs New

Every contract used to be verified by a human reading a diff. That doesn't scale to a million lines a day. Here's how each one shifts:

  • Correctness. Old: read the code, think through the logic. New: test suites, property-based tests, fuzzing.
  • Readability. Old: read the diff, squint at the naming. New: on-demand LLM explanation, generated docs.
  • Security. Old: reviewer knowledge, gut check. New: SAST/DAST, adversarial LLM scanning.
  • Simplicity. Old: "this feels over-engineered." New: complexity metrics, coupling analysis.
  • Consistency. Old: pattern recognition from experience. New: AST pattern matching, convention-as-code.
  • Performance/Cost. Old: the reviewer spots the N+1 queries. New: benchmarks, load tests, cost regression detection.
  • Testing. Old: "did you write tests?" New: coverage reports, mutation testing.

The contracts don't change. How you verify them does.


Who Watches the Watchmen

Here's the obvious problem: if an LLM writes the code and the tests, what stops it from cheating?

This isn't hypothetical. When LLMs encounter friction getting a test to pass, they will absolutely hollow out the assertion. The sneaky version looks like this: call the function (so coverage stays high), then assert something meaningless about the result. Test passes. Coverage looks great. The assertion proves nothing.
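
In Vitest terms, the difference is stark. The module and function below are made up; the shape of the problem is not:

// hollow-vs-real.test.ts: the sneaky version next to the honest one.
// calculateInvoiceTotal and ./invoices are hypothetical names.
import { it, expect } from 'vitest';
import { calculateInvoiceTotal } from './invoices';

// Hollow: the function runs, so coverage goes up, but this can never fail.
it('calculates the invoice total', () => {
  const total = calculateInvoiceTotal([{ amount: 100 }, { amount: 50 }]);
  expect(total).toBeDefined();
});

// Real: a specific input pinned to a specific output.
// Break the sum inside the implementation and this test fails.
it('sums line items into the invoice total', () => {
  expect(calculateInvoiceTotal([{ amount: 100 }, { amount: 50 }])).toBe(150);
});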

You need verification layers that are hard to game. Three that actually work:

Coverage + mutation testing.

Coverage tells you "this code ran during a test." Mutation testing tells you "this code is actually being checked."

It works by deliberately injecting small bugs — flipping a > to >=, swapping true to false, deleting a line — then running your tests against each mutant. If your tests catch the bug, the mutant is "killed." If all tests still pass, the mutant survived — and your tests are lying to you.

expect(true).toBe(true) kills zero mutants. It gets flagged instantly.

Mutation testing is purely mechanical — no LLM needed. Tools like Stryker (JS/TS) and mutmut (Python) have been around for years. They just weren't worth the compute cost when humans were reviewing code anyway. In a world where humans can't review code, the math changes.
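
A minimal StrykerJS setup for a Vitest project looks roughly like this. Treat it as a sketch: the globs and thresholds are examples, and you should check the options against the Stryker docs for your version.

// stryker.config.mjs: a sketch, assuming @stryker-mutator/core plus the
// Vitest runner plugin (@stryker-mutator/vitest-runner) are installed.
export default {
  testRunner: 'vitest',
  mutate: ['src/**/*.ts', '!src/**/*.test.ts'],
  reporters: ['clear-text', 'progress', 'html'],
  // Matches the "mutation score > 80%" contract: the run fails below `break`.
  thresholds: { high: 90, low: 85, break: 80 },
};

Run it with npx stryker run. In CI, a score under the break threshold fails the build, which is exactly what a broken contract should do.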

Holdout scenarios.

StrongDM's most elegant insight: store your validation scenarios outside the codebase, where the coding agent can't see them. Like a holdout set in ML training. The agent can't game what it can't access. You define end-to-end user stories separately, and the system has to satisfy them blind.
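
One cheap way to wire this up: keep the holdout specs in a directory the coding agent never reads, and only point the test runner at it during the verification pass. A Playwright sketch, with HOLDOUT_DIR as an invented name:

// playwright.config.ts: sketch of a holdout setup. The coding agent's loop runs
// the normal e2e/ suite; the verification pass points HOLDOUT_DIR at the
// scenarios folder the coding agent never reads.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: process.env.HOLDOUT_DIR ?? './e2e',
  use: { baseURL: process.env.BASE_URL ?? 'http://localhost:3000' },
});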

Adversarial LLM review.

The LLM that writes the code never reviews it. A second model, with a different prompt and different incentives, specifically tries to break it. Its job is to find hollow assertions, tautological tests, gaming. It's not reading the code for style — it's attacking it for substance.

No single layer is trustworthy alone. You need gates that verify other gates.


The How-To

Enough theory. Here's how you actually do this.

You need Claude Code or Cursor CLI. A ralph loop. And a team.
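
ralph.sh shows up all over the rest of this post, so here's the shape of it: a dumb loop that hands an agent its SOUL, its memory, and a task, then lets the contract suite grade the result. This is a sketch against Claude Code's non-interactive mode (claude -p), not the real script:

#!/usr/bin/env bash
# ralph.sh <agent> <task> <iterations>: a minimal sketch of the loop.
# Assumes the Claude Code CLI; swap in whatever agent runner you actually use.
set -euo pipefail
agent="$1"; task="$2"; iterations="$3"

for i in $(seq "$iterations"); do
  claude -p "$(cat ".ralph/agents/${agent}/SOUL.md" .ralph/memory/CODEBASE.md)

Task: ${task}. Pick the next unfinished piece, implement it, make the contract
suite pass, update memory, and commit."

  npm test || true   # contract results become context for the next iteration
done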


Build the Team

Ralph Wiggum directing his agent team

Before you write a line of code — before you write a PRD — you build the team. Not a human team. An agent team.

Each agent gets a SOUL: a markdown file that defines who they are, what they're good at, what tools they can use, and what they're not allowed to do. Constraints focus agents. An agent "good at everything" is mediocre.

.ralph/
├── agents/
│   ├── fury/
│   │   └── SOUL.md       # Product researcher
│   ├── friday/
│   │   └── SOUL.md       # Developer
│   ├── shuri/
│   │   └── SOUL.md       # QA / Attacker
│   ├── loki/
│   │   └── SOUL.md       # Content writer
│   └── pepper/
│       └── SOUL.md       # Marketing / Email
├── memory/
│   ├── CODEBASE.md        # What the agents have learned about this project
│   └── CONTEXT.md         # Who you are, what you care about
├── teams/
│   ├── engineering.json   # friday, shuri
│   ├── research.json      # fury
│   └── marketing.json     # loki, pepper
└── scenarios/             # Holdout tests (agents can't see these)

Here's what a SOUL looks like:

# SOUL.md — Friday (Developer)

## Identity
Name: Friday
Role: Developer
Team: Engineering

## Personality
Code is poetry. Clean, tested, documented.
Prefers small commits. Runs tests before committing.
Asks Shuri for review on anything user-facing.

## Skills
- TypeScript, React, Next.js
- Testing (vitest, playwright)
- Database migrations
- API design

## Tools
Allowed: shell, filesystem, git
Denied: email, calendar

## Boundaries
- Won't merge without tests passing
- Escalates to human if touching auth or payments
- Asks for design review on UI changes

The SOUL persists across tasks. Friday learns your codebase. What worked, what didn't, which patterns the team prefers. That knowledge lives in memory/CODEBASE.md and compounds over time.

This is different from a prompt. A prompt is disposable. A SOUL is an identity.


Define the Contracts

This site — the one you're reading right now — is where I'm building this. So I'll use it as the example.

CONTRACTS.md — The interface between you and your agents.

# Contracts

## Code Quality
- All functions under 50 lines
- Max cyclomatic complexity: 10
- Mutation testing score > 80% (Stryker)
- No known vulnerabilities (npm audit)
- Coverage > 90%
- Descriptive naming, consistent patterns (ESLint)
- Pages load in under 2 seconds (Lighthouse)

## User Flows
- Reader can browse articles by tag
- Reader can search articles
- Reader can subscribe to newsletter
- Reader can share an article
- All pages render correctly on mobile

## Behavior
- Article MDX compiles without errors (property-based)
- OG images generate correctly for every article
- RSS feed includes all published articles
- Dark mode persists across sessions

## Brand
- Article copy sounds like me, not a corporation (LLM-as-judge)
- Landing page clearly explains what the site is about (LLM-as-judge)
- UI feels clean and personal, not a template (LLM-as-judge)

## Tooling
- Vitest + fast-check for unit and property-based tests
- Stryker for mutation testing
- Playwright for E2E user flows
- ESLint for complexity and consistency
- GitHub Actions to run everything on every PR
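
The Brand contracts are the squishiest of the bunch, but they're still just checks. Here's a rough sketch of an LLM-as-judge test using the Anthropic SDK; the article path, rubric, judge model, and pass threshold are all placeholders you'd tune:

// brand-contracts/voice.test.ts: an LLM-as-judge contract, sketched.
// Assumes ANTHROPIC_API_KEY is set in the environment.
import { it, expect } from 'vitest';
import { readFileSync } from 'node:fs';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

it('article copy sounds like a person, not a corporation', async () => {
  const article = readFileSync('content/articles/contract-driven-development.mdx', 'utf8');

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',   // pick whichever judge model you trust
    max_tokens: 16,
    messages: [{
      role: 'user',
      content: 'Rate this article 1-10 for sounding like an individual human voice ' +
        'rather than corporate marketing copy. Reply with only the number.\n\n' + article,
    }],
  });

  const block = response.content[0];
  const score = block.type === 'text' ? Number(block.text.trim()) : 0;
  expect(score).toBeGreaterThanOrEqual(7);
}, 60_000);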

These contracts only grow. You add, tighten, never remove. Every feature your agents build must pass every contract that came before. That's the ratchet. That's how correctness compounds instead of decays.
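
To make the ratchet mechanical, the whole suite runs on every pull request. A sketch of the GitHub Actions workflow, assuming npm scripts named lint and test:e2e exist (yours will differ):

# .github/workflows/contracts.yml: run every contract on every PR (a sketch).
name: contracts
on: pull_request

jobs:
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npm run lint                   # complexity + consistency (ESLint)
      - run: npm test -- --coverage         # correctness (Vitest + fast-check)
      - run: npx stryker run                # mutation score, enforced via break threshold
      - run: npx playwright install --with-deps
      - run: npm run test:e2e               # user flows (Playwright)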


The Proposal

Here's where it gets interesting. The agents don't wait for me to tell them what to build. Fury researches features on his own and proposes them.

He wakes up on a schedule. Looks at analytics, reader behavior, industry trends, what competitors are doing. Then he writes a proposal — a YAML file with everything I need to make a decision:

# .ralph/proposals/live-agent-feed.yaml
name: Live Agent Feed
status: proposed
proposed_by: fury
proposed_at: 2026-02-02

problem: |
  The CDD article makes a big claim: agent teams can maintain
  a production site autonomously. But the reader has no proof.
  They read the theory, nod, and leave. There's no artifact
  that makes them think "holy shit, this is actually running."

evidence:
  - type: analytics
    file: artifacts/analytics-article-engagement.json
    summary: |
      Average time on CDD article: 4m12s. Bounce rate: 62%.
      Readers engage with the philosophy but don't convert
      to newsletter or return visits.
  - type: competitor
    file: artifacts/competitor-transparency-analysis.md
    summary: |
      - Linear ships a public changelog with real-time updates
      - Vercel has a status page showing deployments
      - No personal site shows live agent activity
      - First-mover advantage is real here
  - type: research
    file: artifacts/transparency-trust-research.md
    summary: |
      Transparent build processes increase trust by 34% (Edelman).
      "Building in public" content gets 2.7x more engagement
      than polished announcements (SparkToro, 2025).
    sources:
      - https://www.edelman.com/trust/trust-barometer
      - https://sparktoro.com/blog/building-in-public-data

solution: |
  Add a /agents page to the site. Real-time feed showing:
  - What each agent is currently working on
  - Recent proposals (with full YAML visible)
  - Contract dashboard (pass/fail/new)
  - Ship log (what went live and when)
  - Cost transparency (tokens spent per agent per week)

  Readers of the CDD article land on /agents and see the
  proof. The article links to it. The page IS the argument.

mockups:
  - file: artifacts/agent-feed-mockup.png
    description: Desktop layout — activity stream left, contract status right
  - file: artifacts/agent-feed-mobile.png
    description: Mobile — single column, cards

marketing_angle: |
  "This site is maintained by an AI team. Watch them work."

marketing_assets:
  - type: social_post
    file: artifacts/social-agent-feed-launch.md
    summary: |
      "I added a page to my site where you can watch my AI team
      work in real time. They research features, build them, attack
      them, and ship them. I just approve or reject proposals."
  - type: og_image
    file: artifacts/og-agent-feed.png
  - type: article_update
    file: artifacts/cdd-article-addendum.md

contracts:
  - Agent activity feed updates within 60 seconds of agent action
  - Proposal YAML renders correctly with syntax highlighting
  - Contract dashboard shows real pass/fail from latest CI run
  - Page loads in under 2 seconds (Lighthouse)
  - Ship log shows last 30 days of merged PRs with descriptions
  - Cost tracker displays weekly token spend per agent
  - Mobile layout is usable without horizontal scroll
  - LLM-as-judge: page feels like a mission control dashboard, not a log dump
  - LLM-as-judge: a non-technical reader can understand what the agents are doing

effort: medium
priority: critical

follow_up: |
  Fury checks /agents page engagement 14 days after ship.
  Track: unique visitors, time on page, shares, newsletter
  signups from /agents vs from articles.
  If engagement > 2x article average, propose expanding to
  include a "request a feature" form that feeds into
  the proposal pipeline.

Every file: path points to something real in an artifacts/ folder next to the YAML. The research. The screenshots. The competitive analysis. The mockups. I'm not trusting a summary — I can click through to the raw data in one hop.

My job is one word. I open the YAML. I change status: proposed to:

  • approved — ship it
  • rejected — not interested
  • deferred — good idea, not now

That's my entire decision surface.


The Build

I changed status to approved. Went to bed.

ralph.sh friday live-agent-feed 20

Friday picks up the approved proposal. Reads the contracts. Builds the feature. Writes tests that verify each contract. Runs them. Commits when green. Moves to the next piece.

Each iteration compounds on the last. The contract suite grows monotonically. Over 20 iterations: 1.01^20 = 1.22x better. Over 100: 1.01^100 = 2.7x. The contracts are the ratchet.

Friday also writes to memory/CODEBASE.md as he works — patterns he's discovered, conventions the project uses, gotchas to remember. The next iteration benefits from what the last one learned.


The Attack

Friday wrote the code and the tests. Shuri's job is to break them.

ralph.sh shuri live-agent-feed-verify 10

Shuri's SOUL says she's skeptical. She thinks like a first-time reader. Her loop:

  1. Read the proposal's contracts
  2. For each contract, evaluate whether the tests actually verify it or just pretend to
  3. Find hollow assertions, missing edge cases, gaming
  4. Read scenarios/ — holdout tests that Friday never saw
  5. Report what's broken. Fix what she can. File tasks for what she can't.

Friday never sees Shuri's prompt. Shuri never sees Friday's instructions. Different SOULs, different tools, different incentives. Separation of concerns baked into the team structure.

If Shuri finds a hollow assertion, she files a task. Friday picks it up on his next heartbeat.


The Content

While Friday and Shuri are doing their thing, Loki and Pepper are doing theirs.

ralph.sh loki live-agent-feed-content 5

Loki reads the proposal's marketing_angle and marketing_assets. His loop:

  1. Write the social posts referenced in the proposal
  2. Write a changelog entry for the site
  3. Update the CDD article with a "see it live" callout
  4. Validate everything against the Brand contracts (LLM-as-judge)
  5. Iterate until the vibes pass

ralph.sh pepper live-agent-feed-announce 3

Pepper writes the newsletter announcement. She has access to email tools (Friday doesn't). But sending requires human approval.

The dashboard shows: "Pepper wants to send a newsletter to 1,247 subscribers. Subject: 'This site is now maintained by AI agents. Watch them work.' [Approve] [Edit] [Deny]"

I read it. I approve it. Pepper sends it.


The Ship

By morning, a PR is open. CI ran every contract. All green. It merged.

The /agents page is live. Readers of this article can click over and watch the team work. The article's thesis has a proof.


The Follow-Up

Two weeks later, Fury wakes up again. The follow_up field in the proposal told him to check engagement.

ralph.sh fury live-agent-feed-validate 3

His loop:

  1. Pull analytics for /agents — unique visitors, time on page, shares
  2. Compare to article engagement baseline
  3. If engagement > 2x, propose expanding with a "request a feature" form
  4. If engagement is flat, propose changes or mark the feature for review
  5. Update CONTRACTS.md with any new contracts based on findings

The researcher doesn't just report — he proposes new contracts. The suite only grows.


The Dashboard

Your interface isn't terminal output. It's a dashboard.

  • Proposal queue — new proposals from Fury, waiting for your one-word verdict
  • Activity feed — real-time stream of what every agent is doing
  • Contract status — which contracts pass, which fail, which are new
  • Ship log — what went live and when
  • Cost tracker — tokens per agent, per team, per day

You're not reading code. You're not reading diffs. You're watching a team work, reviewing their output, and steering with contracts.


The Whole Picture

You (human)
  ├── Approve proposals ──→ .ralph/proposals/ (one word)
  ├── Define what matters ──→ CONTRACTS.md
  ├── Approve sensitive actions ──→ Dashboard
  └── Steer based on results ──→ Update contracts
Agent Team
  ├── Fury (researcher) ──→ proposes features with evidence + artifacts
  ├── Friday (developer) ──→ builds approved features, runs contracts
  ├── Shuri (QA) ──→ attacks code, runs holdout scenarios, finds gaming
  ├── Loki (content) ──→ writes copy, validates brand contracts
  └── Pepper (marketing) ──→ newsletters, campaigns (needs approval to send)
Contracts (the interface)
  ├── Code Quality ──→ automated (Vitest, Stryker, ESLint)
  ├── User Flows ──→ automated (Playwright)
  ├── Behavior ──→ automated (fast-check, property-based tests)
  ├── Brand ──→ automated (LLM-as-judge)
  └── Growing ──→ every loop adds verification, never removes it

The proposals folder is the roadmap. The artifacts folder is the evidence. The contracts are the acceptance criteria. The human writes one word: approved.

The contracts are the ratchet.