Formally Verifying the Easy Part
A field report on formal verification, AI-generated code, and where the real bugs live.
Four days ago, Axiom announced a $200M Series A at a $1.6B valuation to build AI that generates formally verified code. They join Harmonic ($295M raised, $1.45B valuation) and Logical Intelligence in a funding wave totalling over half a billion dollars, all converging on the same thesis: AI will write the code, and mathematical proofs will guarantee it works.
Martin Kleppmann captured the optimism well in his December 2025 prediction that AI will make formal verification go mainstream. The logic is appealing: proof checkers are incorruptible, LLMs are getting good at generating proofs, and once the cost of verification drops far enough, we’ll use it everywhere. Menlo Ventures, leading Axiom’s round, put it bluntly: “AI will write all the code. Mathematics will prove it works.”
I’ve spent the last few months exploring how to make this work: surveying verification languages, experimenting with Dafny, and iterating on different approaches to integrating formal proofs into AI coding workflows. What I found was less clean than the thesis suggests. The proofs worked. The math was fine. And every real bug I hit was in a place no prover could reach.
The problem I was trying to solve
Anyone who’s used AI coding agents seriously has hit the same wall. The agent writes code. It writes tests. The tests pass. And the code is still wrong.
It’s “test theatre”: the agent optimises for a green checkmark rather than for correctness. It writes tests that assert the code does what it does, not what it should do. The tests are tautological; they verify the implementation against itself.
def test_split_energy():
    result = split_energy(session)
    assert result is not None  # passes whether the function works or not

You end up reviewing both the code and the tests with equal suspicion, which means the tests haven’t saved you any work.
This is the real cost of vibe coding. Not that the code is bad (it’s often decent). The cost is that you can’t tell when it’s bad without reading every line yourself, which means the speed gain from AI-assisted development disappears entirely.
Formal verification should fix this. A mathematical proof that a function satisfies its specification cannot be faked. The proof checker, whether Lean’s small trusted kernel or Dafny’s Z3-backed solver, is a binary gate. It either accepts the proof or it doesn’t. There’s no way to write a tautological proof the way you can write a tautological test. If the specification says “these two periods must sum to the total,” and the proof goes through, then they sum to the total. Period.
What I built
The result is a plugin for Claude Code called Crosscheck. It uses Dafny, a verification language backed by the Z3 theorem prover, as the verification backend. The workflow is straightforward:
You describe what you want to verify in natural language
The system generates a Dafny specification: preconditions, postconditions, loop invariants
You review the spec (this is the human checkpoint: the spec is small, readable, close to your intent)
The system iterates on the Dafny code until the prover accepts it
Once verified, it compiles the Dafny to Python, Go, JavaScript, Java, or C#. No other verification language has that breadth of native compilation targets
A typical spec is a handful of requires and ensures clauses (preconditions and postconditions) that fit on a screen. That’s the entire human review surface. If the postconditions match your intent, everything downstream is mechanically guaranteed: the implementation, the compilation to Python, all of it.
I chose Dafny over Lean (the language the half-billion-dollar companies use) deliberately. Both are programming languages and proof assistants, but they optimise for different workflows. Lean’s only production compilation target is C; it’s increasingly used to write verified software directly (more on this below). Dafny compiles verified code to Python, Go, JavaScript, Java, and C#. For “write new verified logic and ship it to a Django or Express codebase,” Dafny is the more practical choice today. The funded companies are betting that Lean’s ecosystem will eventually cover the full stack; I needed something that could slot into existing production code now.
The field test
I used Crosscheck on a real problem at work: energy-splitting logic for a Django rewards system. When a charging session spans a month boundary, its energy needs to be split across the two months so that each period gets the correct attribution, which feeds into how we calculate user rewards downstream. Getting the arithmetic wrong means misattributing reward payments and potentially affecting revenue: serious enough to warrant formal verification, yet pedestrian enough to be representative of real production code. I picked it because, well, it was assigned to me in Jira, but pure arithmetic with clear postconditions is the best possible case for formal verification.
If it couldn’t prove its value here, it couldn’t prove it anywhere.
The spec looked like this:
method SplitEnergy(total: int, offPeak: int)
  returns (period1: int, period2: int, period1OffPeak: int, period2OffPeak: int)
  requires total >= 0
  requires 0 <= offPeak <= total
  ensures period1 + period2 == total
  ensures period1 >= 0 && period2 >= 0
  ensures period1OffPeak + period2OffPeak == offPeak
  ensures period1OffPeak >= 0 && period2OffPeak >= 0

Eight lines. The requires clauses are preconditions (the total is non-negative; the off-peak portion is non-negative and can’t exceed the total). The ensures clauses are postconditions (both the total energy and the off-peak component are conserved across the split, and nothing goes negative). If you agree those postconditions capture your intent, the prover handles the rest.
The Dafny verification worked beautifully. Five proof obligations, all discharged automatically on the first attempt. Zero iterations needed.
The math was trivially correct.
And then I tried to integrate the verified code into the Django application, and everything fell apart.
The twist
Four bugs surfaced during integration. None of them were expressible in Dafny. None of them had anything to do with the arithmetic being wrong. They fell into two categories.
Two were contract mismatches between production components:
Bug 1: Decimal precision mismatch. The verified function computed energy values to 6 decimal places. The Django model’s DecimalField only allowed 3. The constraint lived in a different file, enforced at ORM save time. The proof guaranteed correctness to six decimal places. The database would reject anything past three.
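The mismatch is easy to state in code. A minimal sketch, with hypothetical names (`MAX_DECIMAL_PLACES` and `fits_field` stand in for the real constraint, which lives in a Django DecimalField and is enforced at save time):

```python
from decimal import Decimal

# Hypothetical mirror of the model's DecimalField(decimal_places=3) constraint.
MAX_DECIMAL_PLACES = 3

def fits_field(value: Decimal) -> bool:
    # Decimal("12.345").as_tuple().exponent == -3, i.e. three fractional digits.
    return -value.as_tuple().exponent <= MAX_DECIMAL_PLACES

fits_field(Decimal("12.345"))     # three decimal places: the field accepts it
fits_field(Decimal("12.345678"))  # the verified six-place output does not fit
```

The proof and the field constraint are each internally consistent; nothing in the Dafny spec mentions precision at all.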
Bug 2: Enum coercion. session.type.value returned the integer 0 instead of the string "SMART". A Python enum’s .value behaved differently than expected, so the type-checking branch never triggered.
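The behaviour is standard Python and reproducible in isolation (`SessionType` here is a hypothetical stand-in for the real enum):

```python
from enum import Enum

class SessionType(Enum):
    SMART = 0      # members defined with integer values
    STANDARD = 1

session_type = SessionType.SMART

# The branch compared against the string, so it never fired:
assert session_type.value != "SMART"
# .value is whatever the member was assigned - here the int 0:
assert session_type.value == 0
# the string "SMART" lives on .name instead:
assert session_type.name == "SMART"
```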
Two were test theatre, the very failure mode I’d built formal verification to escape:
Bug 3: Silent skip on missing data. Test factories didn’t set transaction_period on session objects. The split logic checked for this field and silently returned early when it was missing. Tests passed because the function “worked”; it just didn’t do anything. This is test theatre in its purest form: a tautological test validating a no-op.
Bug 4: Test framework type opinions. A custom TestCase base class overrode assertEqual in a way that blocked comparison of Decimal objects. The verified logic was correct; the test infrastructure had opinions about types that prevented validating it.
The production bugs were contract-consistency failures. Not errors in any component’s logic: the function was correct, the model was correct, the enum definition was correct. They just disagreed about what they were promising each other. And formal verification had nothing to say about the disagreement, because it verified each component in isolation.
The test bugs were ironic. I’d built a formal verification tool specifically to escape test theatre, and the verified code passed its tests for the wrong reasons anyway. The proofs guaranteed the arithmetic was correct. The tests didn’t verify the arithmetic at all — one tested a no-op, the other couldn’t compare the output types. Formal verification solved the problem it was aimed at. Test theatre showed up in the infrastructure around it.
The fix: specs as test oracles
The fix turned out to be obvious in retrospect: use the spec itself as a test oracle. The Dafny postconditions (period1 + period2 == total, period1OffPeak + period2OffPeak == offPeak) are exactly the properties that the integration tests should assert on the Python output. A Hypothesis property test generated from those postconditions would have caught the silent skip immediately (a no-op doesn’t satisfy period1 + period2 == total) and bypassed the broken assertEqual entirely (it uses direct arithmetic comparison). I’ve since updated Crosscheck to recommend spec-derived property tests at the integration boundary, using the verified postconditions as the test oracle. The spec bridges the gap: proven correct in Dafny, enforced in Python.
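The shape of such a test, sketched with a plain randomized loop (Hypothesis would add input shrinking and smarter generation, but the oracle is the same; the `split_energy(total, off_peak)` signature is a simplified, hypothetical stand-in for the real session-based one):

```python
import random

def check_split(split_energy, trials=1000):
    """Spec-derived oracle: assert the Dafny postconditions on the Python output."""
    for _ in range(trials):
        total = random.randint(0, 10_000)
        off_peak = random.randint(0, total)     # mirrors: requires 0 <= offPeak <= total
        p1, p2, p1_off, p2_off = split_energy(total, off_peak)
        assert p1 + p2 == total                 # ensures period1 + period2 == total
        assert p1_off + p2_off == off_peak      # ensures the off-peak split is conserved
        assert p1 >= 0 and p2 >= 0 and p1_off >= 0 and p2_off >= 0

# The silent-skip bug is a no-op wearing a return statement:
def silent_skip(total, off_peak):
    return 0, 0, 0, 0

# check_split(silent_skip) fails the first postcondition as soon as total > 0
```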
You might be thinking: would a different verifier have caught the integration bugs? Lean has dependent types, which can encode constraints like decimal precision directly in the type system. It has inductive types for enums, where you pattern-match on constructors rather than calling .value and getting a dynamically-typed result back. If the entire system were written in Lean, the enum coercion bug would be structurally impossible, and the precision mismatch would be catchable if precision were encoded in the type. But I wasn’t writing the entire system in Lean. I was verifying a single function and shipping it back to a Django codebase. Any verifier used in this workflow (Dafny, Lean, Coq, anything) faces the same integration boundary. The verified function compiles to Python, where it meets Python objects, Python enums, and Python test factories.
The proofs stop at the language boundary. The bugs don’t.
Where production logic actually lives
I built static analyzers for Python and TypeScript that classify every function in a codebase by what it does: ORM queries, framework/view logic, external IO, or pure computation with no side effects. I ran them against 14 open-source codebases (Saleor, Zulip, NetBox, Cal.com, Strapi, VS Code, Home Assistant, among others) covering roughly 2.5 million lines of production code.
The pure function slice, the code that formal verification tools like Dafny and Lean can actually reason about, accounts for about 22-27% of lines of code across full-stack web applications. The number is remarkably stable regardless of language or framework. When I ran the same analyzer against the ev.energy codebase, it came back at 28%. The verified arithmetic lived in the quarter that’s reachable. The production contract bugs lived in the three-quarters that isn’t. The test theatre bugs lived outside the codebase entirely, in the test infrastructure.
The analyzers classify by absence of detected side effects, not proven purity, so the true formally verifiable percentage is likely 15-25%. And much of that quarter is trivial formatting and string manipulation where formal verification would be overkill. The slice that’s both verifiable and worth verifying is smaller still.
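A toy version of the classification idea, assuming a small set of known side-effect markers (the real analyzers are considerably more involved, and as noted they detect the absence of effects rather than prove purity):

```python
import ast

# Hypothetical marker set; real analyzers track ORM, framework and IO calls.
EFFECT_MARKERS = {"print", "open", "save", "filter", "get", "execute", "post"}

def classify(func_source: str) -> str:
    """Return 'pure-looking' if no call targets a known side-effect marker.
    Absence of detected effects, not proven purity."""
    tree = ast.parse(func_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in EFFECT_MARKERS:
                return "effectful"
    return "pure-looking"

classify("def f(x, y):\n    return x + y")                  # pure computation
classify("def g(pk):\n    return User.objects.get(pk=pk)")  # ORM query
```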
There’s a structural reason this gap won’t close with better provers. The most efficient places to put computation (SQL queries, database internals, ORM-level aggregations) are the least amenable to formal verification, despite active research on SQL formalisation. Replacing a single GROUP BY with “fetch all rows, then run a verified aggregation in Python” is a performance regression your users will feel and your infrastructure team will reject. In a typical web application, the places where we can verify tend to be where bugs are cheapest: an off-by-one in a utility function. The places where bugs are most expensive, like a silent data corruption from a mismatched ORM constraint, are the places we can’t verify. But the funded companies know this, and their answer is more radical than I initially assumed.
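The trade-off is concrete even on a toy table, with sqlite3 standing in for the production database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sessions (user_id INTEGER, energy INTEGER)")
con.executemany("INSERT INTO sessions VALUES (?, ?)",
                [(1, 10), (1, 5), (2, 7)])

# Efficient path: the aggregation runs inside the database engine,
# whose GROUP BY implementation no prover in this workflow reasons about.
in_db = dict(con.execute(
    "SELECT user_id, SUM(energy) FROM sessions GROUP BY user_id"))

# "Verifiable" path: drag every row into Python and aggregate in code a
# prover could reach - same answer, but O(rows) transferred over the wire.
in_python = {}
for user_id, energy in con.execute("SELECT user_id, energy FROM sessions"):
    in_python[user_id] = in_python.get(user_id, 0) + energy

assert in_db == in_python
```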
What half a billion buys
The ambition goes further than better provers. Leonardo de Moura, the creator of Lean, published an essay in February that directly addresses the integration problem. His diagnosis is identical to mine:
Today, integration is where most bugs live: the boundaries between components, the assumptions that do not quite match.
His answer is different. Rather than verify the existing messy stack, rebuild it. Layer by layer, reconstruct the critical software stack with proofs built in: cryptography first, then core libraries, storage engines like SQLite, parsers, and compilers. When every component carries a proof against a shared specification, he argues, composition is guaranteed correct by construction.
This isn’t just theory. The Lean FRO already has an AI agent that converted zlib (the compression library embedded in virtually everything) to Lean, with a machine-checked proof that decompression always returns the original data. Veil, a distributed protocol verifier built on Lean, combines model checking with full formal proof and has verified consensus protocols. AWS verified its Cedar authorization engine in Lean. These are real systems, not slides.
This is where my own argument started to make me uncomfortable. The enum coercion would be structurally impossible in Lean’s type system; there’s no .value that silently returns a different type. The precision mismatch would be catchable with dependent types. Even the test theatre bugs would evaporate: you can’t construct an object with a missing required field when the type system demands it. These bugs exist because the system is built from components in different languages with implicit contracts between them. In a fully verified stack, those contracts would be explicit and checked.
The question is whether that reconstruction reaches the application layer (Django, Rails, Express) in any meaningful timeframe. De Moura’s targets are infrastructure: crypto, compression, parsers, storage engines. There’s an enormous gap between “verified SQLite” and “verified Django ORM on top of verified SQLite on top of verified Python runtime.” For scale: Benzaken and Contejean’s SQLCoq project has spent over a decade formalising SQL semantics in Coq, covering SELECT, GROUP BY, aggregates, and nested subqueries, and still describes ORDER BY as future work. Window functions aren’t on the roadmap. A Django view stacks four or five such incompletely-formalised systems on top of each other. Even with AI dramatically accelerating proof generation, the gap between “verified zlib” and “verified Django” is measured in layers of unformalised semantics, not just engineering effort.
What would actually help
After hitting two production contract bugs and two test theatre bugs that no prover could catch, I started thinking about what would have caught them. The production bugs have a structurally simple answer: check that the contracts between components are consistent.
Does the function’s output precision satisfy the model’s field constraints? Does the enum’s runtime value type match what the branch condition expects?
These are interface questions, not implementation questions. And interface consistency is exactly the kind of pure logic an SMT solver handles well.
A contract graph verifier (a system that discovers, expresses, and checks the consistency of contracts between components) would have caught both production bugs. Not by proving any component correct, but by proving that what one component promises is what the next component expects. For the energy split, the check would look something like this:
# Contract edge: split_energy() → RewardRecord model
split_energy.guarantees: precision ≤ 6
RewardRecord.requires:   precision ≤ 3

# Does guarantee ⟹ assumption?
# (precision ≤ 6) ⟹ (precision ≤ 3)? NO → flag mismatch

The guarantee doesn’t satisfy the assumption. That’s a flaggable inconsistency, no proof of the function’s internals needed. A contract graph verifier is the sort of tool you invent when therapy is too expensive.
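For simple bound contracts like this one, the implication an SMT solver would discharge collapses to a single comparison. A minimal sketch with hypothetical names:

```python
def guarantee_satisfies(guaranteed_max: int, required_max: int) -> bool:
    """(precision <= guaranteed_max) implies (precision <= required_max)
    for all precisions iff guaranteed_max <= required_max."""
    return guaranteed_max <= required_max

# split_energy guarantees precision <= 6; RewardRecord requires precision <= 3:
guarantee_satisfies(6, 3)  # False: flag the contract mismatch
guarantee_satisfies(3, 6)  # True: a 3-place guarantee meets a 6-place bound
```

Richer contracts (enum value types, nullability, unit consistency) need a real solver, but the check is still over interfaces, never implementations.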
This idea exists in embedded systems and SysML tooling, where contract-based verification of composed components is well established. But no production tool does formal contract graph verification for the stack most of us actually work in: Django, Rails, Spring, Express. Production tools like Pact verify contracts pairwise (do this consumer’s expectations match this provider’s responses?) but they don’t verify the composition: are all the contracts in the system mutually consistent?
Pact checks edges. No production tool checks the graph.
The reconstruction vision, rebuilding the stack in Lean with verified interfaces, would solve this problem by construction. If every component carries a proof against a shared specification, composition is guaranteed. But that’s a multi-year, possibly multi-decade project for the application layer. Contract graph verification is the pragmatic answer for the existing stack: extract the implicit contracts from the code as it exists today, express them formally, and check consistency. It doesn’t require rewriting anything. It just makes the implicit assumptions explicit and verifiable.
The obvious objection is “just use a stronger type system.” And it’s partly right; a stricter typed language would catch some of these bugs. But types don’t cross the ORM boundary. You can’t express max_digits=3 in a type signature. You can’t type-check that an enum’s .value returns a string rather than an integer. Design-by-contract languages like Eiffel check contracts within a language boundary; the contracts that break in production span across system boundaries (function to ORM to database schema) and are implicit in framework behaviour, database schemas, and runtime configuration. Even Rust and Haskell codebases have integration boundaries with databases, APIs, and configuration that their type systems can’t reach. Another objection: “just write better integration tests.” But integration tests check the cases you thought of. Contract consistency checking verifies that interfaces are compatible for all inputs — including the edge cases your tests didn’t cover.
This is one case study, not a survey. But the bug categories (contract mismatches between components, test theatre in the infrastructure around verified code) will be familiar to anyone who’s worked in a full-stack codebase. The tool I want is not a better prover for the quarter of code that’s formally verifiable, but a system that checks whether the contracts across the other three-quarters are at least consistent with each other. Formally proven where you can, property-tested where you can’t, and for the rest, at least documented so you know where the assumptions are. A trust chain for code correctness, analogous to TLS certificate chains, where you verify the chain of guarantees rather than every component individually.
The provers are ready. The question is whether we’ll structure our code so they can reach the parts that matter, or just keep verifying the easy parts and calling it done.
I’m a Senior Full Stack Engineer at ev.energy, a smart EV charging platform. The Django rewards code in this post is real work. Crosscheck is my attempt to convince teammates to trust AI-generated code. The bugs and opinions are my own.

