Reverse-Engineering Spring Petclinic with AI Agents: 578 Tests, 95% API Compatibility
How we used spec-driven AI agents to reverse-engineer 8 Java microservices into Python/FastAPI — 122 tasks completed deterministically, 578 tests passing — and what this reveals about AI's real capability frontier.
There's a common skepticism about AI-assisted coding that goes something like this: "Sure, AI can write a to-do app, but can it handle real software — complex, multi-service architectures with actual business logic?"
We decided to test this directly. Spring Petclinic is the canonical Java microservices reference application — 8 interconnected services, service discovery, API gateway, config server, distributed tracing. It's not trivial. It has genuine architectural complexity and well-documented expected behavior.
The challenge: reverse-engineer all 8 microservices from Java/Spring Boot to Python/FastAPI, using AI agents as the primary builders, with 95%+ API compatibility. Not a rewrite guided by a human developer reading the source. A systematic, agent-driven translation where the agents do the implementation work and the methodology ensures correctness.
The result: 578 tests passing, 95%+ API compatibility, 122 tasks tracked and completed deterministically. Every task verified by automated tests. Zero manual coding of business logic.
Why reverse-engineering is the hard test
Forward development — building something new from a spec — is the easy case for AI agents. The agent has freedom to make design decisions, and there's no reference implementation to match exactly.
Reverse-engineering is harder because correctness is externally defined. The original Java services are the ground truth. Every API endpoint must accept the same inputs and return the same outputs. Every error case must behave the same way. Every edge case in the original's behavior is a test case for the rewrite.
This is exactly the kind of task where sloppy AI assistance falls apart. A human developer asking ChatGPT "convert this Java service to Python" will get something that looks right but silently differs in dozens of edge cases — date formatting, null handling, validation order, error response structure. These differences compound across 8 services until the system doesn't actually work.
We needed a methodology that would catch every one of these differences systematically.
The methodology: spec-driven decomposition
The process followed three phases:
Phase 1: Specification extraction
Before any Python code existed, AI agents analyzed each Java microservice and produced a structured specification:
- API surface — Every endpoint, request format, response format, status codes, and error responses
- Data model — Entity schemas, relationships, validation constraints
- Service interactions — Which services call which, message formats, failure modes
- Configuration — Environment variables, feature flags, default values
The specs weren't documentation. They were executable contracts — each API behavior became a test case.
# Example: Vet Service spec (partial)
endpoints:
  GET /vets:
    response_200:
      body: array of Vet objects
      fields: [id, firstName, lastName, specialties]
      ordering: by lastName ascending
      pagination: none (returns all)
    test_cases:
      - name: "returns all vets with specialties populated"
        expected_count: 6  # matches seed data
      - name: "specialty is array, not single value"
        assert: response[0].specialties is Array
  GET /vets/{id}:
    response_200: single Vet object
    response_404:
      body: {"message": "Vet not found"}
      status: 404
    test_cases:
      - name: "returns vet by id"
        input: {id: 1}
        assert: response.firstName == "James"
      - name: "returns 404 for nonexistent id"
        input: {id: 9999}
        assert: status == 404
Phase 2: Task decomposition with Beads
122 tasks. Not a rough estimate — the exact number of discrete, testable units of work needed to rewrite 8 services.
We used Beads to decompose the specification into an ordered task graph. Each task had:
- A clear scope (one endpoint, one model, one integration)
- Acceptance criteria (specific tests that must pass)
- Dependencies (what must be built first)
- Verification command (pytest tests/test_vet_service.py -k "test_get_all_vets")
bd list --status open

○ pc-001 ● P0 Vet service: data models and DB schema
○ pc-002 ● P0 Vet service: GET /vets endpoint
    → depends on: pc-001
○ pc-003 ● P0 Vet service: GET /vets/{id} endpoint
    → depends on: pc-001
○ pc-004 ● P1 Vet service: seed data matching Spring fixtures
    → depends on: pc-001
○ pc-005 ● P1 Owner service: data models and DB schema
...
○ pc-122 ● P2 API Gateway: route configuration and load balancing
    → depends on: pc-098, pc-099, pc-100
The dependency graph ensured agents couldn't build an endpoint before its data model existed, couldn't build an integration before both services were complete, couldn't build the gateway before all services were ready.
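The scheduling rule the graph enforces is simple to state: a task is eligible only when every task it depends on is complete. A minimal sketch (the task IDs and edges here are illustrative, not the project's real graph):

```python
# Sketch: selecting the next unblocked task from a dependency graph.
# Task IDs and dependency edges are illustrative examples.
tasks = {
    "pc-001": {"deps": [], "done": False},          # data models
    "pc-002": {"deps": ["pc-001"], "done": False},  # GET /vets
    "pc-003": {"deps": ["pc-001"], "done": False},  # GET /vets/{id}
}

def next_unblocked(tasks):
    """Return the first open task whose dependencies are all complete."""
    for tid, t in sorted(tasks.items()):
        if not t["done"] and all(tasks[d]["done"] for d in t["deps"]):
            return tid
    return None

assert next_unblocked(tasks) == "pc-001"  # only the model task is unblocked
tasks["pc-001"]["done"] = True
assert next_unblocked(tasks) == "pc-002"  # endpoints unblock once models exist
```

This is ordinary topological ordering; the point is that the agent never chooses what to build next, the graph does.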
Phase 3: Agent-driven implementation
Each task followed a strict TDD cycle:
- Write the test first — derived directly from the spec
- Run the test — it must fail (red phase)
- Implement the code — AI agent writes the Python/FastAPI implementation
- Run the test — it must pass (green phase)
- Run the full suite — no regressions
- Mark task complete in Beads — move to next task
The AI agent doing the implementation had access to: the specification, the test that needs to pass, the existing codebase (previously completed tasks), and the original Java source for reference. It did not have freedom to skip tests, modify existing tests, or mark tasks done without passing verification.
// The agent loop (simplified)
while (hasRemainingTasks()) {
  const task = getNextUnblockedTask();
  markInProgress(task);

  // TDD: Red
  const testCode = agent.writeTest(task.spec, task.acceptanceCriteria);
  const redResult = runTests(testCode);
  assert(redResult.failing > 0, "New test must fail before implementation");

  // TDD: Green
  const implCode = agent.implement(task.spec, testCode, existingCode);
  const greenResult = runTests(testCode);

  if (greenResult.passing === greenResult.total) {
    // No regressions
    const fullSuiteResult = runFullSuite();
    if (fullSuiteResult.regressions === 0) {
      markComplete(task);
      commit(`feat(${task.id}): ${task.title}`);
    }
  }
}
The Ralph Loop: build-evaluate as engineering discipline
The methodology powering this project is what I call the Ralph Loop — a structured build-evaluate cycle that treats AI agent output as untrusted until verified:
- Specify — Define what correct looks like, concretely
- Decompose — Break into atomic, testable tasks with dependencies
- Build — AI agent implements one task at a time
- Verify — Automated tests confirm correctness
- Evaluate — Separate evaluator agent reviews the build holistically
- Fix — Address evaluator findings, re-verify
The critical principle: the builder and the evaluator are separate agents with separate incentives. The builder optimizes for making tests pass. The evaluator optimizes for finding problems. This adversarial structure (loosely analogous to a GAN's generator and discriminator) prevents the common failure mode where an agent marks its own work as complete without genuine validation.
For the Petclinic project, the evaluator caught several categories of issues that tests alone missed:
- Behavioral divergence — The Python service returned dates in ISO format while Java used a custom format. Tests checked for a date, not the exact format.
- Missing error cases — Some endpoints had error behaviors in Java that weren't in the spec (and thus had no tests). The evaluator found these by comparing response codes systematically.
- Integration ordering — Individual services worked, but the startup sequence didn't match Spring's service discovery pattern. The evaluator tested the full stack, not just individual services.
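The date-format divergence shows why string-level comparison matters: a test that only asks "is a date present?" passes on both sides, so only a raw comparison against the Java response catches the drift. A small illustration (the two format patterns are examples, not the project's exact ones):

```python
# Illustration: a lenient date check passes while wire formats diverge.
# Both format patterns here are examples, not the project's actual ones.
import re
from datetime import date

d = date(2013, 1, 1)
java_style = d.strftime("%Y/%m/%d")  # e.g. a Spring custom "yyyy/MM/dd" pattern
python_style = d.isoformat()         # FastAPI's default ISO output

def lenient_check(s):
    """A test that only asks 'does this look like a date?' — passes for both."""
    return re.fullmatch(r"\d{4}[-/]\d{2}[-/]\d{2}", s) is not None

assert lenient_check(java_style) and lenient_check(python_style)  # test passes
assert java_style != python_style  # but the wire formats differ: divergence
```

The evaluator's job is exactly this second assertion, applied systematically: compare the raw bytes of recorded Java responses against the Python service's output, not just each side against a loose predicate.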
Results by the numbers
After 122 tasks completed:
| Metric | Result |
|--------|--------|
| Total tests | 578 |
| Tests passing | 578 (100%) |
| API endpoints covered | 43 |
| API compatibility | 95.4% |
| Services fully ported | 8/8 |
| Manual code written | 0 lines of business logic |
| Tasks tracked in Beads | 122 |
| Evaluator rounds | 3 |
The 4.6% API incompatibility is in areas where Python/FastAPI fundamentally handles things differently than Java/Spring — response header ordering, specific exception message formatting, and XML content negotiation (we chose to support JSON only).
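To make a figure like 95.4% concrete: a compatibility score can be computed by diffing recorded Java responses against the Python service's responses, field by field. The per-field granularity below is an assumption for illustration; the project's exact rubric isn't shown here:

```python
# Sketch: scoring API compatibility by diffing recorded response pairs.
# Counting one check per field is an illustrative choice, not the real rubric.
def compatibility(pairs):
    """pairs: (java_response, python_response) dicts with
    'status', 'headers', and 'body' keys; each field is one check."""
    checks = matches = 0
    for java, py in pairs:
        for field in ("status", "headers", "body"):
            checks += 1
            matches += java[field] == py[field]
    return 100.0 * matches / checks

pairs = [
    ({"status": 200, "headers": ["X-A", "X-B"], "body": {"id": 1}},
     {"status": 200, "headers": ["X-B", "X-A"], "body": {"id": 1}}),
]
# Status and body match; header ordering differs — 2 of 3 checks pass.
print(f"{compatibility(pairs):.1f}%")
```

Note that header ordering is exactly one of the divergence classes counted against the score above.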
What this proves — and what it doesn't
What it proves: AI agents, given the right methodology, can execute large-scale, complex software engineering tasks with deterministic outcomes. The key is not the model's capability in isolation — it's the harness: spec-driven decomposition, dependency-ordered execution, automated verification at every step, and adversarial evaluation.
What it doesn't prove: That AI agents can architect these systems from scratch. The specification and task decomposition required engineering judgment about what to build, in what order, and what the acceptance criteria should be. The agents were excellent executors. The methodology was the architect.
This distinction matters. The current discourse oscillates between "AI will replace all programmers" and "AI is just autocomplete." The truth is more specific: AI agents are extraordinarily capable builders when given well-structured work and rigorous verification. The engineering skill shifts from writing code to designing the harness that ensures the code is correct.
Orchestrating 122 tasks across context resets
One practical challenge deserves mention: context limits. No AI model can hold 122 tasks, 8 service specifications, and a growing codebase in a single context window. The work necessarily spans multiple sessions.
Beads solves this by maintaining the task graph outside any agent's context. When an agent starts a new session, it reads the current state from Beads: what's done, what's in progress, what's blocked, what's next. The agent doesn't need to remember anything from previous sessions — the truth is in the task tracker.
# Agent starts a new session
bd list --status open # What needs doing?
bd show pc-047 # What are the details of the next task?
bd update pc-047 --status in_progress # Claim it
# ... do the work ...
bd close pc-047 # Done
bd list --status open # What's next?
This pattern — externalizing agent memory into a structured task graph — is what makes long-horizon agent work possible. The individual agent sessions are stateless. The task tracker carries the state. This is the same principle behind job queues in distributed systems, applied to AI agent orchestration.
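The job-queue analogy can be made concrete: each session is a stateless worker that claims, executes, and releases work through the external store. A minimal sketch, where a plain dict stands in for Beads (a real store would persist across processes):

```python
# Sketch: stateless agent sessions driven by an external task store.
# The dict stands in for Beads; only the store carries state between sessions.
store = {
    "pc-046": "closed",
    "pc-047": "open",
    "pc-048": "open",
}

def run_session(store):
    """One stateless session: read state, claim the next open task, close it."""
    open_tasks = [t for t, s in sorted(store.items()) if s == "open"]
    if not open_tasks:
        return None                      # nothing left to do
    task = open_tasks[0]
    store[task] = "in_progress"          # analogous to: bd update --status in_progress
    # ... implementation and verification would happen here ...
    store[task] = "closed"               # analogous to: bd close
    return task

assert run_session(store) == "pc-047"  # session 1 resumes from store state
assert run_session(store) == "pc-048"  # session 2 remembers nothing of session 1
assert run_session(store) is None      # all work done
```

Each call to `run_session` holds no memory of any previous call; the store alone determines what happens next, which is what lets the work span arbitrarily many context resets.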
The spec-driven future
The Petclinic project convinced me that spec-driven development isn't just a methodology preference — it's a prerequisite for reliable AI-assisted engineering. Without executable specifications, you're relying on the model's judgment about what "correct" means. With them, correctness is defined, measurable, and enforced.
The toolchain for this is still early. Beads handles task orchestration. The eval framework handles verification. The adversarial evaluator handles holistic quality. But each piece required custom engineering. The opportunity is in making this methodology accessible — giving every developer a harness that turns AI agents from probabilistic assistants into deterministic builders.
The 578 passing tests aren't impressive because an AI wrote them. They're impressive because every one of them means something — a specific behavior, specified in advance, verified automatically, that the system provably implements correctly. That's not demo-ware. That's engineering.