Reverse-Engineering Spring Petclinic with AI Agents: 578 Tests, 95% API Compatibility
How we used spec-driven AI agents to reverse-engineer 8 Java microservices into Python/FastAPI — 122 tasks completed deterministically, 578 tests passing — and what this reveals about AI's real capability frontier.
There's a common skepticism about AI-assisted coding that goes something like this: "Sure, AI can write a to-do app, but can it handle real software — complex, multi-service architectures with actual business logic?"
We decided to test this directly. Spring Petclinic is the canonical Java microservices reference application — 8 interconnected services, service discovery, API gateway, config server, distributed tracing. It's not trivial. It has genuine architectural complexity and well-documented expected behavior.
The challenge: reverse-engineer all 8 microservices from Java/Spring Boot to Python/FastAPI, using AI agents as the primary builders, with 95%+ API compatibility. Not a rewrite guided by a human developer reading the source. A systematic, agent-driven translation where the agents do the implementation work and the methodology ensures correctness.
The result: 578 tests passing, 95%+ API compatibility, 122 tasks tracked and completed deterministically. Every task verified by automated tests. Zero manual coding of business logic.
Why reverse-engineering is the hard test
Forward development — building something new from a spec — is the easy case for AI agents. The agent has freedom to make design decisions, and there's no reference implementation to match exactly.
Reverse-engineering is harder because correctness is externally defined. The original Java services are the ground truth. Every API endpoint must accept the same inputs and return the same outputs. Every error case must behave the same way. Every edge case in the original's behavior is a test case for the rewrite.
This is exactly the kind of task where sloppy AI assistance falls apart. A human developer asking ChatGPT "convert this Java service to Python" will get something that looks right but silently differs in dozens of edge cases — date formatting, null handling, validation order, error response structure. These differences compound across 8 services until the system doesn't actually work.
We needed a methodology that would catch every one of these differences systematically.
The methodology: spec-driven decomposition
The process followed three phases:
Phase 1: Specification extraction
Before any Python code existed, AI agents analyzed each Java microservice and produced a structured specification:
- API surface — Every endpoint, request format, response format, status codes, and error responses
- Data model — Entity schemas, relationships, validation constraints
- Service interactions — Which services call which, message formats, failure modes
- Configuration — Environment variables, feature flags, default values
The specs weren't documentation. They were executable contracts — each API behavior became a test case.
# Example: Vet Service spec (partial)
endpoints:
  GET /vets:
    response_200:
      body: array of Vet objects
      fields: [id, firstName, lastName, specialties]
      ordering: by lastName ascending
      pagination: none (returns all)
    test_cases:
      - name: "returns all vets with specialties populated"
        expected_count: 6  # matches seed data
      - name: "specialty is array, not single value"
        assert: response[0].specialties is Array
  GET /vets/{id}:
    response_200: single Vet object
    response_404:
      body: {"message": "Vet not found"}
      status: 404
    test_cases:
      - name: "returns vet by id"
        input: {id: 1}
        assert: response.firstName == "James"
      - name: "returns 404 for nonexistent id"
        input: {id: 9999}
        assert: status == 404
Phase 2: Task decomposition with Beads
122 tasks. Not a rough estimate — the exact number of discrete, testable units of work needed to rewrite 8 services.
We used Beads to decompose the specification into an ordered task graph. Each task had:
- A clear scope (one endpoint, one model, one integration)
- Acceptance criteria (specific tests that must pass)
- Dependencies (what must be built first)
- Verification command (pytest tests/test_vet_service.py -k "test_get_all_vets")
bd list --status open

○ pc-001 ● P0 Vet service: data models and DB schema
○ pc-002 ● P0 Vet service: GET /vets endpoint
    → depends on: pc-001
○ pc-003 ● P0 Vet service: GET /vets/{id} endpoint
    → depends on: pc-001
○ pc-004 ● P1 Vet service: seed data matching Spring fixtures
    → depends on: pc-001
○ pc-005 ● P1 Owner service: data models and DB schema
...
○ pc-122 ● P2 API Gateway: route configuration and load balancing
    → depends on: pc-098, pc-099, pc-100
The dependency graph ensured agents couldn't build an endpoint before its data model existed, couldn't build an integration before both services were complete, couldn't build the gateway before all services were ready.
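The scheduling rule the graph enforces is simple to state: a task is eligible only when every task it depends on is complete. A minimal sketch (the task IDs and edges here are illustrative, not the project's real graph):

```python
# Sketch: selecting the next unblocked task from a dependency graph.
# Task IDs and dependency edges are illustrative examples.
tasks = {
    "pc-001": {"deps": [], "done": False},          # data models
    "pc-002": {"deps": ["pc-001"], "done": False},  # GET /vets
    "pc-003": {"deps": ["pc-001"], "done": False},  # GET /vets/{id}
}

def next_unblocked(tasks):
    """Return the first open task whose dependencies are all complete."""
    for tid, t in sorted(tasks.items()):
        if not t["done"] and all(tasks[d]["done"] for d in t["deps"]):
            return tid
    return None

assert next_unblocked(tasks) == "pc-001"  # only the model task is unblocked
tasks["pc-001"]["done"] = True
assert next_unblocked(tasks) == "pc-002"  # endpoints unblock once models exist
```

This is ordinary topological ordering; the point is that the agent never chooses what to build next, the graph does.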
Phase 3: Agent-driven implementation
Each task followed a strict TDD cycle:
- Write the test first — derived directly from the spec
- Run the test — it must fail (red phase)
- Implement the code — AI agent writes the Python/FastAPI implementation
- Run the test — it must pass (green phase)
- Run the full suite — no regressions
- Mark task complete in Beads — move to next task
The AI agent doing the implementation had access to: the specification, the test that needs to pass, the existing codebase (previously completed tasks), and the original Java source for reference. It did not have freedom to skip tests, modify existing tests, or mark tasks done without passing verification.
// The agent loop (simplified)
while (hasRemainingTasks()) {
  const task = getNextUnblockedTask();
  markInProgress(task);

  // TDD: Red
  const testCode = agent.writeTest(task.spec, task.acceptanceCriteria);
  const redResult = runTests(testCode);
  assert(redResult.failing > 0, "New test must fail before implementation");

  // TDD: Green
  const implCode = agent.implement(task.spec, testCode, existingCode);
  const greenResult = runTests(testCode);

  if (greenResult.passing === greenResult.total) {
    // No regressions
    const fullSuiteResult = runFullSuite();
    if (fullSuiteResult.regressions === 0) {
      markComplete(task);
      commit(`feat(${task.id}): ${task.title}`);
    }
  }
}
The Ralph Loop: build-evaluate as engineering discipline
The methodology powering this project is what I call the Ralph Loop — a structured build-evaluate cycle that treats AI agent output as untrusted until verified:
- Specify — Define what correct looks like, concretely
- Decompose — Break into atomic, testable tasks with dependencies
- Build — AI agent implements one task at a time
- Verify — Automated tests confirm correctness
- Evaluate — Separate evaluator agent reviews the build holistically
- Fix — Address evaluator findings, re-verify
The critical principle: the builder and the evaluator are separate agents with separate incentives. The builder optimizes for making tests pass. The evaluator optimizes for finding problems. This adversarial structure (loosely analogous to a GAN's generator and discriminator) prevents the common failure mode where an agent marks its own work as complete without genuine validation.
For the Petclinic project, the evaluator caught several categories of issues that tests alone missed:
- Behavioral divergence — The Python service returned dates in ISO format while Java used a custom format. Tests checked for a date, not the exact format.
- Missing error cases — Some endpoints had error behaviors in Java that weren't in the spec (and thus had no tests). The evaluator found these by comparing response codes systematically.
- Integration ordering — Individual services worked, but the startup sequence didn't match Spring's service discovery pattern. The evaluator tested the full stack, not just individual services.
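The date-format divergence shows why string-level comparison matters: a test that only asks "is a date present?" passes on both sides, so only a raw comparison against the Java response catches the drift. A small illustration (the two format patterns are examples, not the project's exact ones):

```python
# Illustration: a lenient date check passes while wire formats diverge.
# Both format patterns here are examples, not the project's actual ones.
import re
from datetime import date

d = date(2013, 1, 1)
java_style = d.strftime("%Y/%m/%d")  # e.g. a Spring custom "yyyy/MM/dd" pattern
python_style = d.isoformat()         # FastAPI's default ISO output

def lenient_check(s):
    """A test that only asks 'does this look like a date?' — passes for both."""
    return re.fullmatch(r"\d{4}[-/]\d{2}[-/]\d{2}", s) is not None

assert lenient_check(java_style) and lenient_check(python_style)  # test passes
assert java_style != python_style  # but the wire formats differ: divergence
```

The evaluator's job is exactly this second assertion, applied systematically: compare the raw bytes of recorded Java responses against the Python service's output, not just each side against a loose predicate.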
Results by the numbers
After 122 tasks completed:
| Metric | Result |
|--------|--------|
| Total tests | 578 |
| Tests passing | 578 (100%) |
| API endpoints covered | 43 |
| API compatibility | 95.4% |
| Services fully ported | 8/8 |
| Manual code written | 0 lines of business logic |
| Tasks tracked in Beads | 122 |
| Evaluator rounds | 3 |
The 4.6% API incompatibility is in areas where Python/FastAPI fundamentally handles things differently than Java/Spring — response header ordering, specific exception message formatting, and XML content negotiation (we chose to support JSON only).
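To make a figure like 95.4% concrete: a compatibility score can be computed by diffing recorded Java responses against the Python service's responses, field by field. The per-field granularity below is an assumption for illustration; the project's exact rubric isn't shown here:

```python
# Sketch: scoring API compatibility by diffing recorded response pairs.
# Counting one check per field is an illustrative choice, not the real rubric.
def compatibility(pairs):
    """pairs: (java_response, python_response) dicts with
    'status', 'headers', and 'body' keys; each field is one check."""
    checks = matches = 0
    for java, py in pairs:
        for field in ("status", "headers", "body"):
            checks += 1
            matches += java[field] == py[field]
    return 100.0 * matches / checks

pairs = [
    ({"status": 200, "headers": ["X-A", "X-B"], "body": {"id": 1}},
     {"status": 200, "headers": ["X-B", "X-A"], "body": {"id": 1}}),
]
# Status and body match; header ordering differs — 2 of 3 checks pass.
print(f"{compatibility(pairs):.1f}%")
```

Note that header ordering is exactly one of the divergence classes counted against the score above.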
What this proves — and what it doesn't
What it proves: AI agents, given the right methodology, can execute large-scale, complex software engineering tasks with deterministic outcomes. The key is not the model's capability in isolation — it's the harness: spec-driven decomposition, dependency-ordered execution, automated verification at every step, and adversarial evaluation.
What it doesn't prove: That AI agents can architect these systems from scratch. The specification and task decomposition required engineering judgment about what to build, in what order, and what the acceptance criteria should be. The agents were excellent executors. The methodology was the architect.
This distinction matters. The current discourse oscillates between "AI will replace all programmers" and "AI is just autocomplete." The truth is more specific: AI agents are extraordinarily capable builders when given well-structured work and rigorous verification. The engineering skill shifts from writing code to designing the harness that ensures the code is correct.
Orchestrating 122 tasks across context resets
One practical challenge deserves mention: context limits. No AI model can hold 122 tasks, 8 service specifications, and a growing codebase in a single context window. The work necessarily spans multiple sessions.
Beads solves this by maintaining the task graph outside any agent's context. When an agent starts a new session, it reads the current state from Beads: what's done, what's in progress, what's blocked, what's next. The agent doesn't need to remember anything from previous sessions — the truth is in the task tracker.
# Agent starts a new session
bd list --status open # What needs doing?
bd show pc-047 # What are the details of the next task?
bd update pc-047 --status in_progress # Claim it
# ... do the work ...
bd close pc-047 # Done
bd list --status open # What's next?
This pattern — externalizing agent memory into a structured task graph — is what makes long-horizon agent work possible. The individual agent sessions are stateless. The task tracker carries the state. This is the same principle behind job queues in distributed systems, applied to AI agent orchestration.
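The job-queue analogy can be made concrete: each session is a stateless worker that claims, executes, and releases work through the external store. A minimal sketch, where a plain dict stands in for Beads (a real store would persist across processes):

```python
# Sketch: stateless agent sessions driven by an external task store.
# The dict stands in for Beads; only the store carries state between sessions.
store = {
    "pc-046": "closed",
    "pc-047": "open",
    "pc-048": "open",
}

def run_session(store):
    """One stateless session: read state, claim the next open task, close it."""
    open_tasks = [t for t, s in sorted(store.items()) if s == "open"]
    if not open_tasks:
        return None                      # nothing left to do
    task = open_tasks[0]
    store[task] = "in_progress"          # analogous to: bd update --status in_progress
    # ... implementation and verification would happen here ...
    store[task] = "closed"               # analogous to: bd close
    return task

assert run_session(store) == "pc-047"  # session 1 resumes from store state
assert run_session(store) == "pc-048"  # session 2 remembers nothing of session 1
assert run_session(store) is None      # all work done
```

Each call to `run_session` holds no memory of any previous call; the store alone determines what happens next, which is what lets the work span arbitrarily many context resets.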
The spec-driven future
The Petclinic project convinced me that spec-driven development isn't just a methodology preference — it's a prerequisite for reliable AI-assisted engineering. Without executable specifications, you're relying on the model's judgment about what "correct" means. With them, correctness is defined, measurable, and enforced.
The toolchain for this is still early. Beads handles task orchestration. The eval framework handles verification. The adversarial evaluator handles holistic quality. But each piece required custom engineering. The opportunity is in making this methodology accessible — giving every developer a harness that turns AI agents from probabilistic assistants into deterministic builders.
The 578 passing tests aren't impressive because an AI wrote them. They're impressive because every one of them means something — a specific behavior, specified in advance, verified automatically, that the system provably implements correctly. That's not demo-ware. That's engineering.