Spec-Driven AI: Building an Expense Tracker That Learns from Corrections
How spec-first development and an eval framework turned a simple Gmail parser into a self-improving expense categorizer — and why building the eval harness before the feature is the most important habit in AI engineering.
Here's a pattern I see repeatedly in AI projects: a developer builds a working prototype in a weekend, demos it to applause, then spends the next three months discovering edge cases that the prototype handles incorrectly, with no systematic way to know if fixes introduce regressions.
The expense tracker started as a personal itch. Indian banks send transaction alerts via email — "Rs 450 debited from your account at SWIGGY" — and I wanted an AI to parse these, categorize them, and maintain a running ledger. Simple enough for a weekend prototype. But I wanted to build it correctly, which meant building the eval framework first.
The spec-first methodology
Before writing a single line of parsing logic, I wrote the spec. Not a vague product doc — a structured specification with concrete examples of every input format and expected output.
Indian bank alerts are deceptively varied. The same bank sends different formats for debit cards, credit cards, UPI, and NEFT transfers. Different banks use entirely different templates. Amounts appear as "Rs 450", "INR 450.00", "Rs. 4,50,234" (Indian comma notation), or "₹450". Merchant names are truncated, abbreviated, or replaced with payment processor names.
The spec captured 47 distinct email formats across 6 banks, each with:
interface TransactionSpec {
input: string; // Raw email body
expected: {
amount: number; // Normalized to paise
merchant: string; // Cleaned merchant name
category: string; // Food, Transport, Shopping, etc.
type: "debit" | "credit";
date: Date;
account: string; // Last 4 digits
};
edge_case?: string; // Why this example matters
}
Each spec entry is a test case. The eval framework runs all 47 on every change and reports accuracy by field. This isn't a nice-to-have bolted on after development — the eval harness is the first thing that exists.
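Concretely, a single spec entry is just data. Here is a hypothetical example in the shape of `TransactionSpec` (illustrative, not one of the real 47 cases):

```typescript
// A hypothetical spec entry in the TransactionSpec shape above.
// Typed structurally so the snippet stands alone.
const upiDebitCase = {
  input: "Your a/c XX4521 debited by Rs 450.00 at SWIGGY on 15-03-26",
  expected: {
    amount: 45000,              // paise, not rupees
    merchant: "Swiggy",
    category: "Food",
    type: "debit" as const,
    date: new Date("2026-03-15"),
    account: "4521",
  },
  edge_case: "DD-MM-YY date with a two-digit year",
};
```

The eval harness iterates over an array of these entries; nothing about a case is special beyond its data.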
Why evals before features
The temptation with AI is to build the feature, see it mostly work, and call it done. The problem is "mostly work." An LLM that correctly parses 90% of transactions sounds good until you realize the 10% failure rate means corrupted financial data every few days.
Building evals first inverts the incentive structure:
Without evals: You build the parser. It seems to work. You ship it. Users report bugs. You fix them one by one, with no confidence that fixes don't break other cases. You're playing whack-a-mole.
With evals: You define what "correct" means for 47 cases. You build the parser. It passes 38/47. You know exactly which 9 cases fail and why. You fix them, and the eval suite guarantees the 38 that worked still work. Every change has a measurable impact on correctness.
This is the same discipline that makes traditional software engineering work — test-driven development. The difference with AI systems is that the tests often need to be probabilistic, and the "implementation" is a prompt rather than deterministic code.
The parsing pipeline
The expense tracker's parsing pipeline has three stages:
Stage 1: Email extraction — The Gmail API fetches transaction alerts using filters. Raw email bodies are extracted, stripped of HTML, and normalized to plain text. This stage is entirely deterministic — no AI involved.
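A minimal sketch of that normalization step, assuming regex-based tag stripping (a production version would use a proper HTML parser):

```typescript
// Strip HTML and collapse whitespace from a raw email body.
// Deterministic preprocessing: no AI involved at this stage.
function normalizeEmailBody(rawHtml: string): string {
  return rawHtml
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop embedded CSS blocks
    .replace(/<[^>]+>/g, " ")                 // strip remaining tags
    .replace(/&nbsp;/g, " ")                  // common HTML entities
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")                     // collapse runs of whitespace
    .trim();
}
```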
Stage 2: Transaction parsing — Claude parses the normalized email into structured transaction data. The prompt includes few-shot examples covering each bank's format. This is where most complexity lives, and where the eval suite focuses.
const PARSE_PROMPT = `Extract transaction details from this bank alert email.
Return JSON with these fields:
- amount: number in paise (e.g., Rs 450 = 45000)
- merchant: cleaned merchant name (expand abbreviations)
- type: "debit" or "credit"
- date: ISO 8601 format
- account: last 4 digits of account/card
Examples:
---
Input: "Your a/c XX4521 debited by Rs 450.00 at SWIGGY on 15-03-26"
Output: {"amount": 45000, "merchant": "Swiggy", "type": "debit",
"date": "2026-03-15", "account": "4521"}
---
Input: "INR 12,500.00 credited to a/c ending 8834 - NEFT from ACME CORP"
Output: {"amount": 1250000, "merchant": "Acme Corp", "type": "credit",
"date": "...", "account": "8834"}
`;
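Whatever the model returns still needs a deterministic guard before it touches the ledger. A sketch of that validation, mirroring the fields the prompt asks for (the function name and checks are illustrative, not the project's actual code):

```typescript
interface ParsedTransaction {
  amount: number;   // paise
  merchant: string;
  type: "debit" | "credit";
  date: string;     // ISO 8601
  account: string;  // last 4 digits
}

// Validate the model's raw JSON output against the spec's contract.
// Throws on any field that violates it, rather than writing bad data.
function validateParse(raw: string): ParsedTransaction {
  const data = JSON.parse(raw);
  if (!Number.isInteger(data.amount) || data.amount < 0)
    throw new Error(`invalid amount: ${data.amount}`);
  if (typeof data.merchant !== "string" || data.merchant.length === 0)
    throw new Error("missing merchant");
  if (data.type !== "debit" && data.type !== "credit")
    throw new Error(`invalid type: ${data.type}`);
  if (!/^\d{4}-\d{2}-\d{2}$/.test(data.date))
    throw new Error(`invalid date: ${data.date}`);
  if (!/^\d{4}$/.test(data.account))
    throw new Error(`invalid account: ${data.account}`);
  return data as ParsedTransaction;
}
```

Rejecting a parse loudly is far cheaper than silently corrupting a financial ledger.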
Stage 3: Categorization — A second Claude call categorizes the transaction. This is where it gets interesting, because categorization is subjective and personal. "Zomato" is "Food" for most people, but if you're buying groceries through Zomato Market, it might be "Groceries."
Learning from corrections
The categorization stage has a feedback loop. When the user corrects a category — "No, this Zomato charge was Groceries, not Food" — the correction is stored and injected into future categorization prompts as a user-specific override:
interface CorrectionStore {
merchant: string;
originalCategory: string;
correctedCategory: string;
context?: string; // "Zomato Market orders are Groceries"
timestamp: Date;
}
function buildCategoryPrompt(
transaction: Transaction,
corrections: CorrectionStore[]
): string {
const relevantCorrections = corrections.filter(
c => c.merchant.toLowerCase() === transaction.merchant.toLowerCase()
);
let prompt = BASE_CATEGORY_PROMPT;
if (relevantCorrections.length > 0) {
prompt += `\n\nUser corrections for ${transaction.merchant}:\n`;
for (const c of relevantCorrections) {
prompt += `- "${c.merchant}" should be "${c.correctedCategory}"`;
if (c.context) prompt += ` (${c.context})`;
prompt += "\n";
}
}
return prompt;
}
This isn't fine-tuning. It's prompt-time personalization — the model's behavior adapts to the user through context, not weight updates. The eval framework tests this too: given a set of corrections, does the model correctly apply them to future transactions from the same merchant?
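The eval for this is straightforward because the override logic is a pure function. A sketch, using a hypothetical `lookupOverride` helper that mirrors the merchant filter inside `buildCategoryPrompt`:

```typescript
interface Correction {
  merchant: string;
  correctedCategory: string;
}

// Hypothetical helper: return the user's corrected category for a
// merchant, if one exists, using the same case-insensitive match
// as buildCategoryPrompt.
function lookupOverride(
  merchant: string,
  corrections: Correction[]
): string | undefined {
  const match = corrections.find(
    c => c.merchant.toLowerCase() === merchant.toLowerCase()
  );
  return match?.correctedCategory;
}
```

An eval case then asserts that a stored "Zomato is Groceries" correction applies to a new Zomato transaction, and that unrelated merchants are untouched.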
The eval framework in detail
The eval framework runs as a CLI command: npm run eval. It executes every spec case, compares outputs field-by-field, and produces a structured report:
Expense Tracker Eval — 2026-03-20
Parsing accuracy:
amount: 46/47 (97.9%) ← 1 failure: Indian comma notation edge case
merchant: 44/47 (93.6%) ← 3 failures: abbreviation expansion
type: 47/47 (100%)
date: 45/47 (95.7%) ← 2 failures: ambiguous DD-MM vs MM-DD
account: 47/47 (100%)
Categorization accuracy:
base: 41/47 (87.2%)
w/corrections: 46/47 (97.9%) ← corrections fix 5 edge cases
Overall: 44.3/47 average field accuracy (94.3%)
Regression from last run: none
New failures: none
Every field has its own accuracy metric. Every run is compared to the previous run. Regressions are flagged automatically. This gives you a single number — 94.3% — that you can watch move up over time, with full drill-down into what's failing and why.
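The per-field scoring behind that report can be sketched as a small pure function (the field names come from the spec; the result shape is an assumption about how the harness records each case):

```typescript
type FieldName = "amount" | "merchant" | "type" | "date" | "account";

interface CaseResult {
  // true if the parsed value matched the spec's expected value
  fields: Record<FieldName, boolean>;
}

// Compute accuracy per field across all spec cases, as a fraction 0..1.
function fieldAccuracy(results: CaseResult[]): Record<FieldName, number> {
  const names: FieldName[] = ["amount", "merchant", "type", "date", "account"];
  const report = {} as Record<FieldName, number>;
  for (const name of names) {
    const passed = results.filter(r => r.fields[name]).length;
    report[name] = passed / results.length;
  }
  return report;
}
```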
What spec-driven development looks like in practice
The workflow for adding support for a new bank:
1. Collect 5-10 sample emails from the bank
2. Write spec entries for each email format (expected parse output)
3. Run the eval — all new cases will fail (they're new)
4. Update the prompt with examples for the new bank's format
5. Run the eval — new cases should pass, existing cases shouldn't regress
6. If anything regresses, fix the prompt and repeat from step 5
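The regression check at the end of this loop reduces to comparing two runs. A sketch of the idea, assuming each run is recorded as a map from case id to pass/fail:

```typescript
// Flag regressions: any case that passed in the previous run but
// fails in the current one. The real harness would persist these
// maps between invocations (e.g. as JSON).
function findRegressions(
  previous: Map<string, boolean>,
  current: Map<string, boolean>
): string[] {
  const regressions: string[] = [];
  for (const [caseId, passedBefore] of previous) {
    if (passedBefore && current.get(caseId) === false) {
      regressions.push(caseId);
    }
  }
  return regressions;
}
```

New cases that fail are expected while you iterate; a regression on a previously passing case is what blocks the change.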
This is mechanical. There's no guesswork about whether the change worked. The eval tells you exactly what improved, what regressed, and what's still failing. It turns AI development from an art into an engineering discipline.
The broader pattern
The expense tracker is a small project, but it demonstrates a methodology that scales to much larger systems. The principle is simple: define "correct" before you build "working."
Every AI feature has a correctness surface — the set of inputs and expected outputs that define what the feature should do. Building the eval harness first means you're always developing against that surface, not against your intuition about whether things look right.
This is what I mean by spec-driven development. It's not about writing documents. It's about writing executable specifications that hold your AI system accountable to concrete, measurable standards of correctness. The spec is the test suite. The test suite is the spec. Everything else is implementation detail.