# Building AI Agents with Claude: Lessons from Production

*Practical insights from building AI operations agents at scale — from system design to production reliability.*
After running AI agents in production at Hakimo AI, I've collected hard-won lessons about what makes them reliable, maintainable, and genuinely useful. Here's what I wish I'd known at the start.
## The Architecture That Worked
Our AI Operator handles security monitoring across 2000+ cameras. The core insight was treating the agent as an orchestrator, not a monolith.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

interface AgentConfig {
  model: "claude-opus-4-6" | "claude-sonnet-4-6";
  tools: Tool[];
  maxIterations: number;
  systemPrompt: string;
}

async function runAgent(
  config: AgentConfig,
  userMessage: string
): Promise<AgentResult> {
  const messages: Message[] = [{ role: "user", content: userMessage }];

  for (let i = 0; i < config.maxIterations; i++) {
    const response = await anthropic.messages.create({
      model: config.model,
      max_tokens: 4096, // required by the API
      system: config.systemPrompt,
      tools: config.tools,
      messages,
    });

    // The model finished without requesting a tool — we're done.
    if (response.stop_reason === "end_turn") {
      return { success: true, output: extractText(response) };
    }

    // Otherwise, run the requested tools and feed the results back.
    const toolResults = await executeTools(response.content);
    messages.push(
      { role: "assistant", content: response.content },
      { role: "user", content: toolResults }
    );
  }

  return { success: false, error: "Max iterations reached" };
}
```
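The loop leans on two helpers, `executeTools` and `extractText`, that aren't shown above. Here is a minimal sketch of what they might look like — the `toolHandlers` registry and the simplified block shapes are illustrative assumptions, not our exact implementation:

```typescript
// Simplified shapes mirroring Anthropic content blocks.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
};

// Hypothetical registry mapping tool names to handler functions.
const toolHandlers: Record<string, (input: unknown) => Promise<string>> = {
  queryLogs: async (input) => `logs matching ${JSON.stringify(input)}`,
};

// Run every tool_use block and wrap each result for the next turn.
async function executeTools(blocks: ContentBlock[]): Promise<ToolResultBlock[]> {
  const results: ToolResultBlock[] = [];
  for (const block of blocks) {
    if (block.type !== "tool_use") continue;
    const handler = toolHandlers[block.name];
    const content = handler
      ? await handler(block.input)
      : `Unknown tool: ${block.name}`;
    results.push({ type: "tool_result", tool_use_id: block.id, content });
  }
  return results;
}

// Concatenate the text blocks of a final response.
function extractText(response: { content: ContentBlock[] }): string {
  return response.content
    .filter((b): b is { type: "text"; text: string } => b.type === "text")
    .map((b) => b.text)
    .join("\n");
}
```

Returning an explicit "Unknown tool" result (rather than throwing) lets the model recover by picking a different tool on the next iteration.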
## The Plan-Execute Pattern
For complex investigations, a two-phase approach dramatically improved reliability:
- Planning phase — Claude generates a structured investigation plan
- Execution phase — Claude executes each step, collecting evidence
```typescript
async function investigateIncident(incidentId: string): Promise<Report> {
  // Phase 1: Plan
  const plan = await generatePlan(incidentId);

  // Phase 2: Execute each step
  const evidence: Evidence[] = [];
  for (const step of plan.steps) {
    const result = await executeStep(step);
    evidence.push(result);
  }

  // Phase 3: Synthesize
  return synthesizeReport(plan, evidence);
}
```
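The planning phase works best when the plan is a typed structure rather than free text, so malformed output fails before execution begins. A hedged sketch of what that validation might look like — the field names here are illustrative, not our exact schema:

```typescript
interface PlanStep {
  id: number;
  description: string;
  tool: string; // which diagnostic tool this step should use
}

interface Plan {
  goal: string;
  steps: PlanStep[];
}

// Parse the model's JSON plan, failing loudly on malformed output
// so a bad plan never reaches the execution phase.
function parsePlan(raw: string): Plan {
  const parsed = JSON.parse(raw) as Plan;
  if (!Array.isArray(parsed.steps) || parsed.steps.length === 0) {
    throw new Error("Plan has no steps");
  }
  return parsed;
}
```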
This reduced investigation time by 83% compared to our previous manual process.
## The 20+ Tool Problem
Our agent has over 20 diagnostic tools. Early on, we made every tool available in every request. This caused two problems:
- Claude would sometimes choose a slow tool when a fast one would do
- Large tool schemas inflated context windows unnecessarily
The fix was dynamic tool selection:
```typescript
function selectRelevantTools(context: IncidentContext): Tool[] {
  const always = [queryLogs, getAlertHistory];
  const conditional: Tool[] = [];

  if (context.hasCameraAlerts) conditional.push(getCameraFeed, checkPTZ);
  if (context.hasAccessEvents) conditional.push(queryAccessLogs);
  if (context.isHighSeverity) conditional.push(escalateToHuman);

  return [...always, ...conditional];
}
```
Key insight: The model doesn't need to see every tool. Give it the right tools for the context, not all tools always.
## Observability Is Non-Negotiable
Running agents without observability is flying blind. We use Opik for LLM tracing. Three metrics that matter most:
- Tool call success rate — Are tools returning useful results?
- Iteration count distribution — Is the agent getting stuck in loops?
- Token usage per investigation — Are costs predictable?
```typescript
import { track } from "opik";

const trackedAgent = track(runAgent, {
  name: "security-investigation",
  tags: ["production", "v2"],
  capture: ["input", "output", "tool_calls"],
});
```
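Whichever tracing backend you use, the three metrics above reduce to simple aggregations over logged runs. A sketch, assuming each run record carries its iteration count, token total, and per-tool outcomes (these field names are illustrative, not Opik's schema):

```typescript
interface RunRecord {
  iterations: number;
  tokens: number;
  toolCalls: { name: string; ok: boolean }[];
}

function summarize(runs: RunRecord[]) {
  // Tool call success rate across all runs.
  const calls = runs.flatMap((r) => r.toolCalls);
  const successRate = calls.filter((c) => c.ok).length / calls.length;

  // Iteration count histogram: a spike at maxIterations means the
  // agent is hitting the loop ceiling instead of finishing.
  const iterationHist = new Map<number, number>();
  for (const r of runs) {
    iterationHist.set(r.iterations, (iterationHist.get(r.iterations) ?? 0) + 1);
  }

  // Average token spend per investigation, for cost forecasting.
  const avgTokens = runs.reduce((s, r) => s + r.tokens, 0) / runs.length;

  return { successRate, iterationHist, avgTokens };
}
```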
## Auto-Scaling with KEDA
At 600 RPM peak, a single agent instance wasn't enough. We use KEDA to scale based on queue depth:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-operator
spec:
  scaleTargetRef:
    name: ai-operator-deployment
  triggers:
    - type: redis
      metadata:
        listName: incident-queue
        listLength: "5" # Scale up when >5 incidents queued
  minReplicaCount: 2
  maxReplicaCount: 20
```
## What I'd Do Differently
- Start with evals early. We added LLM evals six months in; the earlier you have them, the faster you can move without breaking things.
- Version your prompts like code. System prompt changes can have surprising effects. Treat them as deploys.
- Budget tokens explicitly. Set hard limits on context window usage per operation. Runaway contexts kill costs.
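For the token budget, one crude but effective guard is to estimate conversation size before each API call and trim the oldest tool exchanges when over budget. A sketch under stated assumptions: the 4-characters-per-token estimate is a rough heuristic, not real tokenization, and the budget value is illustrative:

```typescript
interface Msg { role: "user" | "assistant"; content: string }

const MAX_CONTEXT_TOKENS = 50_000; // hard per-operation budget (illustrative)

// Rough estimate: ~4 characters per token for English text.
function estimateTokens(messages: Msg[]): number {
  return Math.ceil(
    messages.reduce((sum, m) => sum + m.content.length, 0) / 4
  );
}

// Drop the oldest messages after the first (the original request)
// until the conversation fits the budget again.
function enforceBudget(messages: Msg[]): Msg[] {
  const trimmed = [...messages];
  while (estimateTokens(trimmed) > MAX_CONTEXT_TOKENS && trimmed.length > 2) {
    trimmed.splice(1, 1);
  }
  return trimmed;
}
```

Keeping the original request while evicting the oldest tool exchanges preserves the task framing at the cost of some investigation history, which in our experience is the right trade.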
Building AI agents for production is still more art than science. But the patterns above have held up across thousands of real incidents.