# Building AI Agents with Claude: Lessons from Production

*Practical insights from building AI operations agents at scale — from system design to production reliability.*
After running AI agents in production at Hakimo AI, I've collected hard-won lessons about what makes them reliable, maintainable, and genuinely useful. Here's what I wish I'd known at the start.
## The Architecture That Worked
Our AI Operator handles security monitoring across 2000+ cameras. The core insight was treating the agent as an orchestrator, not a monolith.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

interface AgentConfig {
  model: "claude-opus-4-6" | "claude-sonnet-4-6";
  tools: Tool[];
  maxIterations: number;
  systemPrompt: string;
}

async function runAgent(
  config: AgentConfig,
  userMessage: string
): Promise<AgentResult> {
  const messages: Message[] = [{ role: "user", content: userMessage }];

  for (let i = 0; i < config.maxIterations; i++) {
    const response = await anthropic.messages.create({
      model: config.model,
      max_tokens: 4096, // required by the API
      system: config.systemPrompt,
      tools: config.tools,
      messages,
    });

    // The model finished without requesting a tool — we're done.
    if (response.stop_reason === "end_turn") {
      return { success: true, output: extractText(response) };
    }

    // Otherwise, run the requested tools and feed the results back.
    const toolResults = await executeTools(response.content);
    messages.push(
      { role: "assistant", content: response.content },
      { role: "user", content: toolResults }
    );
  }

  return { success: false, error: "Max iterations reached" };
}
```
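The loop leans on two helpers, `executeTools` and `extractText`, that aren't shown above. Here is a minimal sketch of what they might look like — the `toolHandlers` registry and the simplified block shapes are illustrative assumptions, not our exact implementation:

```typescript
// Simplified shapes mirroring Anthropic content blocks.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
};

// Hypothetical registry mapping tool names to handler functions.
const toolHandlers: Record<string, (input: unknown) => Promise<string>> = {
  queryLogs: async (input) => `logs matching ${JSON.stringify(input)}`,
};

// Run every tool_use block and wrap each result for the next turn.
async function executeTools(blocks: ContentBlock[]): Promise<ToolResultBlock[]> {
  const results: ToolResultBlock[] = [];
  for (const block of blocks) {
    if (block.type !== "tool_use") continue;
    const handler = toolHandlers[block.name];
    const content = handler
      ? await handler(block.input)
      : `Unknown tool: ${block.name}`;
    results.push({ type: "tool_result", tool_use_id: block.id, content });
  }
  return results;
}

// Concatenate the text blocks of a final response.
function extractText(response: { content: ContentBlock[] }): string {
  return response.content
    .filter((b): b is { type: "text"; text: string } => b.type === "text")
    .map((b) => b.text)
    .join("\n");
}
```

Returning an explicit "Unknown tool" result (rather than throwing) lets the model recover by picking a different tool on the next iteration.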
## The Plan-Execute Pattern
For complex investigations, a two-phase approach dramatically improved reliability:
- Planning phase — Claude generates a structured investigation plan
- Execution phase — Claude executes each step, collecting evidence
```typescript
async function investigateIncident(incidentId: string): Promise<Report> {
  // Phase 1: Plan
  const plan = await generatePlan(incidentId);

  // Phase 2: Execute each step
  const evidence: Evidence[] = [];
  for (const step of plan.steps) {
    const result = await executeStep(step);
    evidence.push(result);
  }

  // Phase 3: Synthesize
  return synthesizeReport(plan, evidence);
}
```
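The planning phase works best when the plan is a typed structure rather than free text, so malformed output fails before execution begins. A hedged sketch of what that validation might look like — the field names here are illustrative, not our exact schema:

```typescript
interface PlanStep {
  id: number;
  description: string;
  tool: string; // which diagnostic tool this step should use
}

interface Plan {
  goal: string;
  steps: PlanStep[];
}

// Parse the model's JSON plan, failing loudly on malformed output
// so a bad plan never reaches the execution phase.
function parsePlan(raw: string): Plan {
  const parsed = JSON.parse(raw) as Plan;
  if (!Array.isArray(parsed.steps) || parsed.steps.length === 0) {
    throw new Error("Plan has no steps");
  }
  return parsed;
}
```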
This reduced investigation time by 83% compared to our previous manual process.
## The 20+ Tool Problem
Our agent has over 20 diagnostic tools. Early on, we made every tool available in every request. This caused two problems:
- Claude would sometimes choose a slow tool when a fast one would do
- Large tool schemas inflated context windows unnecessarily
The fix was dynamic tool selection:
```typescript
function selectRelevantTools(context: IncidentContext): Tool[] {
  const always = [queryLogs, getAlertHistory];
  const conditional: Tool[] = [];

  if (context.hasCameraAlerts) conditional.push(getCameraFeed, checkPTZ);
  if (context.hasAccessEvents) conditional.push(queryAccessLogs);
  if (context.isHighSeverity) conditional.push(escalateToHuman);

  return [...always, ...conditional];
}
```
Key insight: The model doesn't need to see every tool. Give it the right tools for the context, not all tools always.
## Observability Is Non-Negotiable
Running agents without observability is flying blind. We use Opik for LLM tracing. Three metrics that matter most:
- Tool call success rate — Are tools returning useful results?
- Iteration count distribution — Is the agent getting stuck in loops?
- Token usage per investigation — Are costs predictable?
```typescript
import { track } from "opik";

const trackedAgent = track(runAgent, {
  name: "security-investigation",
  tags: ["production", "v2"],
  capture: ["input", "output", "tool_calls"],
});
```
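Whichever tracing backend you use, the three metrics above reduce to simple aggregations over logged runs. A sketch, assuming each run record carries its iteration count, token total, and per-tool outcomes (these field names are illustrative, not Opik's schema):

```typescript
interface RunRecord {
  iterations: number;
  tokens: number;
  toolCalls: { name: string; ok: boolean }[];
}

function summarize(runs: RunRecord[]) {
  // Tool call success rate across all runs.
  const calls = runs.flatMap((r) => r.toolCalls);
  const successRate = calls.filter((c) => c.ok).length / calls.length;

  // Iteration count histogram: a spike at maxIterations means the
  // agent is hitting the loop ceiling instead of finishing.
  const iterationHist = new Map<number, number>();
  for (const r of runs) {
    iterationHist.set(r.iterations, (iterationHist.get(r.iterations) ?? 0) + 1);
  }

  // Average token spend per investigation, for cost forecasting.
  const avgTokens = runs.reduce((s, r) => s + r.tokens, 0) / runs.length;

  return { successRate, iterationHist, avgTokens };
}
```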
## Auto-Scaling with KEDA
At 600 RPM peak, a single agent instance wasn't enough. We use KEDA to scale based on queue depth:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-operator
spec:
  scaleTargetRef:
    name: ai-operator-deployment
  triggers:
    - type: redis
      metadata:
        listName: incident-queue
        listLength: "5" # Scale up when >5 incidents queued
  minReplicaCount: 2
  maxReplicaCount: 20
```
## What I'd Do Differently
- Start with evals early. We added LLM evals six months in; the earlier you have them, the faster you can move without breaking things.
- Version your prompts like code. System prompt changes can have surprising effects. Treat them as deploys.
- Budget tokens explicitly. Set hard limits on context window usage per operation. Runaway contexts kill costs.
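For the token budget, one crude but effective guard is to estimate conversation size before each API call and trim the oldest tool exchanges when over budget. A sketch under stated assumptions: the 4-characters-per-token estimate is a rough heuristic, not real tokenization, and the budget value is illustrative:

```typescript
interface Msg { role: "user" | "assistant"; content: string }

const MAX_CONTEXT_TOKENS = 50_000; // hard per-operation budget (illustrative)

// Rough estimate: ~4 characters per token for English text.
function estimateTokens(messages: Msg[]): number {
  return Math.ceil(
    messages.reduce((sum, m) => sum + m.content.length, 0) / 4
  );
}

// Drop the oldest messages after the first (the original request)
// until the conversation fits the budget again.
function enforceBudget(messages: Msg[]): Msg[] {
  const trimmed = [...messages];
  while (estimateTokens(trimmed) > MAX_CONTEXT_TOKENS && trimmed.length > 2) {
    trimmed.splice(1, 1);
  }
  return trimmed;
}
```

Keeping the original request while evicting the oldest tool exchanges preserves the task framing at the cost of some investigation history, which in our experience is the right trade.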
Building AI agents for production is still more art than science. But the patterns above have held up across thousands of real incidents.