Agent demos are everywhere, but production systems need reliability, cost control, and audit trails. Here’s a blueprint we use to ship multi‑step, tool‑using agents that teams can trust.
Architecture at a glance
An agent system has four layers: the planner, tools, memory, and policies. The planner decides the next action deterministically: think of a state machine whose transitions are guided by model suggestions. Tools are executed through adapters with strict input/output schemas. Memory is split into a short‑term scratchpad and a long‑term store (for example, a vector database queried through retrieval‑augmented generation, RAG). Policies enforce limits and guardrails.
Tools with contracts
Wrap every tool—search, CRM, spreadsheets—with a schema. Validate inputs before execution and outputs after. Fail closed: if the output doesn’t match the schema, the step retries with a reduced scope, then escalates.
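As a rough illustration of the contract idea, here is a minimal Python sketch. The names ToolAdapter, check_schema, call_fail_closed, and reduce_scope are hypothetical, and the "schema" is just a mapping from field name to expected type; a real system would more likely use a dedicated schema library.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ContractError(Exception):
    """Raised when a tool's input or output violates its schema."""


class Escalation(Exception):
    """Raised when retries are exhausted and the caller should escalate."""


def check_schema(payload: dict[str, Any], schema: dict[str, type]) -> None:
    # Hypothetical minimal schema: exact field names mapped to expected types.
    if set(payload) != set(schema):
        raise ContractError(f"expected fields {sorted(schema)}, got {sorted(payload)}")
    for name, expected in schema.items():
        if not isinstance(payload[name], expected):
            raise ContractError(f"field {name!r} is not {expected.__name__}")


@dataclass
class ToolAdapter:
    name: str
    input_schema: dict[str, type]
    output_schema: dict[str, type]
    fn: Callable[[dict[str, Any]], dict[str, Any]]

    def call(self, args: dict[str, Any]) -> dict[str, Any]:
        check_schema(args, self.input_schema)      # validate before execution
        result = self.fn(args)
        check_schema(result, self.output_schema)   # validate after execution
        return result


def call_fail_closed(
    tool: ToolAdapter,
    args: dict[str, Any],
    reduce_scope: Callable[[dict[str, Any]], dict[str, Any]],
    max_attempts: int = 2,
) -> dict[str, Any]:
    """On a contract violation, retry with a reduced scope; then escalate."""
    for _ in range(max_attempts):
        try:
            return tool.call(args)
        except ContractError:
            args = reduce_scope(args)  # e.g. fewer results, narrower query
    raise Escalation(f"{tool.name} failed its contract {max_attempts} times")
```

The key design choice is that a malformed result never reaches the planner: the step either returns schema-valid data or raises, so downstream code can rely on the output shape.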
Planning and retries
Use a deterministic loop: propose → validate → act → observe → summarize. Retry failed tool calls with exponential backoff, and cap each run with a maximum budget in tokens and tool calls. Log each step with inputs, outputs, and costs; these logs power evaluation and incident review.
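The loop can be sketched in a few dozen lines of Python. Everything here is an assumption made for illustration: the Budget class, the run_agent signature, the action dictionary with a "tokens" estimate, and the callables propose, validate, and act that the caller supplies. The "summarize" step is reduced to recording the running cost in the log entry that is fed back to the planner.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Budget:
    max_tokens: int
    max_tool_calls: int
    tokens: int = 0
    tool_calls: int = 0

    def exhausted(self) -> bool:
        return self.tokens >= self.max_tokens or self.tool_calls >= self.max_tool_calls


def backoff_delays(base: float, retries: int) -> list[float]:
    # Exponential backoff: base, 2 * base, 4 * base, ...
    return [base * 2 ** i for i in range(retries)]


def run_agent(
    propose: Callable[[Optional[dict]], Optional[dict]],
    validate: Callable[[dict], bool],
    act: Callable[[dict], Any],
    budget: Budget,
    retries: int = 3,
    base_delay: float = 0.5,
    max_steps: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    log: list[dict] = []
    observation: Optional[dict] = None
    for _ in range(max_steps):
        if budget.exhausted():
            break
        action = propose(observation)              # propose (None means done)
        if action is None:
            break
        budget.tokens += action.get("tokens", 0)
        entry = {"action": action, "output": None, "status": "rejected"}
        if validate(action):                       # validate
            # First attempt runs immediately; retries wait progressively longer.
            for delay in [0.0] + backoff_delays(base_delay, retries):
                sleep(delay)
                budget.tool_calls += 1
                try:
                    entry["output"] = act(action)  # act
                    entry["status"] = "ok"
                    break
                except Exception as exc:
                    entry["status"] = f"error: {exc}"
                if budget.exhausted():
                    break
        entry["tokens_used"] = budget.tokens       # summarize: running cost
        entry["tool_calls_used"] = budget.tool_calls
        log.append(entry)                          # one log entry per step
        observation = entry                        # observe: feed result back
    return log
```

Passing sleep as a parameter keeps the loop testable without real delays, and the max_steps cap is an extra safety net in case a planner keeps proposing actions that cost nothing.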
Memory that helps, not hurts
Keep a scratchpad for the current task and a separate long‑term memory for reusable facts, retrieved on demand (RAG). Record which memories influenced each decision so that reviews are straightforward.
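One way to make memory influence traceable is to return the IDs of recalled items alongside every decision. The sketch below is illustrative only: Memory, MemoryItem, Decision, and decide are hypothetical names, and the word-overlap ranking is a stand-in for real vector search.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryItem:
    id: str
    text: str


@dataclass
class Memory:
    long_term: list[MemoryItem] = field(default_factory=list)  # reusable facts
    scratchpad: list[str] = field(default_factory=list)        # current task only

    def recall(self, query: str, k: int = 2) -> list[MemoryItem]:
        # Stand-in for vector search: rank by number of shared words.
        words = set(query.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self.long_term]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [m for _, m in ranked[:k]]


@dataclass
class Decision:
    action: str
    memory_ids: list[str]  # which memories influenced this decision


def decide(
    memory: Memory,
    query: str,
    choose: Callable[[str, list[MemoryItem]], str],
) -> Decision:
    used = memory.recall(query)
    ids = [m.id for m in used]
    memory.scratchpad.append(f"recalled {ids} for {query!r}")
    return Decision(action=choose(query, used), memory_ids=ids)
```

Because the decision carries its own memory_ids, a reviewer can look up exactly which stored facts were in play without replaying the run.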
Guardrails and evaluation
Allowlist domains and tools. Add automatic checks: unit tests over tool outputs, self‑consistency checks for critical decisions (sample the decision several times and require agreement), and human‑in‑the‑loop approval for high‑risk actions. Maintain a golden set of tasks and compare planner versions by success rate, cost, and latency.
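A minimal sketch of two of these pieces, the allowlist check and the golden-set comparison, might look like the following. The tool names, the example.com domain, and the RunResult fields are placeholders chosen for illustration, not a prescribed configuration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional
from urllib.parse import urlparse

# Hypothetical policy values; replace with your own.
ALLOWED_TOOLS = {"search", "crm_lookup", "crm_update"}
ALLOWED_DOMAINS = {"example.com"}
HIGH_RISK_TOOLS = {"crm_update"}


def check_action(tool: str, url: Optional[str] = None) -> str:
    """Return 'deny', 'needs_human', or 'allow' for a proposed action."""
    if tool not in ALLOWED_TOOLS:
        return "deny"
    if url is not None:
        host = urlparse(url).hostname or ""
        # Accept the domain itself or any subdomain, nothing else.
        if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
            return "deny"
    if tool in HIGH_RISK_TOOLS:
        return "needs_human"
    return "allow"


@dataclass
class RunResult:
    success: bool
    cost_usd: float
    latency_s: float


def summarize(results: list[RunResult]) -> dict[str, float]:
    """Aggregate one planner version's results over the golden set."""
    return {
        "success_rate": mean(1.0 if r.success else 0.0 for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
    }
```

Running summarize over the same golden tasks for two planner versions gives directly comparable numbers, which makes a regression in success rate or cost visible before a release.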
When to use agents
Prefer simple functions for single‑step tasks. Choose agents for workflows with branching, search, or orchestration across multiple systems—research, summarize, enrich, and update records, for example.
With these practices, agents become boring—in the best way. They do the job, stay within budget, and leave an audit trail when things go wrong.