
From Chatbot-With-Tools to Actual Agent: The Control Layer You're Missing

Author: Giorgos Lysandrou

Without control, you’ll keep getting chatbot failure modes, just with more expensive mistakes.

Executive Summary

Most AI teams chasing better agent performance reach for the same levers: bigger context windows, more documents, smarter prompts. This article argues that's the wrong instinct entirely. The missing ingredient isn't more information. It's control. A well-designed control layer is what separates an agent that works in a demo from one that works in production.

  • More information isn't the answer: Giving an AI agent a bigger memory, more documents, or a longer context window doesn't make it smarter — it just makes it slower and more expensive. The real gains come from teaching the agent to choose what it needs, when it needs it, rather than consuming everything at once.
  • Reliability comes from the loop, not the model: The difference between an agent that impresses in a demo and one that holds up in production isn't the quality of the AI — it's whether the system checks its own work. Agents that plan, act, observe, and verify at each step catch their own mistakes instead of confidently getting things wrong.
  • Control is the missing ingredient: Most AI agents today are essentially chatbots with extra steps — they lack any mechanism to know when they're on track, when to stop, or when to try a different approach. Adding a proper control layer — clear success criteria, structured state, and validation checks — is what turns an agent-shaped object into something you can actually trust.

What did you have for lunch yesterday?

You probably didn’t replay every memory you’ve ever had until you hit “yesterday + lunch”. You jumped straight to the part of your experience where those concepts live. That’s a useful mental model for building agents:

  • A giant context window is not memory.
  • A pile of retrieved documents is not understanding.
  • A long chain-of-thought is not reliability.

Those are ingredients. But the thing that makes an agent feel like an agent is the same thing that makes your brain not brute-force your entire life history: control.

A recent survey — Agentic Reasoning for Large Language Models — did a great job summarising (and naming) the shift many of us have been feeling while building: from reasoning inside the model to reasoning through interaction. This post isn’t a summary of that paper. It’s an attempt to translate the shift into practical system design:

If you build agents like chatbots-with-tools, you’ll keep getting chatbot failure modes — just with more expensive mistakes.

The Old Game vs The New Game

For a while, our default playbook for “make the model smarter” was essentially: better prompts, chain-of-thought, self-consistency / sampling-based improvements, and maybe some search.

ReAct was a hinge moment because it made “thought → action → observation” feel natural. But notice the implicit constraint: a lot of this still ends up as “one-shot inference, but with more tokens”. The survey’s framing is sharper: agentic reasoning emphasises scaling test-time interaction — turning inference into an iterative process where the model, memory, and environment all stay in the loop.

If you’ve built (or used) agents that feel impressive in demos but brittle in real workflows, this is for you.

The Accidental Agent and What a Lot of “Agents” Look Like Today

Let me describe a pattern I’ve seen a lot (and have definitely built myself):

  1. Take a good chat model
  2. Add a few tools (search, DB query, maybe code execution)
  3. Add RAG
  4. Add a “you are an autonomous agent” system prompt
  5. Wrap it all in a while-loop until it stops or times out
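To make it concrete, here is a minimal sketch of that loop in Python. The `llm` client, its `complete` call, and the `tools` registry are hypothetical stand-ins rather than any specific SDK; the point is the shape of the loop, not the API.

```python
SYSTEM_PROMPT = "You are an autonomous agent. Use tools to complete the task."

def accidental_agent(task: str, llm, tools: dict, max_steps: int = 20) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):                        # the while-loop, with a timeout
        reply = llm.complete(messages)                # hypothetical chat-completion call
        messages.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:                   # the model "decided" it is done
            return reply.text
        result = tools[reply.tool_call.name](**reply.tool_call.args)
        # Every observation is appended verbatim, so the context grows without bound.
        messages.append({"role": "tool", "content": str(result)})
    return messages[-1]["content"]                    # timed out: return whatever came last
```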

Congrats, you have an agent-shaped object. But it tends to fail in predictable ways:

  • Context bloat: every observation gets appended; prompts become archaeological layers.
  • Tool flailing: “wrong tool but confidently” becomes the default failure mode.
  • No stop conditions: it keeps going because it can, not because it should.
  • No grounding discipline: it doesn’t notice it’s wrong unless you force it to.
  • Memory = chat history: which is basically writing logs and calling it learning.

This is why “agents” often feel magical in demos and messy in production. Our experience putting agentic systems in production reflects this too: once you’re no longer evaluating a model but a system, the failure modes include navigation, tool hygiene, context pruning, and evaluation design — not just “did the model answer correctly”.

So the question becomes: what’s the intended agent?

Intended Agents In The Real World: Booking a Flight

To make this less abstract, here’s a toy workflow most people can picture: “Book me a flight from London to New York next Tuesday. Arrive before 6pm. Keep it under £900. Aisle seat.”

The Old Pattern: Chatbot-with-tools

A common “agent-shaped” implementation looks like:

  • Immediately retrieves a bunch of airline / travel policy docs (even if none are needed yet).
  • Calls a search tool, pastes a long list of results into the prompt, and “picks one”.
  • Books prematurely without verifying constraints (arrival time / baggage / seat / policy).
  • If it fails, it retries in a slightly different way — but without a clear notion of what changed or what it learned.

The failure mode isn’t that the model can’t reason — it’s that the system doesn’t control the workflow.

The Improved Pattern: The Agentic Loop

A more agentic version treats the task as an interactive process with explicit state and checks:

  • PLAN: restate constraints + list missing info (e.g., “which airport preference?” / “is 1 stop ok?”).
  • ACT: call flight search with a structured query (date window, arrival constraint, budget).
  • OBSERVE: store results in a compact state object (top 5 candidates with price/arrival/layovers), not a giant pasted blob.
  • UPDATE: refine query if constraints aren’t met (e.g., “arrival before 6pm is too strict — widen time window or raise budget?”).
  • VERIFY: run validators (“arrival < 18:00”, “price ≤ £900”, “policy compliant”, “seat selection available”).
  • STOP: only once the booking API returns a confirmation and all validators pass.
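Here’s an illustrative sketch of that loop. `search_flights`, `book`, and `llm_refine_query` are hypothetical placeholders rather than a real API; what matters is that the state is structured and the checks are explicit functions.

```python
from dataclasses import dataclass, field

@dataclass
class FlightState:
    query: dict                                       # structured search parameters
    candidates: list = field(default_factory=list)    # top-N options, not a pasted blob
    notes: list = field(default_factory=list)         # 1-3 bullet observations per step

def constraint_failures(option: dict) -> list[str]:
    """VERIFY: return the violated constraints (an empty list means it passes)."""
    failures = []
    if option["arrival_hour"] >= 18:
        failures.append("arrives after 18:00")
    if option["price_gbp"] > 900:
        failures.append("over £900 budget")
    if not option["aisle_seat_available"]:
        failures.append("no aisle seat")
    return failures

def book_flight(state: FlightState, max_rounds: int = 5):
    for _ in range(max_rounds):
        state.candidates = search_flights(state.query)[:5]    # ACT, then OBSERVE compactly
        for option in state.candidates:
            if not constraint_failures(option):                # VERIFY before acting
                confirmation = book(option)
                if confirmation.ok:                            # STOP only on confirmation
                    return confirmation
        # UPDATE: nothing passed, so refine the structured query and record why
        state.notes.append("no candidate met all constraints; widening search")
        state.query = llm_refine_query(state.query, state.candidates)
    return None                                                # budget hit: surface best-known state
```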

What changed is subtle but decisive. Retrieval is conditional (not a reflex), context is managed (state is structured, not accumulated), and verification is in the loop (not left to the user). Swap “booking a flight” for “creating a purchase order”, “issuing a refund”, “changing a production config”, or “shipping a PR” and it’s the same story: once the agent can act, the loop matters more than the prompt.

The Intended Agent: Explicit Context, Explicit State, Explicit Verification

The survey mentioned above organises agentic reasoning into three layers: foundational (planning/tool use/search), self-evolving (feedback + memory), and collective (multi-agent coordination).

But the deeper idea is: reasoning becomes the organising principle for planning, decision-making, and verification — not just generating a plausible chain-of-thought. That sounds abstract until you map it onto what changes in your architecture. There are three core points to remember:

1) Context Is a Resource, Not a Dumping Ground

A good agent shouldn’t treat retrieval as “always do it.” Retrieval is a decision, not a reflex.

Here’s a practical heuristic:

If your system retrieves on every turn, you haven’t built retrieval — you’ve built a context tax.

This shows up in real work all the time. When debugging a production incident, you don’t dump all logs into context; you decide which metrics/logs to fetch next based on your current hypothesis. That’s “agentic retrieval”. Here's a more concrete pattern:

  1. Decide if you need retrieval
  2. If yes: draft a query, fetch, skim, extract
  3. If evidence conflicts: fetch again
  4. Only then synthesise

This is also where “agentic RAG” starts to differ from traditional RAG: retrieval becomes a deliberate reasoning step, not a default pipeline stage.
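A sketch of what “retrieval as a decision” can look like in code follows; the `llm` client and `vector_store.search` are illustrative names, not a specific library.

```python
def answer(question: str, state: dict, llm, vector_store) -> str:
    # 1. Decide whether retrieval is needed at all.
    decision = llm.complete(
        f"Question: {question}\nKnown state: {state}\n"
        "Do you need external documents to answer? Reply NEED or SKIP, then a search query if NEED."
    )
    evidence = []
    if decision.text.startswith("NEED"):
        # 2. Draft a query, fetch, skim, extract; keep only what is relevant.
        query = decision.text.split("\n", 1)[-1].strip()
        hits = vector_store.search(query, top_k=5)
        evidence = [
            llm.complete(f"Extract only the claims relevant to: {question}\n\n{hit}").text
            for hit in hits
        ]
        # 3. If the extracted claims conflict, fetch again with a narrower query (omitted here).
    # 4. Only then synthesise.
    return llm.complete(f"Answer: {question}\nEvidence: {evidence}\nState: {state}").text
```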

2) State is Explicit (and Inspectable)

The moment you stop evaluating “a model” and start evaluating “a system”, state tracking and tracing start to matter.

One thing the industry has gotten more explicit about by now is observability for agent workflows. For example, OpenAI’s Agents SDK ships with built-in tracing and a Traces dashboard that records agent runs (generations, tool calls, handoffs, guardrails, custom events), specifically so you can debug and audit what happened step-by-step.

That’s not “nice to have.” It’s the difference between a system you can debug and a system you can only vibe-check.
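You don’t need a particular SDK to get the basics. A minimal, SDK-agnostic sketch of explicit, inspectable run state might look like this (the record shape is an assumption, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Step:
    kind: str          # "plan" | "tool_call" | "observation" | "validation"
    payload: dict      # structured content, e.g. {"tool": "flight_search", "args": {...}}
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class RunTrace:
    task: str
    steps: list = field(default_factory=list)

    def record(self, kind: str, **payload):
        self.steps.append(Step(kind=kind, payload=payload))

    def failed_validations(self) -> list:
        return [s for s in self.steps if s.kind == "validation" and not s.payload.get("passed")]

# After a run, failed_validations() tells you exactly where the loop went wrong,
# instead of leaving you to scroll through a transcript and vibe-check it.
```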

3) Verification Is Not Optional

The most actionable part of the survey, in my opinion, is how direct it is about feedback. It breaks feedback into three regimes: reflective feedback (generate → critique → revise), parametric adaptation (learning via fine-tuning / RL), and validator-driven feedback (retry until a validator passes).

Most teams should start with validator-driven feedback because it’s boring and effective. If you can write any validator, whether it’s a unit test, a schema check, a business rule or constraint (“no refunds above X without escalation”), or a factuality check (“citations required”), you can turn non-deterministic model output into something you can actually trust.
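A sketch of what that can look like: plain predicate functions plus a bounded retry loop. `llm_generate` and the specific rules below are hypothetical examples.

```python
def refund_validators() -> list:
    return [
        lambda out: set(out) >= {"amount", "customer_id", "reason"},       # schema check
        lambda out: out.get("amount", 0) <= 250 or out.get("escalated"),   # business rule
        lambda out: bool(out.get("citations")),                            # factuality: citations required
    ]

def generate_with_validators(prompt: str, validators: list, max_attempts: int = 3) -> dict:
    feedback = ""
    for _ in range(max_attempts):
        output = llm_generate(prompt + feedback)        # non-deterministic step (hypothetical call)
        failed = [i for i, check in enumerate(validators) if not check(output)]
        if not failed:
            return output                               # acceptance is deterministic
        feedback = f"\nThe previous attempt failed checks {failed}; fix them and try again."
    raise ValueError("no valid output within budget")   # escalate instead of guessing
```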

One of the “unknown unknown” shifts here is simple: in agent land, reliability often comes more from the loop than the model.

A Concrete Pattern: Plan → Act → Observe → Update

Here’s the smallest loop discipline I’ve found that reliably improves behaviour without training:

  • Operate in steps: Plan → Act → Observe → Update,
  • After each Act, summarise Observation in 1–3 bullets,
  • Stop when success criteria are satisfied or budget is hit; return best-known result + remaining uncertainties.

This isn’t about making the model verbose. It’s about making the system legible, and forcing “reality contact” at each step. A very engineer-relatable example is CI-style closed-loop grounding:

  • Plan: propose change list
  • Act: run tests / lint
  • Observe: parse failures
  • Update: patch and retry
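A minimal sketch of that loop, where the test run is real (`pytest` via subprocess) and `llm_propose_patch` / `apply_patch` are hypothetical stand-ins:

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    # Act: run the real test suite; the exit code is ground truth, not self-assessment.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def fix_until_green(task: str, max_rounds: int = 4) -> bool:
    for _ in range(max_rounds):
        passed, report = run_tests()                               # Observe: parse real failures
        if passed:
            return True                                            # Stop: the success criterion is external
        patch = llm_propose_patch(task, failure_report=report)     # Plan: propose a change list
        apply_patch(patch)                                         # Update: apply it and go around again
    return False                                                   # budget hit: surface the last report
```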

How To Spot If Your Agent Feels "Off"

A few questions that tend to expose accidental-agent designs:

“Does my agent choose what to retrieve, or do I always retrieve?”

If retrieval is unconditional, you’ll pay for it in latency, cost, context dilution, and a higher risk of garbage-in / garbage-out.

“Can my agent notice it’s wrong?”

If your agent’s only feedback signal is “the user gets annoyed,” you’re doing RL with human pain. A validator-driven retry loop is the cleanest way to give it a reality check.

“Is memory writeable, and does it get better over time?”

If your “memory” is just appending chat history, you’re basically writing logs. The survey’s memory framing is important: memory becomes a dynamically growing context that agents refine over time — not just a transcript.

Memory That Actually Helps

Logs tell you what happened; memory tells you what to do next time. Chat history is a transcript. Memory is an evolving policy about what’s worth carrying forward.

A practical starter is a tiny “lessons learned” table keyed by task type, tool, and failure mode, with a value recording what worked and what to avoid. The point isn’t to build a perfect knowledge graph. The point is to create compounding behaviour: memory + feedback turns agents from “stateless helpers” into systems that get better over time.
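A sketch of that table, using an in-memory dict purely for illustration (a real system would persist it and curate what gets written):

```python
from collections import defaultdict

class LessonStore:
    def __init__(self):
        # key: (task_type, tool, failure_mode) -> list of lessons
        self._lessons = defaultdict(list)

    def record(self, task_type: str, tool: str, failure_mode: str, lesson: str):
        self._lessons[(task_type, tool, failure_mode)].append(lesson)

    def relevant(self, task_type: str, tool: str) -> list:
        """Pull only the lessons that match the current task, not the whole transcript."""
        return [lesson
                for (t, tl, _), lessons in self._lessons.items()
                if t == task_type and tl == tool
                for lesson in lessons]

memory = LessonStore()
memory.record("booking", "flight_search", "timeout",
              "narrow the date window before retrying; month-wide queries time out")
# At planning time, inject memory.relevant("booking", "flight_search") into the prompt.
```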

Multi-agent: Minimum Viable Team, Not Agent Explosion

The temptation is to throw more agents at the problem, but this often multiplies coordination overhead. A good “minimum viable team” pattern:

  • Co-ordinator: decomposes + assigns
  • Executor: does tool calls / changes
  • Critic/evaluator: checks correctness/risk
  • Memory keeper: writes/curates lessons

If you can’t explain what each agent owns, you probably don’t need multiple agents yet.
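As a sketch of what that ownership can look like (the `coordinator`, `executor`, and `critic` objects and their methods are hypothetical, and `memory` is the lessons store from above):

```python
def run_team(task: str, coordinator, executor, critic, memory) -> list:
    subtasks = coordinator.decompose(task)                  # Co-ordinator: decomposes + assigns
    results = []
    for sub in subtasks:
        attempt = executor.execute(sub)                     # Executor: does tool calls / changes
        verdict = critic.review(sub, attempt)               # Critic/evaluator: checks correctness / risk
        if not verdict.ok:
            memory.record(sub.kind, attempt.tool, verdict.failure_mode, verdict.lesson)  # Memory keeper
            attempt = executor.execute(sub, hints=verdict.lesson)                        # one bounded retry
        results.append(attempt)
    return results
```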

Practical, Not Prescriptive Takeaways

If we actually buy the paradigm shift, we probably stop stuffing everything into prompts, treating failures as final outputs, and evaluating agents like chatbots. And we start treating agents as what they are: software systems where language is the control plane — and reliability comes from the loop.

Before you add another model, add another evaluation loop. Before you retrieve everything, make it conditional. Ship one validator before you ship ten. Treat memory like policy decisions, not a database. And when going multi-agent, start with two agents, not twenty. These aren't rules; they're the patterns that survived production.