Building LLM Agents - Giving Models the Ability to Act, Evaluate, and Iterate // Megha Bose

Helpful context:

A language model that responds to a single prompt is powerful. A language model that can take actions, observe results, evaluate whether it succeeded, and try again is qualitatively different. It can write code and run it. It can search the web, read a result, and decide what to search next. It can draft a document, critique its own draft, and revise until it meets a quality bar. This is the shift from “model” to “agent.”

Building agents has become a core ML engineering skill because the gap between what a model knows and what a model can do depends entirely on what tools and loops you wrap around it. The model itself is fixed. The agent architecture determines whether it can use that knowledge to accomplish real tasks.

The Core Idea: Action-Observation Loops

A pure language model takes a context and returns a completion. An agent adds a loop: it can decide to take an action (call a tool, run code, query a database), observe the result, and continue reasoning with that new information. The loop runs until the agent produces a final answer or hits a stopping condition.

This loop is the fundamental building block. Everything else - planning, memory, multi-agent coordination, quality evaluation - is a different way of structuring or extending this loop.

The simplest instantiation: give the model a list of tools with descriptions, let it choose which tool to call and with what arguments, execute the tool, append the result to the context, and repeat. This is essentially the pattern behind OpenAI’s function calling, Anthropic’s tool use, and the agent abstractions in open-source frameworks such as LangChain and LlamaIndex, although each differs in message format and API details.

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information",
        "parameters": {"query": "string"}
    },
    {
        "name": "run_python",
        "description": "Execute Python code and return stdout",
        "parameters": {"code": "string"}
    }
]

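# The loop. `llm` and `execute_tool` stand in for your model client and tool dispatcher.
MAX_STEPS = 20  # hard cap so a confused model cannot loop forever
messages = [{"role": "user", "content": user_task}]
for _ in range(MAX_STEPS):
    response = llm.complete(messages, tools=tools)
    if response.finish_reason != "tool_call":
        break  # final answer
    # Record the model's tool request, then the tool's result, so the next turn sees both.
    messages.append(response)
    tool_result = execute_tool(response.tool_call)
    messages.append({"role": "tool", "content": tool_result})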

The key design decision embedded here: the model chooses when to stop. It can make 0 tool calls (simple question) or 20 (complex research task) depending on what the task requires.
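As a minimal sketch of this loop that runs without any API, the example below swaps the real model for a scripted stand-in. `Reply`, `ScriptedModel`, `run_agent`, and the `max_steps` cap are illustrative names assumed for this sketch, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    """One model turn: either a tool call or a final answer."""
    content: str = ""
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None


class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays canned replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def complete(self, messages, tools):
        return next(self._replies)


def run_agent(model, tools: dict, task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model.complete(messages, tools)
        if reply.tool_name is None:
            return reply.content  # the model chose to stop
        # Execute the requested tool and append the result to the context.
        result = tools[reply.tool_name](**(reply.tool_args or {}))
        messages.append({"role": "assistant", "content": f"call {reply.tool_name}"})
        messages.append({"role": "tool", "content": str(result)})
    return "Stopped: step limit reached"


tools = {"add": lambda a, b: a + b}
model = ScriptedModel([
    Reply(tool_name="add", tool_args={"a": 2, "b": 3}),
    Reply(content="The sum is 5."),
])
answer = run_agent(model, tools, "What is 2 + 3?")
print(answer)
```

The model still decides when to stop; the step cap only bounds the worst case.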

ReAct: Interleaving Thought and Action

The ReAct pattern (Yao et al., 2022) structures the loop explicitly: at each step, the model produces a Thought (reasoning about what to do), an Action (tool call with arguments), and receives an Observation (tool result). This chain continues until the model produces a final answer.

Thought: The user wants to know the population of the capital of Japan. I need to identify the capital first.
Action: search_web("capital of Japan")
Observation: Tokyo is the capital of Japan.

Thought: Now I need the population of Tokyo.
Action: search_web("population of Tokyo 2024")
Observation: Tokyo's population is approximately 13.96 million (city proper) or 37 million (greater metropolitan area).

Thought: I have enough information to answer.
Answer: The capital of Japan, Tokyo, has a city population of about 14 million and a metro population of about 37 million.

The Thought step is crucial and often underappreciated. It makes the model’s reasoning visible (useful for debugging), and it helps the model plan its next action better by explicitly stating what it has learned and what it still needs. Without the thought step, models tend to take actions more impulsively - calling a tool without reasoning about whether the result will actually help.

Why ReAct works better than pure chain-of-thought: chain-of-thought generates a long reasoning trace from the model’s parametric memory alone. For tasks requiring current information, arithmetic, or code execution, that memory is unreliable. ReAct grounds each reasoning step with real observations, correcting the model’s course when its initial assumptions are wrong.
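One way to drive a ReAct trace from plain text is to parse the Action and Answer lines with regular expressions. The sketch below assumes the exact line format shown in the trace above and a single quoted string argument per action; `react_loop`, `generate`, and the `lookup` tool are hypothetical names, and `generate` is a scripted stand-in for a model call.

```python
import re

ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$', re.MULTILINE)
ANSWER_RE = re.compile(r"^Answer:\s*(.+)$", re.MULTILINE)


def react_loop(generate, tools, question, max_steps=5):
    """generate(transcript) returns the next Thought plus an Action or Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript)
        transcript += step + "\n"
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip(), transcript
        action = ACTION_RE.search(step)
        if action is None:
            # Tell the model what went wrong instead of failing silently.
            transcript += "Observation: could not parse an Action line.\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return None, transcript


# Scripted model output, for illustration only.
steps = iter([
    'Thought: I need the capital first.\nAction: lookup("capital of Japan")',
    "Thought: I have enough information.\nAnswer: Tokyo",
])
facts = {"capital of Japan": "Tokyo is the capital of Japan."}
tools = {"lookup": lambda q: facts.get(q, "no result")}
answer, transcript = react_loop(lambda t: next(steps), tools, "What is the capital of Japan?")
print(transcript)
```

Production systems usually rely on structured tool-call output rather than regex parsing, but the text form makes the Thought, Action, Observation rhythm easy to inspect.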

Working With Different Types of Data

Agents become dramatically more capable when they can work across data modalities. The tool interface makes this possible without changing the core loop - you simply add tools that can handle different data types.

Structured data (databases, spreadsheets): give the agent a tool to execute SQL or pandas operations. The agent reads a schema description, plans a query, executes it, and interprets the results. This is far more reliable than asking the model to “remember” the structure of a table from its training data.

def query_database(sql: str) -> str:
    """Execute SQL and return results as a formatted string"""
    try:
        results = db.execute(sql).fetchall()
        return format_as_table(results)
    except Exception as e:
        return f"SQL error: {e}"  # agent can see and fix the error
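For a version that runs end to end, here is a sketch of the same tool against an in-memory SQLite database using Python's standard `sqlite3` module. The `make_query_tool` factory, the `sales` table, and the `max_rows` cap are assumptions made for this illustration.

```python
import sqlite3


def make_query_tool(conn: sqlite3.Connection, max_rows: int = 20):
    """Build a query tool bound to one connection (hypothetical helper)."""

    def query_database(sql: str) -> str:
        try:
            cursor = conn.execute(sql)
            rows = cursor.fetchmany(max_rows)  # cap output so results stay small
        except sqlite3.Error as e:
            return f"SQL error: {e}"  # the agent can read this and fix its query
        if cursor.description is None:
            return "OK (no rows returned)"
        header = " | ".join(col[0] for col in cursor.description)
        lines = [" | ".join(str(v) for v in row) for row in rows]
        return "\n".join([header, *lines])

    return query_database


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("EU", 120), ("US", 200)])
query_database = make_query_tool(conn)

ok = query_database("SELECT region, revenue FROM sales ORDER BY revenue DESC")
bad = query_database("SELECT revenu FROM sales")  # misspelled column on purpose
print(ok)
print(bad)
```

In a real deployment the connection would typically be read-only, since this sketch executes whatever SQL it is given.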

Code execution: a code interpreter tool lets the agent perform precise computation, generate plots, run tests, and debug. This is how you get reliable arithmetic and data analysis from models that struggle with long calculations in pure text.
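A code interpreter tool can be sketched with the standard `subprocess` module. The function below is an assumption-laden illustration: `run_python` and `timeout_s` are names chosen here, and a child process gives a timeout and a clean failure message but is not a security sandbox.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> str:
    """Run code in a separate interpreter process and return its output.

    NOTE: this isolates crashes and hangs only. Untrusted code still needs
    a real sandbox such as a locked-down container.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"Error: timed out after {timeout_s} seconds"
    if proc.returncode != 0:
        # Return stderr so the agent can read the traceback and revise.
        return f"Error (exit code {proc.returncode}):\n{proc.stderr.strip()}"
    return proc.stdout.strip()


result = run_python("print(sum(i * i for i in range(1, 101)))")
print(result)
```

Delegating the sum of squares to the interpreter gives an exact result instead of relying on the model's in-text arithmetic.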

Documents and knowledge bases: a retrieval tool lets the agent fetch relevant content on demand rather than depending on the context window to hold everything. The agent decides what to retrieve based on what it currently needs - a form of active memory management.

APIs and external services: agents can call any service with a well-described interface. A weather API, a payment processor, a calendar system - the agent needs only the tool description to use them correctly.

Multimodal: modern models can reason over images, audio, and video within the same context window. An agent working on a design task might take a screenshot, describe what it sees, modify code, re-screenshot, and compare - all within one loop.

Self-Evaluation and Quality Gates

The most important pattern for reliable agents: build evaluation into the loop. Instead of hoping the model produces a good answer on the first try, make the agent evaluate its own output and iterate until it meets a quality bar.

Code agents are the clearest example. The agent writes code, runs it, reads any errors or test failures, and revises. The evaluation is automatic: the code either passes the tests or it doesn’t. Human judgment is not required in the loop.

def coding_agent(task: str, tests: str, max_attempts: int = 5) -> str:
    code = ""
    feedback = ""
    for attempt in range(max_attempts):
        # Show the model its previous attempt and why it failed.
        prompt = f"Write Python code for: {task}"
        if code:
            prompt += f"\n\nPrevious attempt:\n{code}\n\nTest failures:\n{feedback}"
        code = llm.complete(prompt)
        result = run_tests(code, tests)
        if result.passed:
            return code
        feedback = result.errors  # fed back to the model on the next attempt
    return code  # every attempt failed: return the last one

Research and writing agents need a different kind of evaluator. One pattern: use the model itself as the evaluator. After producing a draft, the agent runs a second prompt: “Does this answer the user’s question completely and accurately? What is missing or wrong?” This critic output becomes the input for the next revision.
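The draft, critique, revise cycle can be sketched as a small loop. In the example below, `refine`, `draft_fn`, `critique_fn`, and `revise_fn` are hypothetical names, and the three functions are deterministic stand-ins for what would be separate model prompts.

```python
def refine(draft_fn, critique_fn, revise_fn, task, max_rounds=3):
    """Generate a draft, then revise until the critic reports no issues."""
    draft = draft_fn(task)
    for _ in range(max_rounds):
        issues = critique_fn(task, draft)
        if not issues:
            break  # the critic is satisfied
        draft = revise_fn(task, draft, issues)
    return draft


# Stand-ins for LLM calls, for illustration only.
def draft_fn(task):
    return "Tokyo has about 14 million people."


def critique_fn(task, draft):
    return [] if "metro" in draft else ["Mention the metro-area population."]


def revise_fn(task, draft, issues):
    return draft + " The metro area has about 37 million."


final = refine(draft_fn, critique_fn, revise_fn, "Describe Tokyo's population.")
print(final)
```

The round limit matters: a critic that always finds something to improve would otherwise keep the loop running indefinitely.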

Using a stronger model as evaluator: a common production pattern is to generate with a fast, cheap model and evaluate with a slower, more capable one. The evaluator checks for accuracy, completeness, and safety before the result is returned to the user. This keeps most of the generation cost at the smaller model’s level while the larger model enforces a quality floor, at the price of one extra evaluation call per response.

The key engineering insight: quality gates should be checkable. Define exactly what “good enough” means before building the agent. If you cannot define the quality bar programmatically (or with a reliable LLM judge), the agent cannot iterate toward it.
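A checkable quality bar can be as simple as a list of small functions that each return a failure message or nothing. The sketch below is one possible shape; `max_words`, `must_mention`, and `run_gate` are names invented for this example.

```python
from typing import Callable, Optional

# Each check returns None on success or a short, actionable failure message.
Check = Callable[[str], Optional[str]]


def max_words(limit: int) -> Check:
    def check(text: str) -> Optional[str]:
        n = len(text.split())
        return None if n <= limit else f"too long: {n} words (limit {limit})"
    return check


def must_mention(term: str) -> Check:
    def check(text: str) -> Optional[str]:
        found = term.lower() in text.lower()
        return None if found else f"missing required term: {term}"
    return check


def run_gate(text: str, checks: list) -> list:
    """Return all failure messages; an empty list means the gate passed."""
    failures = []
    for check in checks:
        message = check(text)
        if message is not None:
            failures.append(message)
    return failures


gate = [max_words(12), must_mention("revenue")]
failures = run_gate("Q3 profit grew in every region.", gate)
print(failures)
```

The failure messages double as feedback for the next revision, so the same gate both decides when to stop and tells the agent what to fix.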

Memory: What the Agent Knows and Remembers

An agent’s context window is its working memory - everything it currently knows. But context windows are finite, and tasks can span many steps. Agents need strategies for what to keep in context and what to offload.

In-context memory is everything in the current conversation. It is fast and directly accessible, but limited in size. Long agent runs accumulate tool outputs, intermediate reasoning, and revisions that eventually overflow the context. Strategies: summarize earlier steps before adding new ones, or maintain a “scratchpad” of key facts that gets updated and compressed.
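The summarize-earlier-steps strategy can be sketched as a compaction function. This example assumes a character count as a rough proxy for tokens and a caller-supplied `summarize` function (in practice a model call); `compact`, `budget_chars`, and `keep_recent` are hypothetical names.

```python
def compact(messages, budget_chars, summarize, keep_recent=2):
    """Fold older messages into one summary when the context exceeds the budget.

    The first message (the original task) and the most recent messages are
    kept verbatim; everything in between is replaced by a summary.
    """
    total = sum(len(m["content"]) for m in messages)
    split = len(messages) - keep_recent
    if total <= budget_chars or split <= 1:
        return messages
    head, old, recent = messages[0], messages[1:split], messages[split:]
    summary = {
        "role": "system",
        "content": "Summary of earlier steps: " + summarize(old),
    }
    return [head, summary, *recent]


messages = [
    {"role": "user", "content": "Find the cheapest flight to Tokyo."},
    {"role": "tool", "content": "A" * 500},
    {"role": "tool", "content": "B" * 500},
    {"role": "assistant", "content": "Comparing two fares."},
    {"role": "tool", "content": "Fare X: $700"},
]
compacted = compact(
    messages,
    budget_chars=300,
    summarize=lambda ms: f"{len(ms)} tool results reviewed",  # stand-in summarizer
)
print([m["content"][:40] for m in compacted])
```

Keeping the original task verbatim at the head of the context also guards against the agent losing track of its goal as older steps are compressed.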

External memory (retrieval) stores information outside the context and retrieves it on demand. A vector database of past conversations, a key-value store of facts the agent has learned, or a structured database of domain knowledge. The agent retrieves relevant chunks as needed using semantic search or exact lookup.
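To make the retrieve-on-demand idea concrete, here is a toy memory store. It ranks entries by bag-of-words cosine similarity, which is a deliberately simple stand-in for the embedding-based semantic search a vector database would provide; `MemoryStore` and its methods are names assumed for this sketch.

```python
import math
import re
from collections import Counter


def _vec(text):
    """Lowercased word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    def __init__(self):
        self._items = []

    def add(self, text):
        self._items.append((text, _vec(text)))

    def search(self, query, k=2):
        """Return up to k stored texts with nonzero similarity, best first."""
        q = _vec(query)
        scored = [(_cosine(q, vec), text) for text, vec in self._items]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = MemoryStore()
store.add("Refunds are issued within 14 days of purchase.")
store.add("The API rate limit is 100 requests per minute.")
store.add("Support hours are 9am to 5pm on weekdays.")
hits = store.search("what is the api rate limit", k=2)
print(hits)
```

Exposed as a tool, `search` lets the agent pull in only the facts relevant to its current step rather than carrying the whole store in context.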

Episodic memory is a record of what the agent has done - actions taken, results observed, decisions made. This is essential for long-horizon tasks where the agent needs to avoid revisiting dead ends or redoing steps that have already succeeded. A simple implementation: maintain a log and include a summary in each new prompt.
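The log-plus-summary approach might look like the sketch below. `Episode` and `EpisodicLog` are hypothetical names, and the summary format is just one reasonable choice.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    action: str
    outcome: str
    succeeded: bool


@dataclass
class EpisodicLog:
    episodes: list = field(default_factory=list)

    def record(self, action, outcome, succeeded):
        self.episodes.append(Episode(action, outcome, succeeded))

    def already_tried(self, action):
        """True if this exact action was attempted before."""
        return any(e.action == action for e in self.episodes)

    def summary(self, last_n=5):
        """Compact text block to prepend to the next prompt."""
        lines = []
        for e in self.episodes[-last_n:]:
            status = "ok" if e.succeeded else "failed"
            lines.append(f"- [{status}] {e.action}: {e.outcome}")
        return "Steps so far:\n" + "\n".join(lines)


log = EpisodicLog()
log.record("search docs", "no results", succeeded=False)
log.record("query database", "found 3 rows", succeeded=True)
print(log.summary())
```

Checking `already_tried` before executing an action gives the agent a cheap way to skip known dead ends.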

Procedural memory is the agent’s knowledge of how to use its tools effectively - which tools work well for which situations, common error patterns and fixes, successful strategies. This can be encoded in the system prompt, retrieved from a few-shot example store, or fine-tuned into the model.

Planning: Breaking Hard Tasks Into Steps

Simple tasks can be handled by a reactive loop (act, observe, respond). Complex tasks require planning: the agent needs to figure out what steps to take before taking them, because early choices constrain later options.

Plan-and-execute: the agent first generates a complete plan (a list of steps), then executes each step in sequence. This makes reasoning explicit and allows the user to inspect or correct the plan before execution.

Plan:
1. Fetch current exchange rates for USD/EUR and USD/JPY
2. Download the company's Q3 revenue data
3. Convert all figures to USD
4. Generate a bar chart
5. Write a two-paragraph summary

Executing step 1...
Executing step 2...

Replanning: pure plan-and-execute fails when the world does not match the agent’s expectations. A better pattern: execute the plan step by step, but after each step, check whether the plan needs to be revised based on what was actually observed. If step 3 reveals an unexpected data format, replan steps 4-5 accordingly.
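Execute-then-check can be sketched as a loop that gives a replanning function a chance to rewrite the remaining steps after every observation. Here `execute_with_replanning`, `run_step`, and `replan` are hypothetical names, and both callbacks are scripted stand-ins for tool execution and a model call.

```python
def execute_with_replanning(plan, run_step, replan, max_replans=3):
    """Run steps in order; after each one, `replan` may revise what remains.

    replan(done, remaining) returns a new list of remaining steps, or None
    to keep the current plan.
    """
    remaining = list(plan)
    done = []
    replans = 0
    while remaining:
        step = remaining.pop(0)
        observation = run_step(step)
        done.append((step, observation))
        revised = replan(done, remaining)
        if revised is not None and replans < max_replans:
            remaining = list(revised)
            replans += 1
    return done


def run_step(step):
    # Scripted surprise: the download arrives in an unexpected format.
    return "got XLSX, expected CSV" if step == "download data" else "ok"


def replan(done, remaining):
    _, observation = done[-1]
    if "XLSX" in observation:
        return ["convert XLSX to CSV", *remaining]
    return None


done = execute_with_replanning(
    ["download data", "convert to USD", "generate chart"], run_step, replan
)
print([step for step, _ in done])
```

The cap on replans keeps a plan that never stabilizes from growing without bound.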

Hierarchical planning: for very complex tasks, the agent maintains a high-level plan (write a research report) and generates sub-plans for each step (outline, draft section 1, research claims in section 1, revise section 1). This prevents the context from being overwhelmed by the details of one sub-task.

Multi-Agent Patterns

Some tasks are parallelizable, some require specialization, and some are too large for a single agent’s context window. Multi-agent architectures address all three.

Parallelism: a coordinator agent decomposes a task into independent sub-tasks and spawns worker agents to handle each in parallel. A research agent might split “summarize 10 papers” into 10 parallel summarization agents, then aggregate the results. This reduces latency by running subtasks concurrently.
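A coordinator's fan-out and aggregate step can be sketched with the standard library's `ThreadPoolExecutor`, which suits workers that mostly wait on network calls. `fan_out` and `summarize_paper` are hypothetical names; the worker is a stub standing in for a worker agent.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(subtasks, worker, aggregate, max_workers=4):
    """Run worker on each subtask concurrently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of finish order.
        results = list(pool.map(worker, subtasks))
    return aggregate(results)


def summarize_paper(title):
    # Stand-in for a worker agent's model call.
    return f"Summary of {title}"


report = fan_out(
    ["Paper A", "Paper B", "Paper C"],
    worker=summarize_paper,
    aggregate=lambda parts: "\n".join(parts),
)
print(report)
```

Concurrency reduces wall-clock latency but not total token cost, so `max_workers` is also a useful lever for staying under API rate limits.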

Specialization: different agents have different system prompts, tools, and fine-tuning. A coding agent has access to code execution tools and a system prompt focused on software engineering. A data analysis agent has access to database tools and visualization libraries. A coordinator routes tasks to the appropriate specialist.

Self-critique with a separate agent: instead of asking the same model to generate and evaluate, use two separate agent instances - one generates, one critiques. The generator is optimized for creativity and completeness; the critic is optimized for accuracy and catching errors. Separating these roles reduces the “self-congratulatory” failure mode where the model approves its own output uncritically.

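Debate: two agents take opposing positions on a question and argue for their view. A judge agent evaluates the arguments. Some studies of multi-agent debate report accuracy gains on hard questions, plausibly because each model is forced to confront the strongest version of the opposing view. Treat the benefit as task-dependent and measure it on your own test set.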

Practical Engineering Considerations

Building reliable agents is not just about the model - it is about the engineering around it.

Determinism and retries: LLM outputs are stochastic. The same input can produce different tool calls or different plans across runs. Make your agent loops retry on errors, log all tool calls and results, and use deterministic evaluation where possible (run tests, not just ask the model “does this look right?”).

Cost and latency: each iteration of the loop costs tokens and time. A naively designed agent that loops 20 times on a simple task is expensive and slow. Set hard limits on iterations. Use cheap models for generation and expensive models for final evaluation. Cache tool results when the same query appears multiple times.
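Caching repeated tool calls can be done with `functools.lru_cache` when the tool is a pure function of its arguments. In the sketch below, `search_web` is a stub and the `calls` counter exists only to show that the second lookup never reaches the backend.

```python
from functools import lru_cache

calls = {"count": 0}


@lru_cache(maxsize=256)
def search_web(query: str) -> str:
    """Stub search tool; a real one would call an external API here."""
    calls["count"] += 1
    return f"results for {query!r}"


first = search_web("population of Tokyo")
second = search_web("population of Tokyo")  # served from the cache
print(calls["count"])
```

This only suits tools whose results can safely be reused. For time-sensitive data, a cache with an expiry is the safer choice.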

Tool design matters enormously: poorly designed tools cause agents to fail. A tool that returns a wall of text when the agent needed one number, or that errors out with cryptic messages, derails the agent’s reasoning. Design tools with agent use in mind: return structured, concise output; return informative error messages the agent can act on; validate inputs before executing.
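As an illustration of these guidelines, here is a small tool that validates its inputs, returns compact structured output, and words its errors so the agent knows how to fix the call. `convert_length` and its JSON shape are assumptions made for this sketch.

```python
import json

_METERS_PER_UNIT = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}


def convert_length(value, from_unit, to_unit) -> str:
    """Convert a length; always returns a small JSON object."""
    # Validate before executing, and say exactly what a valid input looks like.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return json.dumps({"ok": False, "error": "value must be a number, e.g. 3.5"})
    for name, unit in (("from_unit", from_unit), ("to_unit", to_unit)):
        if unit not in _METERS_PER_UNIT:
            allowed = ", ".join(sorted(_METERS_PER_UNIT))
            message = f"{name}={unit!r} is not supported; use one of: {allowed}"
            return json.dumps({"ok": False, "error": message})
    meters = value * _METERS_PER_UNIT[from_unit]
    result = meters / _METERS_PER_UNIT[to_unit]
    return json.dumps({"ok": True, "value": round(result, 4), "unit": to_unit})


good = convert_length(5, "km", "m")
bad = convert_length(5, "yards", "m")
print(good)
print(bad)
```

The error for `"yards"` lists the accepted units, so the agent can correct its next call without guessing.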

Safety and sandboxing: agents that execute code, call APIs, or modify databases need guardrails. Run code in isolated containers (for example, Docker) with network access disabled and CPU, memory, and time limits enforced. Require confirmation before irreversible actions (sending emails, making purchases, deleting data). Log all tool calls with inputs and outputs for auditing.
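A confirmation gate with an audit trail can be sketched as a thin wrapper around tool dispatch. The names `IRREVERSIBLE`, `guarded_call`, and `confirm` are assumptions for this example, and the tools are stubs.

```python
IRREVERSIBLE = {"send_email", "delete_rows"}  # actions that need human sign-off


def guarded_call(name, args, tools, confirm, audit_log):
    """Dispatch a tool call, asking `confirm` first for irreversible actions."""
    if name in IRREVERSIBLE and not confirm(name, args):
        result = "Blocked: user declined this action"
    else:
        result = tools[name](**args)
    # Log every call, including blocked ones, for later auditing.
    audit_log.append({"tool": name, "args": args, "result": result})
    return result


tools = {
    "send_email": lambda to, body: f"sent to {to}",
    "read_inbox": lambda: "2 unread",
}
audit_log = []


def deny_all(name, args):
    return False  # stand-in for a human who rejects every request


blocked = guarded_call(
    "send_email", {"to": "a@example.com", "body": "hi"}, tools, deny_all, audit_log
)
allowed = guarded_call("read_inbox", {}, tools, deny_all, audit_log)
print(blocked, "|", allowed)
```

Returning the block as an ordinary tool result lets the agent see the refusal and choose a different course rather than crashing.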

Failure modes to watch for:

Hallucinated tool calls: the agent invents arguments for a tool that don’t make sense. Add input validation to all tools.
Infinite loops: the agent keeps trying the same failing approach. Track which strategies have failed and include that in the prompt.
Context overflow: long runs fill the context with tool outputs. Summarize and prune aggressively.
Goal drift: over many steps, the agent loses track of the original task. Re-include the original task in each prompt explicitly.
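One cheap defense against the infinite-loop failure mode is to count identical tool calls and push back once a threshold is crossed. `RepeatGuard` below is a hypothetical helper; the warning it returns is meant to be appended to the context in place of running the tool again.

```python
import json
from collections import Counter


class RepeatGuard:
    """Flags a tool call that repeats the same arguments too many times."""

    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def check(self, tool_name, args):
        """Return None if the call may proceed, else a warning for the agent."""
        # sort_keys makes the key independent of argument order.
        key = tool_name + ":" + json.dumps(args, sort_keys=True)
        self._seen[key] += 1
        prior = self._seen[key] - 1
        if prior >= self.max_repeats:
            return (
                f"{tool_name} was already called with these arguments "
                f"{prior} times; try a different approach."
            )
        return None


guard = RepeatGuard(max_repeats=2)
outcomes = [guard.check("search_web", {"query": "q3 revenue"}) for _ in range(3)]
print(outcomes[-1])
```

Exact-match detection misses near-duplicates such as reworded queries, so it complements rather than replaces a hard iteration limit.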

Evaluating Agent Systems

Evaluating agents is harder than evaluating classifiers because agents produce sequences of actions, not single outputs.

Task completion rate: does the agent successfully complete the task? Define a clear success criterion upfront (tests pass, document meets quality rubric, data is accurate). Measure on a diverse test set.

Efficiency: how many tool calls, tokens, and dollars does the agent spend? A correct answer reached in 3 steps is better than one reached in 15. Track and optimize.

Error recovery rate: when the agent makes a mistake (wrong tool call, incorrect plan), does it recover? Log every time an error occurs and measure how often the agent self-corrects vs. gives up or compounds the error.
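The three metrics so far can be computed from simple per-run records. The `Run` fields and `summarize_runs` below are assumptions about what a logging layer might capture, and the sample runs are made-up values used only to exercise the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Run:
    succeeded: bool   # did the run meet the success criterion?
    tool_calls: int   # number of tool calls made
    errors: int       # mistakes observed during the run
    recovered: int    # how many of those mistakes the agent corrected


def summarize_runs(runs):
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    total_errors = sum(r.errors for r in runs)
    recovered = sum(r.recovered for r in runs)
    return {
        "completion_rate": sum(r.succeeded for r in runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in runs) / n,
        # Undefined when no errors occurred, so report None rather than 0.
        "error_recovery_rate": recovered / total_errors if total_errors else None,
    }


runs = [
    Run(succeeded=True, tool_calls=3, errors=1, recovered=1),
    Run(succeeded=False, tool_calls=15, errors=2, recovered=0),
    Run(succeeded=True, tool_calls=4, errors=0, recovered=0),
    Run(succeeded=True, tool_calls=6, errors=1, recovered=1),
]
metrics = summarize_runs(runs)
print(metrics)
```

Tracking these per task category, not just in aggregate, makes it easier to see which kinds of tasks drive cost or failures.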

Trajectory evaluation: use an LLM judge to score the agent’s reasoning at each step. Did it correctly interpret the tool output? Did it update its plan appropriately? This gives insight into where the reasoning breaks down before the final output fails.

Build a test suite of realistic tasks before deploying. Agents fail in subtle ways that only appear at the edge cases of real usage - incomplete plans, misread tool outputs, context overflows mid-task. A test suite is the only way to catch these systematically.

Concept	Key point
Agent loop	Action, observation, reasoning, repeat; the core pattern behind all agents
ReAct	Interleave Thought, Action, Observation; makes reasoning grounded and visible
Tool design	Tools are the interface to the world; poor tools break good agents
Self-evaluation	Build quality checks into the loop; iterate until the bar is met
Memory	In-context for fast access; external retrieval for scale; episodic for long tasks
Plan-and-execute	Generate a plan, then execute step by step with replanning on surprises
Multi-agent	Parallelism, specialization, and critique via separate agents
Evaluation	Task completion, efficiency, error recovery - all need explicit test suites
