
One of the persistent frustrations with AI agent deployments is that they don't get better over time. An agent that handles its first customer support ticket with a certain level of competence handles its thousandth ticket at roughly the same level. Unlike a human colleague who builds intuition through experience, the agent starts fresh every time, limited to whatever its prompt and training data provide.

There's a simple technique that helps with this, and it's more effective than its simplicity would suggest.

The basic idea

After each significant task, you ask the agent to reflect on what happened. What went well? What would you do differently? What general principle did you learn that might apply to future tasks?

You store those reflections in a structured format and inject the relevant ones into the agent's context on future runs. Over time, the agent builds up a library of compressed lessons that inform how it approaches new tasks.
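
Concretely, a stored reflection can be as small as a record with three or four fields. Here's a minimal sketch in Python; the field names are one reasonable choice, not a fixed schema:

```python
# One reflection record; the field names are illustrative, not a standard.
from dataclasses import dataclass, field
import time

@dataclass
class Reflection:
    task_type: str    # coarse label used for retrieval, e.g. "refund request"
    observation: str  # what the agent noticed about this specific run
    principle: str    # the compressed, reusable lesson
    created_at: float = field(default_factory=time.time)
```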

This works because language models are quite good at pulling out generalizable principles from specific experiences when you explicitly ask them to. They're much less good at doing this on their own from raw logs or unstructured history. The reflection step forces the model to compress an experience into something reusable, and the structured storage makes it easy to retrieve later.

What the implementation looks like

The reflection prompt is simple. After a task completes, you append something like: "Before moving on, briefly note what went well, what you'd change, and one general principle for similar tasks." The agent produces a short reflection, usually a few sentences, that you store as a structured record with the task type, what the agent observed, and the principle it pulled out.
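
As a sketch, the whole step is one extra model call. The `llm` argument below stands in for whatever client you use (it takes a prompt and returns text), and the `PRINCIPLE:` labeling convention is an assumption I've added to make the output easy to parse; it builds on the `Reflection` record above:

```python
REFLECTION_PROMPT = (
    "Before moving on, briefly note what went well, what you'd change, "
    "and one general principle for similar tasks. "
    "Put the principle on its own line, prefixed with 'PRINCIPLE:'."
)

def reflect(llm, task_type: str, transcript: str) -> Reflection:
    """Run one reflection step over a completed task's transcript."""
    response = llm(f"{transcript}\n\n{REFLECTION_PROMPT}")
    # Pull out the labeled principle line; fall back to the full response
    # if the model ignored the formatting instruction.
    principle = next(
        (line.removeprefix("PRINCIPLE:").strip()
         for line in response.splitlines()
         if line.startswith("PRINCIPLE:")),
        response.strip(),
    )
    return Reflection(task_type=task_type, observation=response,
                      principle=principle)
```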

On later runs involving similar tasks, you pull the relevant reflections and include them in the context. Something like: "Based on previous experience: validate state before irreversible actions; explicit waits are more reliable than implicit ones; confirm totals before submitting." The agent incorporates these naturally, the same way a person might benefit from reading their own notes before starting a familiar task.
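
Injecting the lessons is just prepending that framing sentence; a sketch, reusing the record above:

```python
def build_context(reflections: list[Reflection], task_prompt: str) -> str:
    """Prepend stored lessons to a new task prompt."""
    if not reflections:
        return task_prompt
    lessons = "; ".join(r.principle for r in reflections)
    return f"Based on previous experience: {lessons}\n\n{task_prompt}"
```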

The key design decisions are what to reflect on (not every minor action; focus on significant tasks or failure points), how many reflections to store (20 to 30 of the most relevant keeps the context manageable), and how to retrieve them. Semantic similarity to the current task works well, but even simple keyword matching is fine for many use cases.
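
For the retrieval side, here's a sketch of the keyword-matching version. The scoring is deliberately crude (word overlap between the task description and each stored lesson), and the default cap of 30 follows the budget above:

```python
def retrieve(store: list[Reflection], task_description: str,
             k: int = 30) -> list[Reflection]:
    """Return up to k stored reflections that share words with the task."""
    task_words = set(task_description.lower().split())

    def score(r: Reflection) -> int:
        lesson_words = set((r.task_type + " " + r.principle).lower().split())
        return len(task_words & lesson_words)

    ranked = sorted(store, key=score, reverse=True)
    return [r for r in ranked[:k] if score(r) > 0]
```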

Some practical results

In one deployment with a customer support agent handling roughly 500 tickets per week, this pattern produced measurable improvements over four weeks. First-contact resolution improved from 62% to 79%, escalation rates dropped from 28% to 12%, and average handling time decreased from 4.2 minutes to 3.1 minutes. The underlying model didn't change during this period. The improvement came entirely from accumulated reflections being fed back into context.

These numbers come from a specific deployment and may not generalize to every situation, but they show the kind of improvement that's possible. The agent was learning from its own experience, not in the machine learning sense of updating weights, but in the practical sense of applying lessons from past interactions to new ones.

Why this works better than the obvious alternatives

The most common approach to making agents "learn" is to log everything and dump the logs into context, hoping the model will spot useful patterns. This tends to fail because raw logs are noisy, and context windows fill up with irrelevant detail before the model can find the signal. The reflection step does something important: it forces compression. Instead of giving the model ten pages of logs, you're giving it ten sentences of distilled insight.

You can extend the pattern further. Periodically asking the agent to review its last 10 to 20 reflections and identify themes produces a useful meta-level of learning. The agent might notice that it's consistently underestimating how long certain tasks take, or that it's being overly cautious with error handling. You can also share reflections between agents working on similar tasks, giving a new agent the benefit of another agent's experience. These are natural extensions that build on the same basic idea.
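
The meta-review is the same move applied one level up; a sketch, again with the placeholder `llm` call:

```python
def meta_review(llm, store: list[Reflection], window: int = 20) -> str:
    """Ask the model to find themes across its recent lessons."""
    recent = "\n".join(f"- {r.principle}" for r in store[-window:])
    return llm(
        "Here are your recent lessons from similar tasks:\n"
        f"{recent}\n\n"
        "Identify any recurring themes or contradictions, and state one "
        "higher-level principle that summarizes them."
    )
```

The output can be stored like any other reflection, or used as a guide for pruning the store.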

Where it doesn't work

This pattern is best suited to repetitive tasks where the agent encounters similar situations over time. Customer support, data processing, routine automation, code review: anything with enough repetition for patterns to emerge and enough consistency for past lessons to stay relevant.

It's less useful when every task is genuinely unique, when there's no feedback loop to tell the agent whether its approach worked, or when the environment changes so fast that past lessons become misleading. It also adds token cost (budget roughly 200 tokens per reflection) and requires some infrastructure for storing and retrieving reflections, though this can be as simple as a JSON file.
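
A JSON file really is enough to start with; a sketch of the storage layer, assuming the `Reflection` dataclass from earlier:

```python
import json
from dataclasses import asdict

def save(store: list[Reflection], path: str = "reflections.json") -> None:
    with open(path, "w") as f:
        json.dump([asdict(r) for r in store], f, indent=2)

def load(path: str = "reflections.json") -> list[Reflection]:
    try:
        with open(path) as f:
            return [Reflection(**d) for d in json.load(f)]
    except FileNotFoundError:
        return []  # first run: start with an empty store
```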

There's also a real risk of the agent learning the wrong lessons. Models can pick up on false patterns just as easily as real ones, especially with small sample sizes. Human review of the accumulated reflections matters in the early stages, and it's worth periodically pruning reflections that seem too specific or questionable.

Getting started

If you want to try this, the simplest version is just appending "Briefly: what would you do differently next time?" to the end of your agent's task completion prompt, saving the response, and including it in context on the next similar task. That's enough to see whether the pattern helps for your specific application before investing in anything more sophisticated.
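
Tying the sketches above together, the whole loop fits in a few lines. The `llm` stub below is a stand-in so the example runs; replace it with your actual agent and model calls:

```python
def llm(prompt: str) -> str:
    # Stand-in for a real model call; replace with your client.
    return "PRINCIPLE: validate state before irreversible actions."

store = load()
for task in ["refund request #123", "refund request #124"]:  # your queue
    relevant = retrieve(store, task)
    prompt = build_context(relevant, f"Handle this task: {task}")
    result = llm(prompt)  # the task run itself
    store.append(reflect(llm, task_type="refund request", transcript=result))
    save(store)
```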

The broader point is that we tend to treat AI agents as static tools. Configure the prompt, deploy, and hope it works. But there's a real difference between an agent that uses the same approach every time and one that accumulates practical lessons from its own experience, even if those "lessons" are just compressed text injected into a context window. The technique is simple enough that there's not much reason not to try it.