ai-agentllmmemorycontext-windowcompactionbeginners

How Does an AI Agent Remember? — Multi-Turn and Memory Compaction Made Simple

An LLM actually re-reads the whole conversation from scratch every time. With analogies and diagrams, this post explains how agents still seem to 'remember,' and how memory compaction keeps long conversations from overflowing.

Data DynamicsJune 24, 202612 min read

After chatting with an AI agent for a while, you hit a magical moment. You told it "my name is Byeonggon" half an hour ago, and much later it still greets you with "Hi Byeonggon!" It feels like it remembers, just like a person. But you also run into something strange: in a very long conversation, it suddenly forgets what happened early on, or you see a message like "the conversation got long, so I compacted part of it."

These two experiences come from the same single fact: an LLM cannot remember anything on its own. The fact that an agent appears to remember is the result of clever design, and the limits of that design show up exactly when "the context fills up." This post explains how it all works, with almost no code — just analogies.

What you'll learn

That an LLM actually re-reads the entire conversation from scratch on every turn (the amnesia analogy)

The real mechanism behind multi-turn that makes it look like memory

Why long conversations hit a wall — the context window as a desk

The key technique to push past that wall: memory compaction = meeting minutes

What to keep and what to drop, and the common failure modes

Short-term vs long-term memory so it remembers across sessions

1. The shocking truth: an LLM is an amnesiac

Picture the movie trope of someone with short-term amnesia who forgets what just happened after a few minutes. How do you have a real conversation with this person? There's only one way: every time you meet, you re-read the entire conversation so far back to them.

An LLM is exactly like this. The model itself does not store the conversation. Once it produces an answer, it forgets even what it just said. The technical word for this is stateless.

The key: an LLM doesn't "continue a conversation." It's a machine that reads one long passage it's seeing for the first time and writes the next line.

So how does a chatbot or agent know what you said yesterday, or ten minutes ago? The secret isn't in the model — it's in the program wrapped around the model (the agent).

2. How multi-turn really works — "re-send everything every time"

A back-and-forth conversation is called multi-turn. One question-and-answer is a "turn," and several turns chain together. The mechanism is simpler than you'd think.

On every turn, the agent re-writes the entire conversation so far, from the beginning, and sends it to the model.

On the third turn, what the model actually receives looks like this:

[system] You are a friendly assistant.
[user] My name is Byeonggon.
[assistant] Hi Byeonggon!
[user] What was my name again?

The model reads this passage as if for the first time and produces the next line, "Byeonggon." It didn't answer from memory — it answered because the answer was written right in front of it. On the next turn, this answer is included too, and the whole thing is sent again.

Loading diagram…

This diagram is the single most important picture in this post. The true identity of "memory" is "the conversation log re-attached every single time." Half of what an agent does is just managing this attachment.

In one sentence: multi-turn isn't the model's memory — it's the agent's diligence in re-attaching the conversation log every time.

3. But the desk is too small — the context window

A natural question follows. "So if a conversation goes on 10,000 turns, you re-send a 10,000-turn passage every time?" Yes — and that's the problem. There's an upper limit to how long a passage the model can read at once. This limit is called the context window.

Think of it as the model's desk size. It can only read the papers placed on the desk; anything off the desk effectively doesn't exist. The unit for measuring how much fits on the desk is the token — roughly, think one word ≈ one or two tokens and you're fine.

Let's see how the desk fills as a conversation grows. Agents fill the desk far faster than chatbots, because it's not just the user conversation piling up — tool results pile up too.

What piles up on the desk	Example	Volume
System prompt	"You are an assistant that…" rules	Small but always present
User/assistant turns	Questions and answers	Medium
Tool execution results	Reading 200 lines of a file, 50 search hits, an API response	Very large ← the culprit
Model's intermediate reasoning	"First let me check A…"	Large once it accumulates

What happens when the desk is full? Three problems:

Overflow: you can't add more. Past the limit, the request simply fails.
Dropping the middle: with too many papers, the model tends to read the start and end well but skim the middle (often called "Lost in the Middle").
Slow and expensive: re-reading the whole desk every turn means longer conversations are slower and cost more.

So an agent that seems to suddenly forget early details in a long conversation hasn't lost a memory — it cleared that paper off the desk. What to clear is our next topic.

4. The key technique: memory compaction = meeting minutes

When the desk is nearly full but the conversation must continue, what do we intuitively do? Summarize the old papers into a single page to free up space. It's just like keeping a one-page set of minutes after a long meeting and throwing away the verbatim transcript.

That's exactly memory compaction. The name sounds grand, but the idea is simple.

Compaction: when the context (desk) is nearly full, compress the old conversation log down to its essentials as a summary, reclaiming space.

As a picture:

Loading diagram…

Just hold on to two instincts:

Don't touch recent turns. What was just said matters most, so keep it verbatim. The compression targets the old papers.
Let the model do the summarizing. You ask the LLM one more time — "summarize the conversation so far, essentials only" — and replace the old log with the result.

Easy to say, but there's a dangerous trap here: what you keep in the summary versus what you drop. Drop the wrong thing and the agent forgets "what we just decided" and veers off course.

5. What to keep, and what to drop

Think about writing meeting minutes. You don't record "who drank water at what time," but you absolutely record "decided to proceed with plan A by next week." Compaction makes the same judgment.

✅ Must keep	❌ Safe to drop
Goals/tasks not yet finished	Verbose intermediate steps of finished work
Decisions/preferences/constraints the user gave ("in Korean," "don't touch the files")	Dead ends that were tried and abandoned
Key facts (names, environment, versions, numbers)	Duplicate tool outputs saying the same thing
File paths/identifiers being worked on	Full bodies of long files seen once and not reused

It especially helps to handle tool outputs well, since they eat the most desk space in an agent. For example, if you read a 200-line file, instead of keeping all 200 lines, keep a one-line conclusion like "confirmed that function X in that file does Y" plus a pointer to where it can be re-read. If you need the original again, just read it again later.

The essence of compaction isn't "deletion" — it's translation. You're transcribing long, raw records into short, meaningful conclusions.

A good summary follows a fixed template

Free-form summaries ("well, we talked about a few things and…") easily drop something important. So in practice, people force the summary into a fill-in-the-blanks form. You give the model a template like this:

Summarize the conversation so far into the fields below.
Write "none" for any empty field:
 
- User's goal:
- Decisions/constraints made so far:
- Task currently in progress:
- Key facts learned:
- Next steps:

Using a template instead of free writing noticeably reduces the accident where the "task currently in progress" field comes out empty. The form itself is the safety net.

When to start compacting

Too early and you lose perfectly good context; too late and the desk overflows and the request fails. The most common rule is to trigger it when the desk reaches a certain fill ratio.

if used_tokens / context_window_size > 0.8:
    compress the older half into a summary

Triggering around 70–80% full is typical. You don't wait until it's right at the brink — you tidy up ahead of time while there's still room.

6. Remembering across sessions — short-term vs long-term memory

The compaction we've discussed is a technique for tidying the desk within a single conversation. You could call this working (short-term) memory — the papers spread on your desk that you're using right now.

But what we really want is different: closing today's chat and reopening a fresh window tomorrow, and having it recognize "ah, that project from yesterday." Short-term memory alone can't do this, because the desk gets wiped clean when the conversation ends. This is where long-term memory comes in.

	Short-term memory	Long-term memory
Analogy	The desk in front of you	The drawers/filing cabinet beside it
Scope	This one conversation	Across all conversations
Method	Context window + compaction	Stored externally, pulled in when needed
Disappears?	Gone when the conversation ends	Kept permanently

The core of long-term memory is: "write it down somewhere off the desk, and put it on the desk only when needed." There are two main implementations:

Retrieval-based (RAG memory): store old conversations/documents in an external store, and search out only the pieces relevant to the current question to place on the desk. Good when the volume is huge.
File-based (fact notes): write key facts as small files — "user name = Byeonggon," "preferred language = Korean." When the next session starts, you lay just these notes on the desk first. (The very tool used to build this blog uses exactly this kind of memory — one fact per file.)

What both share, and the heart of the idea: the desk (context window) is expensive and small, so keep things in the drawers by default and bring up only what you need, when you need it.

7. One detail beginners often miss

Finally, one trap that bites surprisingly often in practice: the tension between compaction and "prompt caching."

Many LLM services use prompt caching to cut speed and cost. It lets the model quickly reuse the front part of the conversation that's attached identically every time, treating it as "the same thing as before." As long as you don't touch the front papers on the desk, the cache works well.

But compaction swaps out exactly those front papers wholesale. The moment you replace the old conversation with a summary, the cache goes "wait, the front changed?" and is invalidated. So right after a compaction, one response can be slow and expensive.

Lesson: compaction isn't free. It lets you clear the desk so you can go further, but at the moment you clear it you pay the cost of throwing away the cache. That's why it's better to tidy in one big pass when you hit the threshold, rather than a little bit every turn.

8. Wrap-up — one-page summary and a checklist

We took the long way, but it all converges into a single sentence.

An LLM can't remember. So the agent re-attaches the conversation every turn (multi-turn), compresses old records into a summary when the desk fills (compaction), and writes down what must survive into the drawers (long-term memory).

A checklist for when you design or pick an agent's memory:

Do you have a rule for what to keep (goals/decisions/facts vs duplicate outputs/dead ends)?
Do you force the summary into a template, not free writing?
Is the compaction trigger point defined (e.g., desk at 70–80%)?
Do you shrink large tool outputs into a conclusion + location pointer?
Is long-term memory separated out for cross-session recall?
Have you accounted for the cache cost of compaction?

The impression that an AI agent "remembers intelligently" is really the result of diligently doing these simple bits of housekeeping. It's not magic — it's tidying up.

Coming next: this series will soon cover implementing long-term memory with RAG and multi-agent memory where several agents share what they remember — at the same beginner-friendly level.