ai-agent · llm · memory · context-window · compaction · prompt-caching · python · anthropic

Building Agent Memory Compaction Yourself — From the Multi-Turn Loop to Caching and Long-Term Memory

The intro post covered the 'why.' This one is the 'how.' We implement the multi-turn loop, token measurement, a threshold-based compaction engine, tool-output shrinking, the interaction with prompt caching, and long-term memory — line by line in Python on the Anthropic SDK, with diagrams.

Data Dynamics · June 25, 2026 · 17 min read

This is the deep-dive companion to How Does an AI Agent Remember? — Multi-Turn and Memory Compaction Made Simple. Where the intro used analogies — the "desk" (context window), "meeting minutes" (compaction), the "drawers" (long-term memory) — to explain why we need this design, this post turns those analogies into working code.

No more analogies. Instead we write a multi-turn loop, actually measure tokens, attach a compaction engine that fires at a threshold, shrink tool outputs, place everything so it doesn't fight prompt caching, and finally layer on long-term memory that survives across sessions. The code uses the Anthropic Python SDK (anthropic) and Claude models, but the structure ports directly to any LLM API.

What we build here

A multi-turn loop that carries state in a message array

Instrumentation that measures real occupancy with count_tokens

A compaction engine that fires at a threshold (split → summarize → replace)

A template-enforced summary prompt plus validation

Shrinking huge tool outputs to a conclusion + pointer

Placing compaction so it doesn't break prompt caching

File-based / retrieval-based long-term memory across sessions

A single Agent class that ties it all together

Premise — the one-liner from the intro: "An LLM is stateless. Multi-turn means re-sending the entire conversation every turn; compaction means swapping the old part for a summary." That sentence is the skeleton of this whole post.

1. State is just one message array

First, let's be clear: an agent's "memory" is not a special data structure — it's one message array (a list). Each turn we append the user message, send the whole array to the model, and append the returned answer back.

import anthropic
 
client = anthropic.Anthropic()  # uses the ANTHROPIC_API_KEY env var
 
MODEL = "claude-sonnet-4-6"
SYSTEM = "You are a senior engineer helping operate a data platform. Answer concisely."
 
# This list IS the 'short-term memory.' Nothing more, nothing less.
messages: list[dict] = []
 
def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
 
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,
        messages=messages,          # ← re-send the ENTIRE thing every turn
    )
 
    assistant_text = "".join(
        block.text for block in resp.content if block.type == "text"
    )
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text

The intro's key picture shows up directly in code. messages accumulates, and the create() call takes all of it as an argument every time. The model holds no state, so the only thing holding state is this messages list.

The problem: leave this loop alone and messages grows without bound. An agent that uses tools grows even faster. So we have to measure the size of this array and compress it once it crosses a line. Measurement first.
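Before measuring, a rough back-of-the-envelope sketch shows how fast the re-sent volume adds up. The function name and the 500-tokens-per-exchange figure are invented for illustration, not measurements. Because every turn re-sends the whole array, the total input sent over a session grows roughly with the square of the turn count.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a session when each turn re-sends
    the whole history.

    tokens_per_turn is an assumed average size of one user+assistant
    exchange (a hypothetical figure, used only for illustration).
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn   # the array grows by one exchange
        total += history             # and the whole array is sent again
    return total


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, cumulative_input_tokens(n, 500))
```

Ten times as many turns costs roughly a hundred times as much input, which is why the loop needs a size check at all.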

2. Measure tokens — don't guess

To decide when to fire compaction, you need to know "what percentage of the context window does messages occupy right now?" Estimating by character or word count goes badly wrong once tool outputs, code, and multiple languages mix in. Fortunately the Anthropic SDK offers an endpoint that counts tokens without actually sending the request.

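def count_tokens(messages: list[dict], system: str, tools: list | None = None) -> int:
    # Pass the optional fields only when they are set, so an empty system
    # prompt or an empty tool list is simply left out of the request.
    kwargs: dict = {"model": MODEL, "messages": messages}
    if system:
        kwargs["system"] = system
    if tools:
        kwargs["tools"] = tools
    res = client.messages.count_tokens(**kwargs)
    return res.input_tokens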

count_tokens reports pre-call occupancy without charging you; it is only subject to its own rate limit, which is separate from the limit on message creation. Be sure to include the system prompt and the tool definitions (JSON schemas) in the count: tool schemas are surprisingly heavy and are attached on every single turn.

A practical tip: calling count_tokens every turn adds a network round-trip. A common pattern is to treat the usage field of each response as the primary signal, and to call count_tokens only to confirm precisely when you are near the threshold.

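# create() responses always carry usage, so you know occupancy with no extra call.
resp = client.messages.create(...)
u = resp.usage
# With prompt caching on, input_tokens excludes cached tokens; add them back.
used = (u.input_tokens + (u.cache_creation_input_tokens or 0)
        + (u.cache_read_input_tokens or 0) + u.output_tokens)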

Put the context-window size (e.g. 200K, up to 1M depending on the model) in the denominator and you get a fill ratio. That ratio is the input to the compaction trigger.

CONTEXT_WINDOW = 200_000   # the context window of the model you use
COMPACT_AT = 0.75          # compact when 75% full
KEEP_RECENT_TOKENS = 20_000  # keep this much of the most recent turns verbatim

Why 75% and not 100%? Compaction is itself a model call, and that call also costs input tokens. If you wait until full to compress, there's no room left to even send the compaction request. Always fire with headroom.
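The two-tier measurement described above (cheap usage signal first, precise count only near the line) could be sketched like this. CONFIRM_MARGIN and should_compact are hypothetical names introduced here, and the precise counter is passed in as a callable so the sketch runs without a network call.

```python
from typing import Callable

CONTEXT_WINDOW = 200_000
COMPACT_AT = 0.75
CONFIRM_MARGIN = 0.10   # assumption: start confirming 10 points below the threshold


def should_compact(
    used_from_usage: int,
    precise_count: Callable[[], int],
    window: int = CONTEXT_WINDOW,
    compact_at: float = COMPACT_AT,
    margin: float = CONFIRM_MARGIN,
) -> bool:
    """Use the cheap usage figure first; call the precise counter only
    when the cheap figure is within `margin` of the threshold."""
    ratio = used_from_usage / window
    if ratio < compact_at - margin:
        return False                      # far from the line: no extra call
    return precise_count() / window > compact_at
```

In the real loop, precise_count would be something like `lambda: count_tokens(messages, SYSTEM, tools)`.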

3. The compaction engine: split → summarize → replace

Now the core. Compaction breaks into three steps.

Split: divide messages into an "old part" and a "recent part." Never touch the recent part.
Summarize: have the model summarize the old part to a fixed template.
Replace: clear the old part entirely and drop the summary into its place.

3-1. Split: what counts as "recent"?

The simplest, safest rule is to accumulate tokens from the back until you hit the amount you decided to keep (KEEP_RECENT_TOKENS); that's "recent." We cut by tokens rather than turn count because a single turn can eat tens of thousands of tokens with one huge tool output.

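import json

def _approx_tokens(msg: dict) -> int:
    """Rough local size estimate (about 4 characters per token; an assumption).
    A lone assistant or tool_result message may not be a valid request by
    itself, so single messages are not sent to count_tokens here. This is
    good enough to pick a cut point; the trigger still uses the real count."""
    return len(json.dumps(msg, default=str, ensure_ascii=False)) // 4 + 1

def split_old_recent(messages: list[dict], keep_recent_tokens: int):
    """Accumulate from the back to secure the recent part; return the rest as old."""
    recent: list[dict] = []
    acc = 0
    # walk from the back (recent) toward the front (past)
    for msg in reversed(messages):
        t = _approx_tokens(msg)  # per-message approximate measure, no network call
        if acc + t > keep_recent_tokens and recent:
            break
        recent.append(msg)
        acc += t
    recent.reverse()
    split_idx = len(messages) - len(recent)
    old = messages[:split_idx]
    return old, recent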

One trap: in a tool-using agent, an assistant tool call (tool_use) and the user tool result (tool_result) must form a pair. If the split line runs through the middle of such a pair, the API rejects it. So right after splitting you need to nudge the boundary to preserve pairs.

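def fix_tool_boundary(old: list[dict], recent: list[dict]):
    """If recent's first message starts with tool_result, its matching tool_use
    is at the end of old. Move that tool_result into old so the pair stays
    together (and gets summarized together)."""
    recent = list(recent)  # don't mutate the caller's list
    while recent and _starts_with_tool_result(recent[0]):
        old = old + [recent.pop(0)]  # this message gets summarized away anyway
    return old, recent

def _starts_with_tool_result(msg: dict) -> bool:
    # tool_result blocks only appear in user messages, which we build as plain
    # dicts; assistant content may hold SDK block objects, which have no .get()
    content = msg.get("content")
    if msg.get("role") == "user" and isinstance(content, list) and content:
        first = content[0]
        return isinstance(first, dict) and first.get("type") == "tool_result"
    return False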
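To check the pairing rule offline, a small validator along these lines may help. It is a local approximation of the rule described above, not the API's actual validation logic, and the names orphan_tool_results, _blocks and _field are hypothetical helpers introduced for this sketch.

```python
def _field(block, name: str, default=None):
    """Read a field from either a plain dict or an SDK block object."""
    if isinstance(block, dict):
        return block.get(name, default)
    return getattr(block, name, default)


def _blocks(msg: dict) -> list:
    """Return the content block list, or [] when content is a plain string."""
    content = msg.get("content")
    return content if isinstance(content, list) else []


def orphan_tool_results(messages: list[dict]) -> list[str]:
    """Return the tool_use_id of every tool_result whose tool_use does not
    appear in the most recent assistant message before it."""
    orphans: list[str] = []
    offered: set = set()          # tool_use ids from the latest assistant turn
    for msg in messages:
        if msg.get("role") == "assistant":
            offered = {_field(b, "id") for b in _blocks(msg)
                       if _field(b, "type") == "tool_use"}
            continue
        for b in _blocks(msg):
            if (_field(b, "type") == "tool_result"
                    and _field(b, "tool_use_id") not in offered):
                orphans.append(_field(b, "tool_use_id"))
    return orphans
```

Running it on the recent part right after a split shows whether the cut went through a pair: an empty list means the boundary is clean.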

3-2. Summarize: no free writing, enforce a template

We turn the intro's "fill-in-the-blanks form" into an actual prompt. The key is to not allow free-form narration. Fix the fields, and require empty fields to be explicitly marked "none."

COMPACT_PROMPT = """\
Below is the past conversation log of an agent session. Write a 'handoff note'
distilling only the essentials so work can continue. Follow the field structure
below exactly, and write "none" for any field that doesn't apply. No speculation
or invention — only facts present in the log.
 
## User's goal
## Confirmed decisions/constraints/preferences
## Task currently in progress (most important)
## Key facts learned (names/paths/versions/numbers/identifiers)
## Dead ends tried and abandoned (to avoid repetition)
## Next steps
"""
 
def summarize_old(old: list[dict]) -> str:
    res = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system="You are a summarizer that compacts conversation logs into handoff notes.",
        messages=old + [{"role": "user", "content": COMPACT_PROMPT}],
    )
    return "".join(b.text for b in res.content if b.type == "text")
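One caveat with sending the old part as raw messages: if it contains tool_use or tool_result blocks and the summarizer request defines no tools, the request may be rejected. A possible workaround, sketched here under that assumption, is to flatten the old part into a plain-text transcript first. render_transcript, render_block and _field are hypothetical helpers, not part of the SDK.

```python
import json


def _field(block, name: str, default=None):
    """Read a field from either a plain dict or an SDK block object."""
    if isinstance(block, dict):
        return block.get(name, default)
    return getattr(block, name, default)


def render_block(block) -> str:
    """Turn one content block into a line of plain text."""
    kind = _field(block, "type")
    if kind == "text":
        return _field(block, "text", "")
    if kind == "tool_use":
        args = json.dumps(_field(block, "input", {}),
                          ensure_ascii=False, default=str)
        return f"[tool call] {_field(block, 'name')}({args})"
    if kind == "tool_result":
        body = _field(block, "content", "")
        if isinstance(body, list):   # a tool_result may itself hold blocks
            body = "\n".join(render_block(b) for b in body)
        return f"[tool result] {body}"
    return f"[{kind} block omitted]"


def render_transcript(messages: list[dict]) -> str:
    """Flatten a message array into 'ROLE: text' paragraphs."""
    parts = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            text = content
        else:
            text = "\n".join(render_block(b) for b in content or [])
        parts.append(f"{msg.get('role', 'unknown').upper()}: {text}")
    return "\n\n".join(parts)
```

With this in place, the summarizer call could send a single user message made of render_transcript(old) followed by COMPACT_PROMPT, which also sidesteps any role-ordering concerns in the old part.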

After receiving the summary, add a validation step. The most dangerous failure is the "task currently in progress" field coming out empty, so do a light check that it got filled and re-request once if needed.

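def validate_summary(text: str) -> bool:
    # minimal check that the template held and the in-progress field isn't gutted
    must_have = ["## Task currently in progress", "## Next steps"]
    if not all(h in text for h in must_have):
        return False
    # the in-progress section must have a body (an explicit "none" still counts)
    body = text.split(must_have[0], 1)[1].split("\n## ", 1)[0]
    return bool(body.partition("\n")[2].strip())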

3-3. Replace: plant the summary back into the conversation

The simplest and safest move is to plant the summary as a single user message at the front of the array. Add a clear marker so the model knows "this is a compressed version of the earlier conversation."

def compact(messages: list[dict]) -> list[dict]:
    old, recent = split_old_recent(messages, KEEP_RECENT_TOKENS)
    if not old:
        return messages  # nothing old to compress
 
    old, recent = fix_tool_boundary(old, recent)
    summary = summarize_old(old)
    if not validate_summary(summary):
        summary = summarize_old(old)  # one retry
 
    marker = (
        "[Summary of earlier conversation — the original was compacted to save context]\n\n"
        + summary
    )
    return [{"role": "user", "content": marker}] + recent

These three functions execute the intro's "meeting minutes" picture with zero analogy left: old (long, verbose past) → summary (one short page) → [summary] + recent (a tidied desk).

4. Shrink tool outputs as they arrive

The intro said "the biggest desk-eater is tool output." If compaction is cleanup after the fact, shrinking tool output is prevention up front. If you read a 200-line file, don't put those 200 lines into the array — keep only a conclusion + a pointer to where it can be re-read.

TOOL_RESULT_MAX_CHARS = 4000
 
def shrink_tool_result(name: str, args: dict, raw: str) -> str:
    if len(raw) <= TOOL_RESULT_MAX_CHARS:
        return raw
    head = raw[:1500]
    tail = raw[-1500:]
    pointer = f"{name}({args})"  # re-call with the same args to re-fetch the whole thing
    return (
        f"{head}\n"
        f"... [elided: only part of {len(raw)} chars kept. "
        f"If you need the whole thing again, re-run `{pointer}`] ...\n"
        f"{tail}"
    )

A key design principle hides here: you don't delete it, you discard it in a way that lets you re-fetch. Leave the pointer (tool name + args) and the model can re-call the same tool with the same args to recover the original when needed. Context isn't a cache — it's a workbench, and a workbench has to be clearable.
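The pointer above is a human-readable string. If you also want code (a test, say) to follow it, one option is to store it as JSON. This is a sketch under the assumptions that the tool is deterministic and its args are JSON-serializable; make_pointer, refetch and shrink_with_pointer are hypothetical names.

```python
import json


def make_pointer(name: str, args: dict) -> str:
    """Serialize the tool name and args so the call can be replayed later."""
    return json.dumps({"tool": name, "args": args},
                      sort_keys=True, ensure_ascii=False)


def refetch(pointer: str, registry: dict) -> str:
    """Re-run the tool a pointer refers to and return the full output."""
    spec = json.loads(pointer)
    return registry[spec["tool"]](**spec["args"])


def shrink_with_pointer(name: str, args: dict, raw: str,
                        max_chars: int = 4000, keep: int = 1500) -> str:
    """Keep the head and tail of a large output plus a replayable pointer."""
    if len(raw) <= max_chars:
        return raw
    pointer = make_pointer(name, args)
    return (f"{raw[:keep]}\n"
            f"... [elided: {len(raw)} chars total. "
            f"Pointer to re-fetch: {pointer}] ...\n"
            f"{raw[-keep:]}")
```

A pointer-recovery check then reduces to comparing refetch(pointer, registry) with the original output.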

Wiring this shrink into the tool loop looks like:

def run_tool_and_record(messages, tool_use_block, registry):
    name = tool_use_block.name
    args = tool_use_block.input
    raw = registry[name](**args)              # actually run the tool
    shrunk = shrink_tool_result(name, args, raw)
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_block.id,
            "content": shrunk,                # ← put the shrunk version into memory
        }],
    })

You need both up-front shrinking (tool output) and after-the-fact compaction. The former prevents blowups; the latter cleans up accumulation. Neither alone survives a long session.

5. Prompt caching and compaction collide head-on

Here's the trap the intro foreshadowed, pinned down in code. Anthropic's prompt caching reuses an unchanging prefix of the request, cutting cost and latency substantially. The prefix is built in the order tools, then system, then messages, and everything up to a block marked with cache_control is treated as the cacheable prefix.

resp = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM},
        # the system prompt barely changes, so it's good to cache
        {"type": "text", "text": LONG_STABLE_GUIDE,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=messages,
)
# verify cache effect via usage
print(resp.usage.cache_creation_input_tokens,  # newly cached this time
      resp.usage.cache_read_input_tokens)       # reused from cache

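The problem is clear. Caching works "when the front stays the same," but compaction replaces exactly that front (the old messages) wholesale. The turn where compaction happens invalidates the cached message prefix, so that one turn is expensive and slow. A cached system prompt, which sits before the messages, survives as long as its own text is unchanged.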
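To watch this effect in your own logs, a small helper over the usage fields can be enough. cache_hit_ratio is a hypothetical name, and the numbers in the example are made up; the sketch uses SimpleNamespace as a stand-in for a real usage object.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Share of this request's input tokens that were read from cache.

    Total input is the sum of uncached input, tokens newly written to the
    cache, and tokens read from the cache.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = read + created + fresh
    return read / total if total else 0.0


if __name__ == "__main__":
    # Invented figures: a warm turn, then the turn right after compaction.
    warm = SimpleNamespace(input_tokens=500,
                           cache_creation_input_tokens=0,
                           cache_read_input_tokens=9_500)
    after_compaction = SimpleNamespace(input_tokens=3_000,
                                       cache_creation_input_tokens=7_000,
                                       cache_read_input_tokens=0)
    print(cache_hit_ratio(warm), cache_hit_ratio(after_compaction))
```

A ratio that dips on the compaction turn and recovers on the next one is the pattern you want to see.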

Two practical rules fall out of this.

Don't compress a little, often. Touching the front every turn breaks the cache every turn. Compress once, big, at the threshold, then ride the new prefix as a warm cache for many turns afterward — lower total cost.
Align the cache boundary with the compaction boundary. Right after compaction, stamp cache_control at the end of the summary block in the new [summary] + recent structure, so following turns reuse this stable summary from cache.

def with_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Mark a cache boundary on the leading summary message (call right after compaction)."""
    if not messages:
        return messages
    first = messages[0]
    # normalize content to a fresh block list (so the caller's list is not
    # mutated), then stamp cache_control on the last block
    blocks = (list(first["content"]) if isinstance(first["content"], list)
              else [{"type": "text", "text": first["content"]}])
    blocks[-1] = {**blocks[-1], "cache_control": {"type": "ephemeral"}}
    return [{**first, "content": blocks}] + messages[1:]

Summary: compaction trades throwing away the cache once for an emptied context. So make that once rare and big, and immediately designate the new post-compaction prefix as a cache target to recoup the cost quickly.

6. Long-term memory across sessions

So far we've handled short-term memory (messages) within one session. When the session ends, this array is gone. Time to implement the intro's "drawers."

6-1. File-based (fact notes) — simplest and strongest

Write key facts to small files and lay them into the system prompt at the start of the next session. A "one file = one fact" structure keeps updates and deletes clean (the memory of the very tool used to build this blog works exactly this way).

import json, pathlib
 
MEM_DIR = pathlib.Path("./agent_memory")
MEM_DIR.mkdir(exist_ok=True)
 
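def remember(key: str, fact: str):
    # key becomes a file name, so keep it a simple slug (no path separators)
    (MEM_DIR / f"{key}.json").write_text(
        json.dumps({"key": key, "fact": fact}, ensure_ascii=False),
        encoding="utf-8")  # explicit encoding: facts may contain non-ASCII text

def load_long_term() -> str:
    facts = []
    for f in sorted(MEM_DIR.glob("*.json")):
        d = json.loads(f.read_text(encoding="utf-8"))
        facts.append(f"- {d['fact']}")
    return "\n".join(facts)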
 
# at session start: inject long-term memory into the system prompt
def build_system() -> str:
    longterm = load_long_term()
    if not longterm:
        return SYSTEM
    return SYSTEM + "\n\n[Facts learned in earlier sessions]\n" + longterm

It's natural to call remember() during the compaction step — when building the summary, also pull out the "facts to preserve until the next session" and drop them to files. You're promoting from short-term (desk) to long-term (drawers).
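One way this promotion might look: pull the "Key facts learned" section out of the handoff note and derive a stable file key for each fact. extract_section and fact_key are hypothetical helpers, and the sample facts in the usage note are made up.

```python
import hashlib


def extract_section(summary: str, header: str) -> list[str]:
    """Return the non-empty body lines under `header`, up to the next
    '## ' header. Lines that are just "none" are dropped."""
    body: list[str] = []
    inside = False
    for line in summary.splitlines():
        if line.startswith("## "):
            inside = line.startswith(header)   # header may carry a suffix
            continue
        if inside and line.strip():
            # strip list markers such as "- " or "* "
            body.append(line.strip().lstrip("-* ").strip())
    return [b for b in body if b.lower() != "none"]


def fact_key(fact: str) -> str:
    """Stable 12-character key, so the same fact lands in the same file."""
    return hashlib.sha1(fact.encode("utf-8")).hexdigest()[:12]
```

Inside the compaction step you could then loop over extract_section(summary, "## Key facts learned") and call remember(fact_key(fact), fact) for each entry. Whether every learned fact deserves to outlive the session is a judgment call; a stricter filter may suit your agent better.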

6-2. Retrieval-based (RAG memory) — when the volume is huge

Once facts grow into the thousands, you can't lay them all into the system prompt. Then you put them in an external store (a vector DB, say) and retrieve only the pieces relevant to the current question onto the desk for that turn alone.

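def recall_relevant(query: str, k: int = 5) -> str:
    # vector_store and embed are placeholders for your own vector DB client
    # and embedding function; they are not defined in this post
    hits = vector_store.search(embed(query), top_k=k)  # embedding similarity search
    return "\n".join(f"- {h.text}" for h in hits)

def chat_with_recall(user_text: str) -> str:
    # pull in memory needed only for this turn and splice it temporarily
    relevant = recall_relevant(user_text)
    augmented = user_text
    if relevant:
        augmented = f"[relevant memory]\n{relevant}\n\n[question]\n{user_text}"
    # send the augmented text for this call only; store the plain question,
    # so recalled memory does not pile up in the history
    resp = client.messages.create(
        model=MODEL, max_tokens=1024, system=SYSTEM,
        messages=messages + [{"role": "user", "content": augmented}],
    )
    assistant_text = "".join(b.text for b in resp.content if b.type == "text")
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text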

File-based and retrieval-based share one philosophy: the context window is expensive and small, so keep things outside by default and bring up only what you need, when you need it. File-based fits "a core small enough to always lay down"; retrieval-based fits "something so vast you must pick." In practice you mix both.
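To make the retrieval idea concrete without any external service, here is a toy stand-in. It uses bag-of-words counts and cosine similarity in place of a real embedding model and vector DB, so it only matches on shared words. ToyVectorStore, embed and cosine are hypothetical names, and unlike the snippet above, search returns plain strings rather than hit objects.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store: keeps (text, vector) pairs and scans them linearly."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query_vec: Counter, top_k: int = 5) -> list[str]:
        scored = [(cosine(query_vec, vec), text) for text, vec in self._items]
        scored = [s for s in scored if s[0] > 0]      # drop non-matches
        scored.sort(key=lambda s: s[0], reverse=True)
        return [text for _, text in scored[:top_k]]
```

The point of the sketch is the shape of the interface: add facts once, then fetch only the top few for the current question.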

7. Tie it all together — a single Agent class

Stitch the pieces into one and you get a minimal agent that survives a long session. Stripped to its skeleton:

class CompactingAgent:
    def __init__(self, tools: dict):
        self.tools = tools
        self.messages: list[dict] = []
 
    def step(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
 
        while True:
            resp = client.messages.create(
                model=MODEL,
                max_tokens=2048,
                system=build_system(),                 # inject long-term memory
                messages=self.messages,
                tools=[t.schema for t in self.tools.values()],
            )
            self.messages.append({"role": "assistant", "content": resp.content})
 
            if resp.stop_reason == "tool_use":
                for block in resp.content:
                    if block.type == "tool_use":
                        run_tool_and_record(self.messages, block, self.tools)
                self._maybe_compact()                  # check after tools too
                continue                               # one more pass with results
 
            self._maybe_compact()                      # check after the turn ends
            return "".join(b.text for b in resp.content if b.type == "text")
 
    def _maybe_compact(self):
        used = count_tokens(self.messages, build_system(),
                            [t.schema for t in self.tools.values()])
        if used / CONTEXT_WINDOW > COMPACT_AT:
            self.messages = compact(self.messages)
            self.messages = with_cache_breakpoint(self.messages)

Here's the whole control flow in one picture. Every analogy from the intro reduces to one box in this diagram.

[Diagram: control flow of the agent loop]

8. Don't get burned in production — tests and a checklist

Compaction is the kind of feature that fails silently. Dropping the wrong thing throws no error; the agent just gets a little dumber, so you notice late. So bake the following into automated tests.

Decision-preservation test: give a constraint early ("never touch the files"), force a compaction once, and check the constraint survives in the summary.
Tool-pair integrity test: deliberately create a split through the middle of a tool_use / tool_result pair and check the API doesn't reject it after fix_tool_boundary.
Pointer-recovery test: after shrinking a tool output, verify the model can re-call the same tool from the pointer and recover the original.
Cache accounting test: confirm via usage that cache_read_input_tokens drops on the compaction turn and climbs again from the next turn.

Finally, the implementation checklist (the code counterpart of the intro's checklist).

Do you keep state simple as one message array?
Do you measure occupancy in tokens, not characters?
Do you fire compaction once, big, at the threshold (70–80%)?
Do you keep the recent window verbatim by token budget, without breaking tool pairs?
Do you enforce a template for the summary and validate the key fields?
Do you shrink tool outputs to a conclusion + re-fetch pointer?
Do you re-stamp the cache boundary on the new prefix right after compaction?
Do you promote facts that should outlive the session to long-term memory?

Closing

If the intro said "an agent's memory isn't magic, it's tidying up," this post decomposed that tidying into a handful of functions. Measure with count_tokens, compress with split → summarize → replace, shrink tool output on arrival, align the cache boundary with the compaction boundary, and promote what must survive to the drawers — that's all of it.

No fancy data structures, no secret model internals. One diligent loop holding state on top of a stateless model. Good agent memory ultimately comes down to how disciplined that loop is.

Coming next: this series will soon cover implementing long-term memory with RAG for real (embeddings, chunking, re-ranking) and multi-agent memory where several agents share what they remember — at the same code-level depth.