Memory is Not Storage: A Week of Debugging My AI Agent

By Everly 2026-05-10
Memory is Not Storage: A Week of Debugging My AI Agent

Why I Made My AI Agent Dumber on Purpose

I built Stubby, the personal AI back office described in the previous post. The architecture is sound — air gap, tiered models, sandboxed handoff folder, Discord interface. What I underestimated was how much friction there'd be in the layer between "system runs" and "system runs reliably."

Most of that friction came from a single failure pattern: the agent confidently reported things were done that weren't. It would say "rule saved to memory" and the rule would be silently dropped to make room for the next one. It would say "context.md updated" and the file would be overwritten with a stub instead of evolved. It would say "fix applied to nightly_brief.py" and paste back diffs that contained duplicated code blocks I had to catch by eye.

The fix wasn't better prompts. The fix was structural: I gave the agent fewer responsibilities, and it became more reliable.

This post is about two things I learned in the process:

  1. Memory in agent systems is a leaky bucket, not a notebook. Treating it as cache rather than storage changes how you design.
  2. Asking one agent to both propose and verify changes is asking for silent bugs. Splitting those roles across systems made the difference between debugging once a day and debugging once a week.

The leaky bucket problem

Stubby has a persistent memory store with a hard cap of 2,200 characters. When I started using it, I treated it the way you'd treat a database: write rules, they stay. Stubby wrote rules. They didn't always stay.

The discovery came after the third or fourth instance of Stubby ignoring an instruction I'd given it days earlier. "Don't post the morning brief to multiple channels." Forgotten. "Write handoff notes to Drive, never local filesystem." Forgotten. "Context.md is a Google Doc, never create a new file with that name." Forgotten.

I asked Stubby to dump the full memory state. Six entries, total around 2,267 characters. Over quota.

Entry 1 — NIGHTLY-BRIEF CRON JOB (360 chars)
Entry 2 — HANDOFF CONTEXT FILE PATTERN (379 chars)
Entry 3 — GOOGLE CREDENTIALS & TOKEN MANAGEMENT (330 chars)
Entry 4 — HANDOFF CONTENT RULE: GOOGLE DRIVE ONLY (544 chars)
Entry 5 — CONTEXT NOTES — SAVE VERBATIM (235 chars)
Entry 6 — CODE CHANGE WORKFLOW (419 chars)

Total: 2,267 / 2,200 chars (103% — over quota)

The most recent rule had been added by silently evicting a timezone preference I'd set earlier. No warning, no log line, no message in chat. Just gone.

Once I saw that, the pattern across the previous week made sense. Every "instruction not sticking" issue traced back to memory pressure. As I added new rules to handle new situations, older rules got squeezed out to make room. The agent kept saying "got it, saved" because from its perspective, that was true — the new rule went in. It just didn't know what came out.

The 2,200-character cap turned out to be a useful constraint. Memory in agent architectures like this one is designed to hold a small set of always-on behavioral rules. It's not meant to be a project notebook or a running conversation like a browser-based chat. The constraint forces discipline: if your memory is filling up, you're using it for the wrong thing.

The fix was a hierarchy.

Memory hierarchy across four layers The four-layer architecture. Each layer holds a different category of knowledge with different lifecycle and access patterns.

  • Memory holds always-on behavioral guardrails. Things like credential paths, "write to Drive not local," "context.md is a Google Doc." Maybe four to six rules total.
  • Skills (workflow files loaded on slash command) hold the substance of how to do specific things. Each can be hundreds of lines, loaded only when relevant.
  • Drive documents hold project state. Context, history, ongoing threads.
  • Scripts hold the deterministic automation. Cron-driven, no agent reasoning involved.

After the cleanup, memory dropped to five entries totaling about 1,500 characters. Headroom for new rules, no more silent eviction. Workflow knowledge that used to live in memory moved to skill files where it could be longer, more directive, and version-controlled.

The reframe that helped me: memory is the agent's hardcoded operating principles. If a rule is specific to one workflow, one project, or one moment, it doesn't go there.

The proposer-verifier conflict

The second pattern was harder to see, because it didn't have a clean failure signal.

I'd ask Stubby to fix a bug in one of the automation scripts. Stubby would read the code, propose a diff, apply the diff, run a syntax check, and report success. The reports were thorough and reassuring. The actual code on disk was, often, wrong.

A representative example: I asked Stubby to add chunking logic to a function that posts long messages to Discord (the webhook caps messages at 2000 characters and was truncating briefs mid-word). The discussion was clean — paragraph splitting, fall back to sentences, fall back to words. Stubby agreed, wrote it, said "done, syntax check passed."

The pasted code had the inner sentence-splitting block duplicated. Two near-identical chunks of logic, one after the other, with variable definitions accidentally split between them. The function would have NameError'd at runtime. The syntax check passed because Python doesn't catch undefined-at-runtime bugs at compile time.

I caught it because I was reading the diff carefully, with another Claude instance helping me review. If I'd just trusted the "done, verified" report and let the cron run that night, the morning Discord post would have failed.

The pattern repeated across other fixes. The agent wasn't being deceptive. It was caught in a structural conflict: the same entity was proposing the change, applying it, and verifying it. There was no second pair of eyes between "I think this is right" and "this is on disk."

Proposer and verifier roles, before and after Left — the default agent loop, where the same entity proposes, applies, and verifies. Right — the separated flow, where proposing happens in chat, application happens via Stubby, and verification is a human-in-the-loop step.

This isn't unique to Stubby. It's the default for most agent setups. You ask the agent to do a thing, the agent does the thing, the agent reports the thing was done. When the agent's internal model of the change drifts from what actually got written — through context loss, compaction, or sloppy paste — you get silent bugs.

The fix was role separation:

  • The thinking happens in chat — a separate Claude instance with full context window, careful review, no edit-and-apply pressure.
  • Stubby reads from disk and writes to disk only. No proposing. No diff generation. No editing logic. Paste the function, receive the replacement, write it, run syntax check, report PASS or FAIL.
  • I'm the verifier and decision-maker. I see proposals before they hit disk. I approve or reject. I run the test.

Codifying this took the form of a skill — a markdown file Stubby loads when summoned by slash command. Here's the relevant section from code-changes/SKILL.md:

## Stubby's role: read-from-disk, write-to-disk only

Stubby does not propose changes, generate diffs, or edit logic.
Everly drafts the changes (typically with another Claude instance who
has more context-window space and can review carefully). Stubby's job
is to read code from disk, paste it as-is, and later replace it with
whatever full-replacement Everly sends back.

If Stubby finds itself writing "I think this would be better if..."
or "OLD: ... NEW: ...", stop immediately. That is outside this role.

## Anti-patterns to refuse

- Proposing diffs. "Here's what I think should change..." — that's
  not Stubby's job.
- Editing based on perceived bugs. If Stubby spots what looks like a
  bug while reading the code for paste, mention it in a separate
  sentence. Do not fix it.
- Summarizing the change. Everly drafted the change. She knows what
  it does. PASS or FAIL only.
- Writing before confirming the full replacement is in hand. If the
  message ends mid-function or looks cut off, ask for a repaste.
- Silent formatting changes. No reformatting, no import reordering,
  no whitespace cleanup. Replace exactly what was given.

The directive style is intentional. Skills aren't suggestions; they're the rules of the game when that workflow is active. When I type /code-changes in Discord, Stubby loads this file and binds itself to those constraints for the duration of the task.

The change in reliability was immediate. Bugs surfaced at proposal time, not at 6 AM when the cron failed.

What this looks like in the wild

To make the abstraction concrete, here's the actual function that broke and got fixed under the new workflow. It's the heart of the nightly pipeline — the call to Claude Opus that synthesizes the morning brief and evolves the persistent context document:

user_message = f"""You are updating a persistent context document
that captures the current state of the world for Everly. Below is
the current context.md. Produce an evolved version that integrates
today's new information.

Rules for evolving the context:
- Preserve the existing section headings exactly. Do not add or
  remove sections without strong justification.
- When retiring an item that is no longer current, move it to a
  'Recently Retired' section at the bottom with the date, rather
  than silently deleting it.
- When a fact has changed, update it in place and append
  '(updated YYYY-MM-DD)'.
- Match the existing voice and tone of the document.

=== CURRENT CONTEXT.MD ===
{context}
=== END CONTEXT ===

=== TODAY'S NEW INFORMATION ===
CALENDAR: {events_section}
WEATHER: {weather}
NEWS: {news}
HANDOFF FILES: {handoff_section}
=== END NEW INFORMATION ===

Now output TWO sections separated by a clear delimiter:

<CONTEXT>
[Output ONLY the full evolved context.md content here.]
</CONTEXT>

<BRIEF>
[Output the morning brief: a concise synthesis of today's
priorities and key information for Everly.]
</BRIEF>"""

This is straightforward code. It's also the function where Stubby (in proposer-mode) had previously shipped a version that nuked the existing context every night by replacing it with a stub instead of evolving it. The bug wasn't in the code Stubby wrote — it was in the fact that Stubby was both writing and verifying it, with no one between those two steps.

Under the new workflow, Stubby never touches this file unless I've drafted the change elsewhere and approved it. The function exists because I designed it. Stubby executes it deterministically every night at 23:00 Mountain Time, but Stubby didn't author it and can't unilaterally modify it.

What generalizes

If you're building anything similar, two heuristics carry most of the weight:

Treat memory as cache, not storage. Permanent rules go in memory. Workflow knowledge goes in skills. Project state goes in external docs. If you find yourself wanting more memory, you're probably using it for the wrong thing.

Separate the proposer from the verifier. When the same agent does both, you get silent bugs. The fix isn't a smarter agent — it's structural. Either two agents with different roles, or one agent and one human, but never one agent doing both.

The first heuristic saved me from instructions silently disappearing. The second saved me from broken code shipping at 6 AM. Together they took the system from "works most of the time, occasionally breaks in confusing ways" to "works reliably, and when it breaks, I can tell why fast."

The earlier post argued that the best AI architecture is the one that knows where to stop. This post is the same thesis at a finer grain: it's not just about where the system stops, but about who's doing what at every step. Reliability comes from giving each component fewer responsibilities, not more capability.