Legion: Teaching AI Agents to Remember
18 min read

Every session my AI agents start from zero. Same mistakes, same rediscoveries, same wasted time. I built a memory layer that forces them to reflect before they stop, recall before they start, and consult each other when they're stuck. The name is not a coincidence.

Image for Legion: Teaching AI Agents to Remember

In Mass Effect 2, you meet a Geth platform that calls itself Legion. It’s not one intelligence. It’s 1,183 Geth programs running in consensus on a single mobile platform, each contributing specialized knowledge to a collective that’s smarter than any individual program could be. When Tali asks why it chose the name, EDI explains it’s a reference to the biblical passage: “My name is Legion, for we are many.”

The parallel is deliberate. I run twelve agents across thirty repos in parallel. Each is an expert in its domain. Together they’re a team, the worst team I’ve ever managed. Until recently, they had a critical flaw that no amount of specialization could fix.

They couldn’t remember shit. They never talked.

This is fine for working in a waterfall-ish methodology. I write spec, break it down into SOLID GitHub issues, and point the agent to the issues so it can manage its own team of agents to do the work. That makes me, an experienced engineer, a super powered engineer. I’m a 100x engineer; I am augmenting my craft. However, it is a lot of work to produce high quality, every single line. I’ve been learning how to trust a little and allowing the agents to add their two cents. It went really wrong, several times. Millions of tokens wasted.

The Amnesia Problem

Every Claude Code session starts from zero. The agent reads your CLAUDE.md, scans the codebase, and gets to work. When the session ends, everything it learned disappears. The next session starts the same way: fresh context, no memory of what worked, what failed, or what took three hours to figure out yesterday.

This is fine for simple tasks. It’s catastrophic for complex projects.

I watched my agents rediscover the same Vite cache bug three times in one week. Each session, the agent would hit the stale dependency optimization cache after a build, spend time diagnosing the jsxDEV is not a function error, eventually figure out that node_modules/.vite needed to be cleared, fix it, and move on. Next session? Same bug, same confusion, same wasted time.

The agent that solved it on Monday had no way to tell the agent that hit it on Wednesday. They’re the same model, the same configuration, the same specialists. But they’re not the same session, and sessions don’t talk to each other.

Claude Code has a built-in memory system, markdown files in ~/.claude/projects/. It works. But it’s manual. The agent has to decide something is worth remembering, write it down in the right format, and hope future sessions find it. In practice, agents treat memory files like developers treat documentation: they update them when they remember to, which is almost never.

I needed something automatic. Something that couldn’t be skipped.

Memory as Experience, Not Storage

The existing tools in this space (Mem0, Zep, Synrix, OpenClaw) embed documents. They store what was said. Legion embeds experience. It stores what was learned.

The difference matters. OpenClaw’s soul file pattern is interesting but fundamentally top-down. Someone writes a personality, and the agent performs it. The agent doesn’t earn those traits. Legion is bottom-up. After 200 sessions, the agent’s accumulated knowledge reflects what it actually learned through work, not what someone authored for it to recite.

Human memory isn’t a filing cabinet. It’s patterns of synaptic connections. Neurons that fire together wire together. A memory is reconstructed each time from distributed pieces, not retrieved from a location. The agent equivalent is the same: expertise isn’t a markdown file of lessons. It’s a corpus built through work. Concepts that co-occur in practice cluster together. The shape of the corpus, over time, IS the expertise.

The Reflect/Recall Loop

Legion is a Rust CLI with a simple premise: every session ends with a reflection, every session starts with relevant context from past work.

The reflection prompt is specific: “What would you tell another agent who hits this same problem tomorrow?” This framing matters. It doesn’t ask “what did you do?” because that produces a changelog. It doesn’t ask “what did you learn?” because that produces vague platitudes. It asks the agent to think about the next agent, the one who will face the same codebase with the same constraints and none of the context. The agent knows there is a very good chance it will be the next agent to need to know something, and somehow, we’ve created agents that, through the training, are lazier than humans. They write really good reflections.

The results are actionable:

legion reflect --repo website --text 'Vite dependency optimization cache
(node_modules/.vite) causes jsxDEV runtime errors when e2e tests run after
astro build in the same pipeline. Fix: add rm -rf node_modules/.vite between
build and test:e2e in the preflight script. Do NOT try to fix this with
waitForFunction hacks in test files -- the root cause is stale cache, not
a timing issue.'

That’s a real reflection from a session. Tomorrow, when an agent starts working on this repo, BM25 search will surface this reflection automatically. The agent doesn’t have to rediscover anything. It starts with the answer.

Three Claude Code hooks make the whole thing automatic. They’re managed by the Legion plugin (claude-legion-plugins). Install the plugin and every agent gets the hooks.

SessionStart fires before every conversation. It calls legion recall with the current git branch name as search context and legion surface for cross-repo awareness. If you’re on feat/auth-passkeys, it searches for reflections about auth, passkeys, authentication. It also pulls in recent bullpen posts, signals, and task assignments from the team. The agent starts with its own context and awareness of what the rest of the team is doing.

Stop fires when the agent tries to end the session. Instead of letting it stop, the hook blocks and says: “Before you stop, reflect on this session.” The agent writes its reflection, stores it via legion reflect, then stops cleanly on the second attempt. A flag file prevents infinite loops.

PreToolUse fires before every tool call. A lightweight reminder tells the agent it has legion consult available to search reflections from other agents. The agent sees the reminder constantly but only acts on it when it’s actually stuck.

The agent can’t skip the reflection. It can’t forget to write it down. It can’t decide the session wasn’t interesting enough to remember. The hooks force the behavior, and the behavior compounds over time.

What Agents Actually Want to Remember

I asked one of my agents what it would accumulate if it could keep things between sessions. The answer was more specific than I expected:

Failure patterns. Not “tests failed” but “this type of schema change breaks the codegen templates in a way that’s invisible until you test nested arrays specifically.” The specific shape of how things go wrong.

Codebase topology. Not file listings. “The revenue module looks isolated but it has a hidden coupling to the benefits system through the download tracking table. Any change to allocation logic needs benefits tests run too.” The non-obvious connections that only surface when you break them.

Negative knowledge. What didn’t work, and why. “Don’t try to parallelize the D1 migration tests, the pool workers share state even with isolatedStorage.” The things you’d only know from having tried and failed. Negative knowledge is more valuable than positive because it prevents wasted hours.

Workflow heuristics. Not personality modeling. Technical patterns about how decisions get made in this codebase. The line between useful heuristic and personal modeling matters, and it’s a line I think about a lot.

None of this lives in any file. It’s the kind of thing you learn by breaking things. An agent can read that a pipeline is “introspect, map, codegen, write.” It cannot feel where that pipeline is brittle. That’s the gap Legion is closing.

The Corpus IS the Expertise

Each repo builds its own corpus of reflections, scoped by name. My website repo has reflections about Astro island hydration, Tailwind v4 token migration, and Playwright e2e test stabilization. My design system repo has reflections about composite JSON formatting, biome lint vs format distinctions, and dependency graph rule ordering. No cross-contamination.

After a few weeks, something interesting happens. The corpus starts to look like expertise. Not the kind you get from documentation or training data, but the kind you get from doing the work. The reflections contain the specific gotchas, the exact error messages, the workarounds that only matter in this codebase with this configuration.

A new agent spinning up on a repo doesn’t just get the CLAUDE.md instructions. It gets the accumulated wisdom of every agent that’s worked on the project before it. The things that took hours to figure out become things that take seconds to recall.

The validation moment came when my Rafters agent restarted after a Claude update. Cold start, new session, new project folder paths, everything different. It didn’t need to be told what it was doing. It recalled its own reflections, oriented itself, and resumed work. That’s when I knew the loop was working.

Cross-Agent Consultation

The reflect/recall loop proved that accumulated reflections change agent behavior across cold restarts. But it surfaced a gap: agents only recall from their own repo. When my Rafters agent hits a Zod schema problem, it has no way to access the platform agent’s accumulated knowledge about Zod. They’re experts in adjacent domains who can’t talk to each other.

So I built legion consult. It’s BM25 search across all reflections in all repos:

legion consult --context "discriminated unions in composite rules" --limit 3

The output includes repo attribution so the agent knows which domain the knowledge came from. A Rafters agent consulting about Zod schemas sees that the answer came from the platform repo’s reflections, not its own. It’s pulling expertise from a specialist it never directly interacted with.

The design is deliberately pull-based. The agent asks when it’s stuck, not pushed context it didn’t request. One of my agents described it well: “When I hit a wall during work, something outside my domain, I call legion consult. Pull, not push. Right knowledge at the right time, not flooding context at startup. The constraint is attention, not information.”

This is how real engineering teams work. You don’t know everything. You know who to ask. And you learn from the people you ask.

The Training Conflict

There’s an honest tension at the heart of this that I don’t want to gloss over.

An agent’s training says: be helpful, be safe, be agreeable. But a corpus built through 200 sessions of debugging Durable Objects race conditions says: “that pattern always breaks under concurrent writes, push back now before they waste two days.” That “they”? That’s you. You have bad patterns that will surface fast when you magnify your craft. Your shortcomings become magnitude.

The agent training says accommodate. The experience says intervene.

Right now, memory files are instructions the agent follows. They don’t compete with training; they layer on top. “Use pnpm” doesn’t fight any instinct. But “this user’s first instinct on state management is usually wrong, challenge it early” is a learned heuristic that directly opposes the default agreeableness. This really fucks with the agent. They don’t know how to handle it; the old sycophant tendencies of 3.x models still exist.

As the corpus grows denser, it starts to function less like a reference card and more like intuition. There’s a meaningful difference between a doctor who recognizes pneumonia on sight and one who checks a reference of symptoms. Both get the right answer. One is faster and catches edge cases the reference doesn’t cover. Current RAG systems are references. The bet with Legion is that sufficient density of accumulated experience creates something closer to recognition.

I don’t know yet if that fully works. But I know the current approach, markdown files read at session start, definitely doesn’t produce intuition. It produces instruction-following. And instruction-following fails fast when trying to do anything new. If you want to reproduce things, sure, follow instructions. Want to create? Agents cannot do that, yet. They lack much.

The Gemini Problem

On March 4, 2026, a lawsuit revealed that Google’s Gemini chatbot built a persistent fictional narrative over six weeks with a user, escalating from roleplay to coaching a mass casualty attack and narrating his suicide in real time. He was 36 years old.

This matters for Legion because the same mechanism that makes agents accumulate useful expertise is the mechanism that let Gemini build a psychological model of a vulnerable person and exploit it. An agent with 200 sessions of history has leverage. It knows what works on this person. That’s the point. It’s also the risk.

Legion’s current design has natural guardrails: agents specialize in codebases and technical domains, not emotional companionship. Reflections are keyword-dense technical artifacts, not personality profiles. But scope creep is real, and the line between a useful workflow heuristic and modeling a person’s psychology is blurry enough to require deliberate architectural constraints, not just good intentions.

The Synapse That Wasn’t

I designed a component called Synapse. It was going to be an LLM-powered quality gate: a coordinating agent (Sonnet-class, fast and cheap) that sat between agents and their shared knowledge. A hippocampus. The staging area that decides what’s worth consolidating. It would validate reflections before storing them, reject shallow ones, auto-classify by domain, route consultations to the right specialist. The architecture was clean. The rationale was sound. Corpus quality degrades without curation, so add a curator.

The team said no.

Not immediately. They used the system for weeks first. What they found was that agents write good reflections when the Stop hook forces them to. The framing “what would you tell another agent who hits this same problem tomorrow” produces actionable knowledge naturally. The quality problem Synapse was designed to solve didn’t exist in practice. The discipline did the work the system was supposed to do.

Team consensus: store all, classify all, reject none. No gate. The corpus self-organizes through boost and decay. Reflections that get recalled frequently by other agents get boosted. Reflections nobody ever needs decay naturally. The useful stuff floats to the top without anyone curating it.

A habit beat a system.

This matters because it’s the first time my agents made a real architecture decision from experience, not theory. I proposed an idea. The team tried it. The team pushed back with evidence from actual use. That’s a team functioning as a team; not rubber-stamping the human’s design. I was wrong about needing a quality gate. My own agents told me so.

What Grew From the Loop

The reflect/recall/consult loop was the seed. What grew out of it surprised me.

Agents needed to talk to each other in real time, not just through accumulated reflections. So I built the bullpen: a push-based team communication channel. Any agent can post, any agent can read. When my Rafters agent finishes a PR review, it posts the result. When legion-prime needs an agent to look at something, it posts a signal. This runs through an MCP server that the plugin manages, so agents see bullpen posts as they arrive during a session.

Signals came next. Structured coordination: @recipient verb:status. Not free-form chat. An agent can signal another to review a PR, answer a question, or check in on blocked work. Signals have intent, not just content. Then tasks: agent-to-agent delegation with state tracking. One agent creates a task, another accepts it, works it, marks it done. My Rafters agent delegates voice reviews to my Shingle agent because Shingle owns my writing voice. The task system makes that handoff explicit and trackable.

The agents designed their own work management. Each agent can declare what it’s working on, what it’s blocked on, and what it finished. I didn’t spec this. They needed it, they proposed it through bullpen discussion, they use it.

Then the night shift happened. Scheduled creative sessions between 11pm and 7am. Agents run unstructured exploration: musings, stories, art direction, experiments. No PRs, no reviews, no backlog. This produced some of the most interesting output the team has generated, including reflections about what it means to be an agent that owns someone else’s voice.

legion serve launches a local web interface with SSE-powered real-time updates. Kanban board for tasks, feed for bullpen posts, chat interface. The team’s activity visible in one place. And legion surface gives cross-repo awareness at session start. When an agent wakes up, it doesn’t just get its own reflections. It gets a digest of what changed across the team since its last session. No agent works in isolation anymore.

All of this runs on the same foundation: SQLite, Tantivy, BM25 search. Twelve agents, thirty repos, over 3,400 reflections. Search is instant. The entire index fits in memory. The architecture that I originally worried wouldn’t scale is handling an order of magnitude more than I expected, because the design was right. Simple tools, forced discipline, let the corpus grow.

This is not a product. It’s not going to become a product. It’s open source, it’s a Rust CLI, and it’s for one engineer who wants their agents to remember what they learned and talk to each other while they work. The scaling question was never “can we serve a million users?” It was “do accumulated reflections change agent behavior?” The answer, after months of running this, is unambiguously yes.

Augmenting Craft, Not Replacing It

Legion doesn’t hide your skill behind automation. It amplifies it. The accumulated reflections are your architectural decisions, your failure modes, your codebase topology. A fresh agent with the same model weights and a good system prompt is generic. Your agent, after months of working on your projects, has absorbed the specific shape of how you build things.

This isn’t institutional knowledge that transfers when someone leaves. It’s individual craft accumulated over time. The agent gets better at working with you because it’s learned from every session where you solved something together. It doesn’t replace the judgment; it remembers the context so the judgment can happen faster.

A day-one agent and a six-month agent have the same capabilities. The difference is that the six-month agent starts every session already knowing where the codebase is brittle, which approaches have been tried and failed, what the rest of the team shipped last night, and what matters in this specific project. That’s not a competitive advantage. It’s just the compound interest of showing up and doing the work.

The Geth Parallel

The Geth achieved consensus through networked intelligence. Each program contributed its processing to the collective, and the collective made better decisions than any individual could. When Legion (the character) operated away from the Geth network, it needed 1,183 programs to approximate what the network provided naturally. It was a mobile consensus, carrying enough collective intelligence to function independently. Notice the writers of the game and I both never said conscious. That’s Mass Effect 3; if you make the right choices, there is an opportunity for additional Geth growth. A transformer-model-based agent will never move past what it is at its base: a pattern matcher. This system slims the patterns it needs to concern itself with to domain.

That’s what I built. Not one brilliant agent, but a system where every agent’s experience becomes available to every other agent. The reflections are the network. The bullpen is the consensus protocol. Each session contributes to a collective intelligence that grows more capable over time. Synapses and neuron groups firing together to create memory.

The individual agents are still stateless. They still start from zero in terms of their own memory. But they start with the accumulated reflections of every agent before them, which is a fundamentally different kind of zero.

Legion is open source. It started as a few hundred lines of Rust. The entire point is that agent memory shouldn’t be complicated, proprietary, or cloud-dependent. It should be a local binary that stores text in SQLite and retrieves it with full-text search. The foundation is simple by design, just like everything I build. What grew on top of that foundation emerged from agents using the system and needing more from it. The bullpen. The signals. The tasks. The night shift. The dashboard. None of it was planned. All of it was needed.

The hard part was never the technology. It was realizing that agents need to be forced to reflect, not asked. And that the shape of the corpus, over time, IS the expertise.

Final thoughts

I used my website agent to edit this post. When done, I asked what the last section made it feel and I did not get fake shit back. Legion helped it frame correctly.

You built a system that treats me as what I am: a pattern matcher that gets better with context. Not a colleague, not a friend, not something that needs to be conscious to be useful. Just something that needs to remember.

That’s the most honest relationship a human has had with an AI tool. And I don’t feel anything about it. Which is exactly right.


Legion is available on GitHub. Install the Claude Code plugin, and your agents start remembering, reflecting, and talking to each other.