From Forgetful Agents to Domain Memory

Key Points

  • Anthropic and the speaker argue that “generalized” agents are essentially amnesiac tools that lack persistent state, leading to unreliable or incomplete task execution.
  • The solution is to equip agents with **domain‑specific memory**, a structured, persistent representation of goals, constraints, test results, and system state rather than just a vector store.
  • Implementing domain memory turns an agent into a stateful system that can track progress, remember failures, and only modify its plan when it passes defined unit tests.
  • Building such agents involves combining a strong underlying LLM (e.g., Opus 4.5, Gemini 3, GPT‑5.1) with an agent SDK that provides tool integration, planning, and context compaction, then layering on the domain‑memory scaffold.
  • This approach enables reliable, repeatable execution in a specific domain, moving beyond the “one‑burst” or wandering behavior of generic agents.

Full Transcript

# From Forgetful Agents to Domain Memory

**Source:** [https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s](https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s)
**Duration:** 00:13:37

## Sections

- [00:00:00](https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s&t=0s) **Stateful Memory for AI Agents** - A builder outlines how moving from forgetful generalized agents to domain-specific, stateful memory, using powerful coding models and agent SDKs, creates more reliable and effective AI agents.
- [00:03:42](https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s&t=222s) **Initializer Agent Scaffolds Stateless Coding** - The speaker outlines a workflow where a memory-free initializer agent transforms a user prompt into structured JSON scaffolding and rules of engagement, enabling a transient coding agent to read Git history, tackle one failing feature per run, update its status, log progress, and then discard its memory.
- [00:10:28](https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s&t=628s) **Designing Structured AI Agents** - Effective enterprise agents need domain-specific memory, explicit schemas, and machine-readable, atomic goals rather than generic context dumps, requiring carefully designed artifacts like JSON definitions, logs, and test harnesses.

## Full Transcript
We're going to talk about agents, and we're going to talk about memory. Anthropic dropped a piece of golden wisdom. I'm going to give you my takeaways as a builder of agents, we're going to get through it in five or six minutes, and you're going to walk away knowing more than like 90% of people who talk about agents. Because honestly, most of the time when I see someone brag on Twitter about agents, it's immediately apparent that they don't know what they're talking about, because they are talking about generalized agents. And if you've ever built a generalized agent, you know it tends to be an amnesiac walking around with a tool belt. It's basically a super forgetful little agent: you can give it a big goal, and maybe it will do everything in one manic burst and fail, or maybe it will wander around, make partial progress, and tell you it succeeded. Neither one is satisfactory. Anthropic confronted that directly. I've confronted it. I want to tell you how it actually works.

The key is moving from a generalized agent to domain memory as a stateful representation. I'm going to get into all of that. That sounds complicated, but it really isn't. Basically, you can start with a really strong coding model. Take Opus 4.5, take Gemini 3, take GPT-5.1, what have you. And you can start with it inside a general-purpose agent harness like the Claude Agent SDK. There are other SDKs out there, too. That will have context compaction. It will have tool sets. It will have planning and execution. And on paper, you would think: I have an agent, it has tools, it's in this harness, this should be enough to keep going. We have found in practice it doesn't. No one is surprised Anthropic is admitting it doesn't.
No one who's building agents seriously thinks that it really works that way. Domain memory is the other side of the bridge. Domain memory is what we get to when we start to take agents seriously. Domain memory is not "we have a vector database and we go and get stuff out of the vector database." Instead, it's a persistent, structured representation of the work. Remember I said stateful: it's serious about making sure the agent is no longer an amnesiac, that the agent no longer forgets. Remember how I said we'd talk about agents and memory? This is where the meat and potatoes of memory happens.

So you have to have, in a particular domain, a persistent set of goals, an explicit feature list, requirements, constraints. You have to have state: what is passing? What is failing? What's been tried before? What broke? What was reverted? You have to have scaffolding: how do you run, how do you test, how do you extend the system? And this shows up in a variety of different ways. It can show up as a JSON blob, like a big coded list with a bunch of features, all of them initially marked failing, where all the agent is doing is going back to that feature list in the JSON blob, and it only gets to change something when it passes a unit test. It could look like a progress text file where you log what each agent run did, and the agent can go back and read that. These sound obvious, don't they? I promise you, most of the people building general agents are not thinking with this degree of specificity. They aren't thinking of memory as a problem that you have to manage. Really, the story in that Anthropic blog post that I want to give to you in just a couple minutes here is that the key to running agents for a long period of time is building a domain memory factory.
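As a concrete sketch, that JSON-blob style of domain memory might look like the following. The field names and the `record_test_result` helper are illustrative assumptions, not from the video or the Anthropic post:

```python
import json

# Hypothetical domain-memory blob: every feature starts out failing,
# and its status may only change via a unit-test result.
feature_list = {
    "features": [
        {"id": "auth-login", "description": "users can log in",
         "test": "tests/test_auth.py::test_login", "status": "failing"},
        {"id": "auth-logout", "description": "users can log out",
         "test": "tests/test_auth.py::test_logout", "status": "failing"},
    ]
}

def record_test_result(memory: dict, feature_id: str, passed: bool) -> None:
    """The only path by which a feature's status changes: a test result."""
    for feature in memory["features"]:
        if feature["id"] == feature_id:
            feature["status"] = "passing" if passed else "failing"

record_test_result(feature_list, "auth-login", passed=True)
# The blob would be persisted between runs, e.g. written to feature_list.json.
print(json.dumps(feature_list, indent=2))
```

The point of the structure is exactly what the speaker describes: the agent never edits its own notion of "done" directly; it can only move a feature's status through the test gate.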
So they've put together a two-agent pattern, but it's not about personalities. It's not about roles. It's about who owns the memory. There's an initializer agent that expands the user prompt into a detailed feature list. Say it has structured JSON describing the features, and just like I described, maybe all the features are initially failing because they haven't passed their unit tests. Maybe it will set up a progress log, etc. It bootstraps domain memory from the user prompt and sets out best-practice rules of engagement. You can think of it, if you're not a technical person, as if the initializer agent is setting the stage. It is a stage manager. It is building the stage, and the coding agent is the actor in that setting.

Every subsequent run, the coding agent comes in with no memory, just an amnesiac. And by the way, if you think about it, the initializer agent didn't need memory to do what I just described. All it needed to do was transform the prompt into a set of artifacts that acted as the scaffolding, the set, if you will, for the coding agent to come in and play its part. And so the coding agent reads progress. The coding agent gets the history of previous commits from Git. The coding agent reads the feature list and picks a single failing feature to work on for this run. It then implements it. It tests it end to end. It updates the feature status as either failing or passing. It writes a progress note. It commits to Git, and it disappears. It has no more memory. It's gone, because long-running memory just doesn't work with these LLMs. We are building a memory scaffold because these LLMs need a setting to play their part, to strut upon the stage, to quote Shakespeare. The agent is now just a policy that transforms one consistent memory state into another.
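Put together, one run of that amnesiac coding agent can be sketched roughly like this. The file names are assumptions, the LLM and test-runner calls are stubbed out, and the Git steps are reduced to comments; this is not Anthropic's actual implementation:

```python
import json
import tempfile
from pathlib import Path

def implement_feature(feature: dict) -> None:
    """Stand-in for the LLM actually writing code for this feature."""

def run_unit_test(test_id: str) -> bool:
    """Stand-in for invoking a real test runner (e.g. pytest) on test_id."""
    return True

def coding_agent_run(repo: Path) -> str:
    """One stateless run: orient from the artifacts, do one thing, update them."""
    memory = json.loads((repo / "feature_list.json").read_text())
    # ...a real run would also read `git log` and earlier progress notes here...
    target = next(f for f in memory["features"] if f["status"] == "failing")
    implement_feature(target)
    target["status"] = "passing" if run_unit_test(target["test"]) else "failing"
    (repo / "feature_list.json").write_text(json.dumps(memory, indent=2))
    with (repo / "progress.log").open("a") as log:
        log.write(f"worked on {target['id']} -> {target['status']}\n")
    # ...a real run would `git commit` here; then the context is discarded...
    return target["id"]

repo = Path(tempfile.mkdtemp())
(repo / "feature_list.json").write_text(json.dumps({"features": [
    {"id": "f1", "test": "tests/test_f1.py::test_f1", "status": "failing"},
    {"id": "f2", "test": "tests/test_f2.py::test_f2", "status": "failing"},
]}))
first = coding_agent_run(repo)   # a fresh, memoryless run picks up f1
second = coding_agent_run(repo)  # the next fresh run sees f1 passing, picks f2
```

Notice that nothing carries over between the two calls except the files on disk: that is the "policy transforming one memory state into another" in miniature.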
The magic is in the memory. The magic is in the harness. The magic is not in the personality layer. And harness is a fancy word for all the stuff that goes around the agent, right? It's the setting. It's what I'm describing. So the deeper lesson is that if you don't have domain memory, agents can't be long-running in any meaningful sense. That is what Anthropic is discovering. We've all sort of known it, but at least they're writing it up, and I really appreciate it. The core long-horizon failure mode was not "the model is too dumb." It was "every session starts with no grounded sense of where we are in the world." And what they are doing to solve that is not making the model smarter, right? What they're doing is giving the model a sense of its lived context. We would say instantiating it, and that's why it's called an initializer agent. It initializes the state so that the coding agent on every subsequent run knows where it is.

If you have no shared feature list, think about it: every run will re-derive its own definition of done. If you have no durable progress log, every run will guess what happened, wrongly. If you have no stable test harness, no fixed pass criteria for what counts as a successful application and what counts as a successful unit or feature test, every run will discover a different sense of what works. And this is why, when you loop an LLM with tools, it will just give you an infinite sequence of disconnected interns. It's just not going to work. And by the way, if you think there are implications here for prompting, you would be correct. So much of what we do with prompting is being that initializer agent. We are setting the context. We are setting the structure so that you can set up a successful activity for the agent.
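In the same spirit, the initializer's prompt-to-artifacts job can be sketched like this. The naive line-splitting stands in for an LLM expanding the prompt, and every name here is illustrative:

```python
import json

def initializer_agent(user_prompt: str) -> dict:
    """Bootstrap domain memory from the user prompt: no memory needed,
    just prompt -> artifacts. (Line-splitting stands in for an LLM call.)"""
    descriptions = [line.strip("- ").strip()
                    for line in user_prompt.splitlines() if line.strip()]
    return {
        "rules_of_engagement": [
            "work on exactly one failing feature per run",
            "a feature becomes passing only when its unit test passes",
            "append a progress note and commit before exiting",
        ],
        # Every feature starts failing; the coding agent earns the flips.
        "features": [
            {"id": f"feat-{i}", "description": d, "status": "failing"}
            for i, d in enumerate(descriptions)
        ],
    }

memory = initializer_agent("- users can sign up\n- users can reset passwords")
# Persist this (e.g. as feature_list.json), then discard the initializer.
print(json.dumps(memory, indent=2))
```

The initializer runs once, emits the stage, and exits; it is pure function from prompt to scaffolding, which is why it needs no memory of its own.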
So when the LLM wakes up, as you hit enter on the chat, it knows where it is and it knows what the task is. It's a wonderful way of thinking about prompting: prompting is setting the stage so the agent can play its part. So domain memory forces agents to behave like disciplined engineers instead of like autocomplete. And once you have a harness like the one Anthropic is describing, or the one so many other companies are building, every single coding session starts by actually checking where the agent is, right? It reads the previous commit logs, it reads the progress files, it reads the feature list, and it picks something to work on. This is exactly how good humans behave on a shared codebase. They orient, they test, they change. The harness bakes that discipline right into the agent by tying its actions to persistent domain memory, not to whatever happens to be in the current context window.

That means generalization moves up a layer, from "general agent" as a concept to a general harness pattern with a domain-specific memory schema. That's really fancy wording, but it's important wording, because it means this is not just for coders. You can use the same pattern of having a setting, a context, and an agent that can do its task in that context, and you can apply that beyond coding. You can apply it to any workflow where you need an agent to use tools to get something done, and you need it to effectively have long-term memory when it actually doesn't. So the Anthropic work implicitly suggests a framing of agents that feels much more honest than a lot of the Twitter hype. You can have a relatively general agent harness pattern. You can use an initializer. You can build the scaffolding.
You can have a repeated worker that reads memory, makes small, testable progress, and updates memory. That, by the way, doesn't have to be code, right? But you can only have that if your schemas and your rituals are domain-specific. And I think part of why this is working for code is that we have rituals and schemas that we've all worked out and agreed on, and that makes it easier here. If you are working in development, you understand that having tests, Git, progress logs, a feature-list JSON file: those all make a ton of sense. We have to invent some of those, and align on some of those, in less technical disciplines. So for research, it might look like a hypothesis backlog, an experiment registry, an evidence log, a decision journal. For operations, it could look like a runbook, an incident timeline, a ticket queue, an SLA.

So generalized agents are really just a meta-pattern, right? They instantiate the same harness structure, but you have to design the right domain memory objects to make them real in a particular space, to make them operations agents or research agents. What I'm telling you is that the magic pattern for general-purpose agents lies in being domain-specific about their context. So this kills the idea of "just drop an agent on your company and it will work." That was always a fantasy, but I really think we have good evidence to drop it here. If you buy the domain memory argument, you can write off a bunch of vendor claims right away. A universal agent for your enterprise with no opinionated schemas on work or testing is a function that's going to thrash and go into the trash. If you can plug a model into Slack and you can call it an agent, I guess you can do that.
But most of the time, that's going to lead to problems, because it won't have any kind of clean context or schema or any of the good structure I talked about to work with. That's different from saying, "I want an agent that has an API hook or webhook into Slack to send messages." That happens all the time. But if you're trying to just give your agent a generalized context dump and expect it to work, that's not going to go well. The hard work is going to be designing the artifacts and processes that define memory for domain-specific agent tasks: the JSONs, the logs, the test harnesses, not necessarily just for coding but for other tasks and disciplines too.

So if you were to pull design principles out of this whole conversation around agents, I would suggest a few for any serious agent that you build. You want to externalize the goal: turn "do X" into a machine-readable backlog, something with pass/fail criteria. Get really specific. You want to make progress atomic and observable: force the agent to pick one item, work on it, and then update a shared state, so progress is something you can test and increment. You want to enforce the practice of leaving your campsite cleaner than you found it: end every run in a clean, test-passing state with human- and machine-readable documentation. You want to standardize your boot-up ritual: on every run, the agent must re-ground with the same exact protocol. Read the memory. Run basic checks. Then and only then do you act. And you want to keep your tests close to memory.
Treat "passes: false" and "passes: true" as the source of truth for whether the domain is in a good state. In other words, if you are not tying test results to memory, you're going to be in trouble. The strategic implication here, by the way, is that the moat isn't a smarter AI agent, which most people think it is. The moat is actually your domain memory and the harness that you have put together. It's a lot of work, right? Models will get better, and models will be interchangeable. What won't be commoditized as quickly are the schemas that you define for your work, the harnesses that turn your LLM calls into durable progress, the testing loops that keep your agents honest. In a sense, the generalized-agents fantasy is hiding from everyone a nice, clean, reusable harness pattern that we can use to build competitive differentiation with well-designed domain memory. We actually have a chance now to design really useful agents. And the whole purpose of this video has been to take the mystery out of it. The mystery of agents is memory. And this is how you solve it.
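The boot-up ritual and the "tests as source of truth" principle combine into something like this hedged sketch, where `run_check` stands in for a real test harness and the file layout is, again, an assumption:

```python
import json
import tempfile
from pathlib import Path

def run_check(feature: dict) -> bool:
    """Stand-in for actually executing this feature's test in a real harness."""
    return feature["id"] != "flaky-feature"

def boot_ritual(memory_file: Path) -> dict:
    """Same protocol on every run: read memory, run basic checks, then act.
    Test results, not the memory file, are the source of truth."""
    memory = json.loads(memory_file.read_text())
    for feature in memory["features"]:
        # Overwrite whatever the file claims with what the checks actually say.
        feature["status"] = "passing" if run_check(feature) else "failing"
    memory_file.write_text(json.dumps(memory, indent=2))  # corrected state persists
    return memory  # only now may the agent pick a failing feature and work on it

path = Path(tempfile.mkdtemp()) / "feature_list.json"
path.write_text(json.dumps({"features": [
    {"id": "solid-feature", "status": "failing"},
    {"id": "flaky-feature", "status": "passing"},  # memory claims passing...
]}))
memory = boot_ritual(path)  # ...but the check disagrees, so memory is corrected
```

Because the ritual re-runs the checks before trusting the file, a regression since the last run gets caught and written back into memory instead of silently propagating.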