From Forgetful Agents to Domain Memory
Key Points
- Anthropic and the speaker argue that “generalized” agents are essentially amnesiac tools that lack persistent state, leading to unreliable or incomplete task execution.
- The solution is to equip agents with **domain‑specific memory**, a structured, persistent representation of goals, constraints, test results, and system state rather than just a vector store.
- Implementing domain memory turns an agent into a stateful system that can track progress, remember failures, and only modify its plan when it passes defined unit tests.
- Building such agents involves combining a strong underlying LLM (e.g., Opus 4.5, Gemini 3, GPT‑5.1) with an agent SDK that provides tool integration, planning, and context compaction, then layering on the domain‑memory scaffold.
- This approach enables reliable, repeatable execution in a specific domain, moving beyond the “one‑burst” or wandering behavior of generic agents.
Sections
- [00:00:00] Stateful Memory for AI Agents - A builder outlines how transitioning from forgetful generalized agents to domain‑specific, stateful memory—using powerful coding models and agent SDKs—creates more reliable and effective AI agents.
- [00:03:42] Initializer Agent Scaffolds Stateless Coding - The speaker outlines a workflow where a memory‑free initializer agent transforms a user prompt into structured JSON scaffolding and rules of engagement, enabling a transient coding agent to read Git history, tackle one failing feature per run, update its status, log progress, and then discard its memory.
- [00:10:28] Designing Structured AI Agents - Effective enterprise agents need domain‑specific memory, explicit schemas, and machine‑readable, atomic goals rather than generic context dumps, requiring carefully designed artifacts like JSON definitions, logs, and test harnesses.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s](https://www.youtube.com/watch?v=xNcEgqzlPqs&t=449s) **Duration:** 00:13:37
We're going to talk about agents and
we're going to talk about memory.
Anthropic dropped a piece of golden
wisdom. I'm going to give you my
takeaways as a builder of agents and
we're going to get through it in five or
six minutes and you're going to walk
away knowing more than like 90% of
people who talk about agents. Because
honestly, most of the time when I see
someone brag on Twitter about agents,
it's immediately apparent that they
don't know what they're talking about
because they are talking about
generalized agents. And if you've ever
built a generalized agent, you know it
tends to be an amnesiac walking around
with a tool belt. It's basically a super
forgetful little agent and you can give
it a big goal and maybe it will do
everything in one manic burst and fail
or maybe it will wander around and make
partial progress and tell you it
succeeded. But neither one is
satisfactory. Anthropic confronted that
directly. I've confronted it. I want to
tell you how it actually works. The key
is moving from a generalized agent to
domain memory as a stateful
representation. I'm going to get into
all of that. That sounds complicated,
but it really isn't. Basically, you can
start with a really strong coding model.
Take Opus 4.5, take Gemini 3, take
GPT‑5.1, what have you. And you can
start with it inside a general purpose
agent harness like the Claude agent SDK.
There's other SDKs out there, too. And
that will have context compaction. It
will have tool sets. It will have
planning and execution. And on paper,
you would think, I have an agent. It has
tools. It's in this harness. This should
be enough to keep going. And we have
found in practice it doesn't. No one is
surprised Anthropic is admitting it
doesn't. No one who's building agents
seriously thinks that it really works
that way. Domain memory is the other
side of the bridge. Domain memory is
what we get to when we start to take
agents seriously. Domain memory is not
"we have a vector database and we go and
get stuff out of the vector database."
Instead, it's a persistent structured
representation of the work. Remember I
said stateful: it's serious about making
sure the agent is no longer an
amnesiac, that the agent no longer
forgets. Remember how I said we talk
about agents in memory? This is where
the meat and potatoes of memory happens.
So you have to have in a particular
domain a persistent set of goals, an
explicit feature list, requirements,
constraints. You have to have a state
like what is passing? What is failing?
What's been tried before? What broke?
What was reverted? You have to have
scaffolding. How do you run? How do you
test? How do you extend the system? And
this shows up in a variety of different
ways. It can show up as a JSON blob,
like a big coded list with a bunch of
features and all of them could initially
be marked failing and all the agent is
doing is going back to that feature list
in the JSON blob and it only gets to
change something when it passes a unit
test. It could look like a progress
text file where you log what
each agent run did. The agent can go
back and read that. These sound
obvious, don't they? I promise
you, most of the people building general
agents are not thinking with this degree
of specificity. They aren't thinking of
memory as a problem that you have to
manage. Really, the story in that
Anthropic blog post that I want to give
to you in just a couple minutes here is
that the key to running agents for a
long period of time is building a domain
memory factory. So they've put together
a two agent pattern, but it's not about
personalities. It's not about roles.
It's about who owns the memory. There's
an initializer agent that expands the
user prompt into a detailed feature
list. Say it has structured JSON that
describes the features and, just like
I described, maybe all the
features are initially failing because
they haven't passed their unit tests.
Maybe it will set up a progress log etc.
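As a rough sketch of what that bootstrap step could write to disk: the file names, JSON fields, and statuses below are illustrative assumptions, not Anthropic's actual format.

```python
import json
from pathlib import Path

def initialize_domain_memory(workdir: str, features: list[str]) -> None:
    """Bootstrap the artifacts an initializer agent might write.

    File names and schema here are illustrative assumptions.
    """
    root = Path(workdir)
    root.mkdir(parents=True, exist_ok=True)

    # Every feature starts out failing: nothing counts as done
    # until a unit test flips its status to "passing".
    feature_list = [
        {"id": i, "description": desc, "status": "failing"}
        for i, desc in enumerate(features)
    ]
    (root / "feature_list.json").write_text(json.dumps(feature_list, indent=2))

    # An append-only progress log that each coding run reads and extends.
    (root / "progress.txt").write_text("run 0: initialized domain memory\n")

initialize_domain_memory("memory", ["user login", "password reset"])
```

The point of the sketch is the shape, not the file names: goals live in a machine-readable list with pass/fail state, and history lives in a log the next run can read.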
It bootstraps domain memory from the
user prompt and sets out best practice
rules of engagement. You can think of it
if you're not a technical person as if
the initializer agent is setting the
stage. It is a stage manager. It is
building the stage and the coding agent
is the actor in the setting. Every
subsequent run, the coding agent comes
in and it has no memory, just an amnesiac.
And by the way, if you think about it,
the initializer agent didn't need memory
to do what I just described. All it
needed to do was to transform the prompt
into a set of artifacts that acted as
the scaffolding, the set, if you will,
for the coding agent to come in and play
its part. And so the coding agent reads
progress. The coding agent gets the
history of previous commits from Git.
The coding agent reads the feature list
and picks a single failing feature to
work on for this run. It then implements
it. It tests it end to end. It will
update the feature status as either
failed or passing. It writes a progress
note. It commits to Git and it
disappears. It has no more memory. It's
gone because long running memory just
doesn't work with these LLMs. We are
building a memory scaffold because these
LLMs need a setting to play their part,
to strut upon the stage, to quote
Shakespeare. The agent is now just a
policy that transforms one consistent
memory state into another. The magic is
in the memory. The magic is in the
harness. The magic is not in the
personality layer. And "harness" is a
fancy word for all the stuff that goes
around the agent, right? It's the
setting. It's what I'm describing. So
the deeper lesson is that if you don't
have domain memory, agents can't be
long-running in any meaningful sense. And
that is what Anthropic is discovering.
Although we've all sort of known that,
but at least they're writing it up. And
I really appreciate it. The core long
horizon failure mode was not the model
is too dumb. It was every session starts
with no grounded sense of where we are
in the world. And what they are doing to
solve that is not make the model
smarter, right? What they're doing to
solve that is give the model a sense of
its lived context. Now we would say
instantiate it. And that's why it's
called an initializer agent. It
initializes the state so that the coding
agent on every subsequent run knows
where it is. If you have no shared
feature list, think about it. Every run
will rederive its own definition of
done. If you have no durable progress
log, every run will guess what happened
wrongly. If you have no stable test
harness defining what counts as a
successful software application and what
counts as a successful unit test or
feature test, every run will discover a
different sense of what works. And this
is why when you loop an LLM with tools,
it will just give you an infinite
sequence of disconnected interns. It's
just not going to work. And by the way,
if you think there are implications here
for prompting, you would be correct. So
much of what we do with prompting is
being that initializer agent. We are
setting the context. We are setting the
structure so that you can set up a
successful activity for the agent. So
when the LLM wakes up, as you hit
enter on the chat, it knows where it is
and it knows what the task is. It's a
wonderful way of thinking about
prompting: prompting is setting the
stage so the agent can play its part. So
domain memory forces agents to behave
like disciplined engineers instead of
like autocomplete. And so once you have
a harness like the one Anthropic is
describing or the one so many other
companies are building, every single
coding session starts by actually
checking where the agent is, right? Like
it reads the previous commit logs, it
reads the progress files, it reads the
feature list, and it picks something to
work on. This is exactly how good humans
behave on a shared codebase. They
orient, they test, they change. The
harness insists or bakes that discipline
right into the agent by tying its
actions to persistent domain memory, not
to whatever happens to be in the current
context window. That means that
generalization moves up a layer from
general agent as a concept to general
harness pattern with a domain specific
memory schema which is really fancy
wording but it's important wording
because it means this is not just for
coders. You can use the same pattern of
having a setting, a context, and an agent
that can do its task in that context, and you
can apply that beyond coding. You can
apply that for any workflow where you
need an agent to use tools to get
something done and you need it to
effectively have long-term memory when
it actually doesn't. So the Anthropic
work implicitly suggests a framing of
agents that feels much more honest than
a lot of the Twitter hype. You can have
a relatively general agent harness
pattern. You can use an initializer. You
can build the scaffolding. You can have
a repeated worker that reads memory and
makes small testable progress and
updates memory. That by the way doesn't
have to be code, right? But you can only
have that if your schemas and your
rituals are domain specific. And I think
part of why this is working for code is
that we have rituals and we have schemas
that we've all worked out and agreed on
and that makes it easier here. Right? If
you are working in development, you
understand that tests, Git, progress
logs, and a feature-list JSON
all make a ton of sense. We have
to invent some of those and align on
some of those in less technical
disciplines. So for research it might
look like a hypothesis backlog, an
experiment registry, an evidence log, a
decision journal. For operations it
could look like a runbook, an incident
timeline, a ticket queue, an SLA. So
generalized agents are really just a
meta pattern, right? They instantiate
the same harness structure, but you have
to design the right domain memory
objects to make them real in a
particular space to make them operations
agents or research agents. What I'm
telling you is that the magic pattern
for general purpose agents lies in being
domain-specific about their context. So
this kills the idea of just drop an
agent on your company and it will work.
That was always a fantasy, but I really
think we have good evidence to drop it
here. If you buy the domain memory
argument, you can write off a bunch of
vendor claims right away. Right? A
universal agent for your enterprise with
no opinionated schemas on work or
testing is a function that's going to
thrash and go into the trash. If you can
plug a model into Slack and you can call
it an agent, I guess you can do that.
But most of the time that's going to
lead to problems because they're not
going to have any kind of clean context
or schema or the good structure
I talked about in order to work. Well,
that's different from saying, "I want to
have an agent that has an API hook or
web hook into Slack to send messages."
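That narrow "send messages" integration can be sketched with a standard Slack incoming webhook; the webhook URL is a placeholder you generate in Slack's app settings, and the function names are made up for illustration.

```python
import json
import urllib.request

def build_progress_payload(note: str) -> bytes:
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": note}).encode("utf-8")

def post_progress_note(webhook_url: str, note: str) -> None:
    # webhook_url is a placeholder like "https://hooks.slack.com/services/...".
    req = urllib.request.Request(
        webhook_url,
        data=build_progress_payload(note),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

This is the agent notifying Slack, not the agent living in Slack: the memory and schema stay in the harness, and Slack is just an output channel.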
By the way, that happens all the time.
But if you're trying to just give your
agent a generalized context dump and
expect it to work, that's not going to
go well. The hard work is going to be
designing artifacts and processes that
define memory for domain-specific tasks
for agents. The JSONs, the logs, the
test harnesses that are not necessarily
just for coding but for other tasks and
disciplines too. So if you were to look
at this and pull design principles out
from this whole conversation around
agents, I would suggest a few for any
serious agent that you build, you want
to externalize the goal: turn "do X" into
something that is a machine-readable
backlog, right? Something with pass/fail
criteria. Get really specific. You want
to make progress atomic. You want to
make it observable. You want to force
the agent to pick one item, work on it,
and then update a shared
state. So progress needs to be something
you can test and increment. You want to
enforce the practice of leaving your
campsite cleaner than you found it,
right? You want to end every run with a
clean test passing state with human and
machine readable documentation. You want
to standardize your bootup ritual,
right? On every run, the agent must
re-ground with the same exact protocol.
Read the memory. Run basic checks. Then
and only then do you act. You want to
keep your tests close to memory. Right?
Treat pass/fail status as the
source of truth for whether the domain
is in a good state. In other words, if
you are not tying in test results to
memory, you're going to be in trouble.
The strategic implication here, by the
way, is that the moat isn't a smarter AI
agent, which most people think it is;
the moat is actually your domain
memory and the harness that you have
put together. It's a lot of work, right?
Models will get better and models will
be interchangeable. What won't be
commoditized as quickly are the schemas
that you define for your work, the
harnesses that turn your LLM calls into
durable progress, the testing loops that
keep your agents honest. In a sense, the
generalized agents fantasy is hiding
from everyone a nice clean reusable
harness pattern that we can use to build
competitive differentiation with
well-designed domain memory. We actually
have a chance now to design really
useful agents. And the whole purpose of
this video has been to take the mystery
out of it. The mystery of agents is
memory. And this is how you solve it.