SwellJoe 2 days ago [-]
I know everybody seems to want the agent to remember every conversation they've ever had with it, but I just don't see the value in that. In fact, it seems to hurt productivity to have the agent second-guessing me based on something I said yesterday. Every time I've used any memory system, the agent gets distracted from the current task by previous conversations and branches of development... often commingling unrelated projects (I work on code for work, open source projects, a bunch of unrelated side projects, etc.) and trying to satisfy requirements that don't make sense.
I've stopped trying to achieve general "memory". I just ask the agent to thoroughly, but concisely, document each project. If it writes developer documentation and a development plan/roadmap, as though a person was going to have to get up to speed and start working on the project, it provides all the information the agent needs tomorrow or next week to pick up where we left off.
The agent is not my friend. I don't need it to remember my birthday or the nasty thing I said about React last week. I need it to document what anyone, agent or human, would need to know to get productive in a particular repo, with no previous knowledge of the project.
Good, concise developer and user documentation and a plan with checklists solve every problem people seem to think "memory" will solve: It tells the agent what tech stack to use (we hashed it out in planning), it tells it what commands it needs to run and test the app, it covers the static analysis tools in use (which formalizes code style, etc. in a way a vague comment I made a month ago cannot), and it is cheap. Markdown files are the native tongue of agents. No MCP, no skills, no API needed. Just read the file. It works for any agent, any model, and any human just getting started with the project.
Basically, I think memory makes agents dumber and less useful. I want it to focus on the task at hand.
pil0u 2 days ago [-]
I appreciate your comment, and can relate. I tested a couple of "memory" systems, some doing heavy lifting or seemingly implementing theories (layering, hot memory, etc.). I can't really tell if they improve performance, quality, or reliability on a task. But they do increase the overhead, for the LLM and for me, that's for sure.
One problem I have is that CLAUDE.md files and skills now tend to get version-controlled within projects; I suspect they could get in the way sometimes.
There is already so much fatigue induced by these systems, adding another one willingly does sound crazy.
mtrifonov 1 day ago [-]
You're right but I think you're describing flat memory. The agent gets distracted because every old fact has the same weight as the current one. That's a salience problem.
What works in production for me is typed memory with very different decay curves. Personality and relationships are essentially permanent. Preferences fade in months. Stated intent fades in weeks. Emotion and events fade in days. Reinforcement (repeated recall) keeps things alive regardless of type.
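Roughly, the shape of it (type names and half-lives here are illustrative, not the production values):

    import time

    HALF_LIFE_DAYS = {
        "identity":   float("inf"),  # personality/relationships: effectively permanent
        "preference": 90.0,          # fades in months
        "intent":     14.0,          # fades in weeks
        "event":      2.0,           # fades in days
    }

    def salience(fact):
        age_days = (time.time() - fact["last_recalled"]) / 86400
        return fact["weight"] * 0.5 ** (age_days / HALF_LIFE_DAYS[fact["type"]])

    def recall(fact):
        # reinforcement: recalling resets the clock, keeping the fact
        # alive regardless of its type
        fact["last_recalled"] = time.time()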
Cross-project commingling stops because project-specific stuff actually decays out of relevance while who the user is persists. There's also a filter on what even gets written, which scopes between globally and locally relevant information and writes accordingly (if at all). Most of the noise you're describing comes from systems that store everything they observe.
Flat memory failing is real. Memory failing in general is a stronger claim than that.
SwellJoe 1 day ago [-]
I'm making the stronger claim. I don't think memory (at least, what people call "memory", even though it isn't...the memories LLMs have are baked in at training, everything else is context), no matter how fancy, improves outcomes, at least for the work I do on the software I work on. I just don't think the agent needs what people are calling memory.
I think the base truth is the code, which can be loaded into context at no greater cost than whatever "memory" system you're using, probably lower cost, actually. A few hints in documentation fills out the rest of the picture.
You can't realistically give an LLM memory, as current technology doesn't allow retraining the model on the fly. You can only give it more data to ingest into its context. Unless that data is directly relevant to the task at hand, it's probably detrimental. At best, it is just burning tokens for no benefit.
Terretta 1 day ago [-]
Research shows primed context has some equivalence to a fine-tuning layer.
SwellJoe 19 hours ago [-]
Primed with what? Every random thought you dropped into the agent over the last couple months? I'll need to see that research.
netcan 1 day ago [-]
Useful comment. Thanks.
Kim_Bruning 23 hours ago [-]
I'm really curious to see your memory code, if you're sharing!
ohNoe5 2 days ago [-]
Yeah it's that lack of perfect recall, imo, that gives rise to intelligence and progress.
If we humans just did exactly what we did yesterday, what progress?
It's baked into the immutable constants of the universe for us: entropy, signal attenuation over distance... information breaks down over time.
Because of this, all human social statistics trend towards zero without intentional conservatism. Progress or collapse is all the universe affords. It doesn't seem interested in conservatism at all.
hbarka 2 days ago [-]
You still have to worry about handing off state into the next session, but you don't want it loading ("just naively read the files") your whole stack of documents at every turn. That goes against the idea of progressive disclosure. Progressive disclosure scales.
hellohello2 2 days ago [-]
I can't see any value in having a global memory either, but I can see the value of a local memory for a specific line of work. I.e., when implementing several related features in a row, you want the agent to remember what it did in the last chat.
giancarlostoro 2 days ago [-]
I prefer ticketing systems for AI. I don't care that it forgets what I did last week, I just need it to be able to compact its own memory and grab the next task once done.
SwellJoe 2 days ago [-]
I'm ambivalent about that. I've seen people use beads, and they're just making busy work for the agents, splitting stuff up into tiny tasks that could have been one-shotted as part of the larger plan. They seem to just enjoy making thinky machine go brrr, even when it makes the work take longer and burn a lot more tokens.
I tend to think developing with agents should look a lot like managing a human (like, I use feature-branch development with PRs and review them, even on my own projects that have no other devs and don't need a paper trail for security audit purposes), so I theoretically can get down with an issue-based process, but thus far I haven't seen it done in a way that isn't just making busy work for agents.
giancarlostoro 2 days ago [-]
I started with Beads, then wound up building my own: https://github.com/Giancarlos/guardrails
Key things: I added a concept called "gates", which are tied to all tasks and force the agent to satisfy arbitrary requirements, such as: ensure the project still runs/compiles, run all tests and ensure they pass, review existing tests critically and point out if they're not comprehensive enough, and finally, get human confirmation on the task. Until the human confirms, the agent just works on another task, and so on.
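As data, a gate looks something like this (simplified sketch, not the exact schema):

    GATES = [
        {"id": "build",  "check": "project still runs/compiles"},
        {"id": "tests",  "check": "all tests pass"},
        {"id": "review", "check": "existing tests critiqued for coverage gaps"},
        {"id": "human",  "check": "human confirmed the task"},
    ]

    def task_complete(task):
        # a task only closes once every gate has passed; while the human
        # gate is pending, the agent parks this task and grabs another
        return all(g["id"] in task["passed_gates"] for g in GATES)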
I didn't like that Beads was built on top of Git; I don't always work on git-friendly projects, and Beads kept getting messed up if I switched branches. So I made mine SQLite-based. I also made it so you can sync to GitHub issues, and sync pre-existing (and new) GitHub issues as guardrails tasks to be worked on; the agent will even leave a comment for you on GitHub when it grabs an issue, to let others know the work may already be underway.
waterproof 1 day ago [-]
nice concept! Beads did not age all that well, and Claude doesn't really want to use it since the TodoList upgrade.
Do you have any tricks for getting Claude to use guardrails effectively alongside (or instead of) TodoList?
giancarlostoro 1 day ago [-]
It works hand in hand, to be honest: Claude will read tickets that match the criteria of what I'm looking to work on and tack them onto its todo list; it just becomes an overview of my tasks.
mrits 2 days ago [-]
I'm just thinking of YouTube- or Amazon-type algorithms applying here.
me: "Hi AI, can you debug this SQL Statement?"
ai: "Well,based on your passion for garden hoses and extensive research of refrigerators, I'm going to guess you really want to discuss that"
staticassertion 2 days ago [-]
I've had to remove any of the "knowledge" about me from any agent I use. "As a security engineer, blah blah blah" or "as a Rust developer, blah blah blah", even though my questions have nothing to do with those topics, and they're a huge distraction.
SwellJoe 2 days ago [-]
Yeah, I've disabled memory in everything I use. It's super distracting to have it infer connections between conversations where there are none. It's also kind of sleazy-feeling. Like, manipulative in the sense that it thinks it knows what I'm into, so it's going to weave that into the conversation.
If we didn't have evidence that these things cause something like psychosis in some people, it'd seem innocent. But, since the sycophancy combines with the long-term relationships some people think they're having with matrix math to trigger serious mental health problems, it feels more sinister.
Anyway, having a long-term memory makes them dumber and more easily confused. I don't have any use for a dumb agent.
carlovalenti 3 hours ago [-]
Don't miss Google Research's (formerly over-hyped) Titans + Atlas line of research; the papers are quite inspiring and informative on the subject, even if I don't think they released anything (open, at least) as a follow-up.
https://arxiv.org/abs/2501.00663
https://arxiv.org/abs/2505.23735
xcf_seetan 2 days ago [-]
It strikes me as funny how we want to reach superintelligent AI but keep trying to anthropomorphize every aspect of AI to make it more "human". IMHO, if we keep doing it we will create a Human AI with all the errors and deficiencies humans have.
arvid-lind 2 days ago [-]
Well, it's an effort by the few to eliminate the need for other humans, so maybe that's what they want. Call it "artificial creativity".
michelhabib 23 hours ago [-]
Hahaha, that's true. Intelligent and human-like may not mean efficient, as humanity is deeply flawed by design.
TZubiri 2 days ago [-]
Do you think humans don't have perfect memory because it's hard to achieve and millions of years of evolution haven't been able to? Or because it's convenient to forget in order to prioritize the more important recent information?
It's obviously the latter: a system that 'remembers everything perfectly' is probably not optimal in most senses. Mortality is a property of both life and artificial systems; forcing the same retention policy on new information and old information probably comes at the expense of lifespan or stability.
xcf_seetan 2 days ago [-]
I think it's the latter also. What I was saying is more that we want a God-like AI but keep working towards a more human-like AI.
TZubiri 2 days ago [-]
Well it was believed that (hu)man was made in the image of God, so perhaps reaching god involves maintaining and even furthering our human-like traits.
I think design-by-nature is consistent with seeking perfection; of course it won't ever be achieved, but organic inspirations can, and often do, help maximize a lot of parameters.
krupan 2 days ago [-]
What powers your AI? Does it have any waste products? Does any of the hardware need to go down for maintenance ever?
rl3 1 day ago [-]
>... Does it have any waste products? ...
There are now commercially available computers that operate using human neurons.
I figure before too long we'll be feeding our computers Pepto-Bismol and Tums.
ArielTM 13 hours ago [-]
A different axis that holds up: a tiny always-loaded index pointing at per-fact files that load on demand. Claude Code's auto-memory uses this. The part that does the work isn't the lookup, it's the index itself acting as a filter: every new fact has to summarize into one line that earns its keep, or it doesn't get saved. Stuff that misses the bar either gets rediscovered during work or wasn't load-bearing. 52% Recall@5 from forgetting is a real research win, but the production lever has been bounding the always-loaded set, not scoring what to forget.
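A sketch of the shape (paths and format invented, not Claude Code's actual layout):

    from pathlib import Path

    INDEX = Path("memory/INDEX.md")   # tiny, loaded every session
    FACTS = Path("memory/facts")      # per-fact files, read on demand

    def remember(slug, one_liner, details):
        # the filter is the one-line summary itself: a fact that can't be
        # compressed into a line that earns its keep doesn't get saved
        if len(one_liner) > 100:
            return False
        FACTS.mkdir(parents=True, exist_ok=True)
        (FACTS / f"{slug}.md").write_text(details)
        with INDEX.open("a") as idx:
            idx.write(f"- {one_liner} -> facts/{slug}.md\n")
        return True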
SachitRafa 12 hours ago [-]
Hi all, builder here, quick update. Benchmarked it against the LongMemEval dataset and got 84.8% Recall@5 and 86.8% nDCG any@5. All the methodology and results are in the repo!
K0balt 2 days ago [-]
I planned and supervised the build of an ambient recall system, where a 4b model looks at the last 3k or so of context and picks through the RAG database for high-ranking memories to inject, as well as mineable things to mark. Injections happen on about 1 in 5 turns on most technical topics, the data picked mostly from prior design docs and data sheets. At session wrapup the inference model goes back and rates all the memory injections in a frontmatter section, then looks at all the memory suggestions and commits those it finds memorable to the RAG database. Manual memorisation and RAG search are also available inline in the chat to both the user and the model. It also allows the main model to spawn little models as minions to work on repetitive simple tasks.
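The injection step is roughly this (the 4b ranking model is stubbed out with a crude overlap score; names are made up):

    def score(tail, memory_text):
        # stand-in for the small ranking model
        a, b = set(tail.lower().split()), set(memory_text.lower().split())
        return len(a & b) / max(len(b), 1)

    def pick_injections(context, memories, top_k=3, threshold=0.3):
        tail = context[-3000:]  # the last ~3k of context
        ranked = sorted(memories, key=lambda m: score(tail, m), reverse=True)
        return [m for m in ranked[:top_k] if score(tail, m) >= threshold]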
Seems to maybe be useful but I’m not sure yet.
tpoacher 24 hours ago [-]
Not something I've (yet) pursued, but I did wonder a few days back if there was a good analogy between the context window and short-term memory, and storage and long-term memory; and if so, whether an Anki-like algorithm might lead to better contexts by keeping relevant/difficult "memories" fresher for the AI (via spaced repetition), in an efficient manner.
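Something SM-2-ish would map over almost directly; a rough sketch of what I mean:

    import time

    def review(card, recalled_ok):
        # Anki-style: memories the agent actually used get spaced out;
        # memories it needed but lacked get reset so they stay fresh
        if recalled_ok:
            card["interval_days"] *= card["ease"]
            card["ease"] = min(card["ease"] + 0.1, 3.0)
        else:
            card["interval_days"] = 1.0
            card["ease"] = max(card["ease"] - 0.2, 1.3)
        card["due"] = time.time() + card["interval_days"] * 86400

    def due_for_context(cards):
        # "due" memories get rehearsed into the next context
        return [c for c in cards if c["due"] <= time.time()]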
SachitRafa 12 hours ago [-]
Also, if any of you are curious about how to set it up, you only need to run these two commands:
    pip install yourmemory
    yourmemory-setup
waterbuffaloai 2 days ago [-]
I am also building a similar memory structure and decay mechanism for my local agent project, where I also use the Ebbinghaus forgetting curve.
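(I.e., retention just decays exponentially, with a stability parameter controlling the curve:)

    import math

    def retention(t_days, stability_days):
        # Ebbinghaus forgetting curve: R = e^(-t/S)
        return math.exp(-t_days / stability_days)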
One of the challenges I face is how to decide effectively what to save in memory: is it the model that decides what is important, summarizes it, and saves it to memory? How do you avoid redundancy and categorize memories correctly, so you get the right hit and can decide what to forget?
I would love to learn more about your approach and what your thoughts are on those points.
tra3 2 days ago [-]
I haven't had much luck with memory implementations. I tried a few.
What I do now is preserve all my claude code conversations and set the context from there.
This allows me to curate memory and it’s been the best way so far.
larrydakhissi 2 days ago [-]
You just made Alzheimer's a feature, lol. But seriously, this is very interesting.
keeda 2 days ago [-]
Missed opportunity to call it AIzheimer? ;-)
axeldunkel 2 days ago [-]
I only use a decay function to see how "hot" a chunk is - not for forgetting old ones. What concerns me more are memory chunks with errors in them - they need to be corrected/removed by some other mechanism, not by decay (since they might get retrieved often).
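I.e., something like this (sketch; numbers arbitrary):

    import time

    def hotness(chunk, half_life_days=30.0):
        # decay only re-ranks; nothing is deleted for being old
        age_days = (time.time() - chunk["last_hit"]) / 86400
        return chunk["similarity"] * 0.5 ** (age_days / half_life_days)

    def quarantine(chunk):
        # errors need explicit correction, not decay: a wrong chunk
        # that gets retrieved often stays "hot" forever
        chunk["excluded"] = True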
0-_-0 1 day ago [-]
Is it the cumulative weighting based on the softmax output? Is it per layer?
Kim_Bruning 23 hours ago [-]
Everyone and their pet dog is making longer term memory systems at the same time, and they all seem kind of meh. Not casting aspersions here, my own attempts all crash and burn too. And better than nothing is still better than nothing.
Thing is, this seems like it might be a Hard Problem of some sort. Everyone trying, no one making a clean breakthrough, I feel like it's some sort of smell. Either the desired function isn't well understood, or there's something missing, or it's in some weird complexity class, or ... something. My spidey senses tingle.
I wonder if others have the same feeling?
cyanydeez 2 days ago [-]
on the other "biological memory" post in so many weeks, I pointed out that the decay rate shouldn't be based on a real clock but a lifetime of it's use within the coding session. Elsewise your memory fades even when there's no process change (eg, coder goes on vacation). I'm not going to check whether thats true here, but it seems like a naive first assumption thats failed conceptualization.
The other comment is that spatial memory is probably a better trigger for memory, so if you're not tracking where the coding session starts, the folders it visits, etc., then you're not really providing a good associative footpath for the assistant to retrieve what's important for any given project.
altmanaltman 2 days ago [-]
I am sorry but the whole "biological memory" thing seems like marketing fluff on basic cache mechanisms.
You said it cuts token usage by 84%, but isn't that typical of any chunked RAG system?
And why did you specifically choose to test against the LoCoMo dataset when there are a lot of issues with it, including it being very easy to cheat?
xhevahir 2 days ago [-]
And a neural network is really just a composed, non-linear parameterized function that maps input vectors to output vectors. Sometimes metaphors or analogies do contribute something valuable.
throawayonthe 2 days ago [-]
Isn't that an example of an analogy being more misleading than useful?
mtrifonov 1 day ago [-]
Decay-as-eviction is just LRU, fair. Type-conditional half-life is worth defending, though.
A user's job and personality should be effectively permanent. Their stated intent for this week should fade in days. Their emotional state from a single message should be gone by tomorrow. Decay everything at one rate and you're back to LRU with the problems you're calling out.
The "biological" framing isn't really doing much work. Ebbinghaus is one curve and fine, but it's not where the leverage is. Type-conditional half-life is. Without that, this is a cache.
jnovek 2 days ago [-]
I think it's reasonable; a forgetting curve is intended to model a biological process.
https://en.wikipedia.org/wiki/Forgetting_curve