Unreasonable AI

The LLM inside a chatbot has no memory: every response is generated fresh from the same frozen model. What feels like memory is the surrounding system assembling a prompt from the conversation history and re-sending the whole thing each time; the model re-reads, it doesn't remember. When the material is too large to fit in the context window, a different approach is used: retrieval-augmented generation (RAG) finds only the relevant pieces and adds them to the prompt at the moment they're needed.

Full Explanation

The LLM inside a chatbot has no memory. Every response is generated fresh from the same frozen, trained model; it retains nothing between responses and doesn't become smarter as you chat. What feels like memory is actually the surrounding system assembling a prompt that includes the conversation history and re-sending the whole thing to the model each time. The model re-reads; it doesn't remember.
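The re-reading described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular chatbot is built: `fake_model`, `ChatSession`, `build_prompt`, and `send` are hypothetical names, and the "model" is just a pure function of the prompt text, standing in for a frozen LLM.

```python
from typing import Callable, List, Tuple


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a frozen LLM: output depends only on the prompt."""
    return f"(reply after reading {len(prompt)} characters)"


class ChatSession:
    """The surrounding system. It, not the model, holds the conversation."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model
        self.history: List[Tuple[str, str]] = []  # (speaker, text) pairs

    def build_prompt(self, user_message: str) -> str:
        # Every earlier turn is written into the prompt again.
        lines = [f"{speaker}: {text}" for speaker, text in self.history]
        lines.append(f"User: {user_message}")
        lines.append("Assistant:")
        return "\n".join(lines)

    def send(self, user_message: str) -> str:
        prompt = self.build_prompt(user_message)
        reply = self.model(prompt)  # the model sees only this one string
        # The "memory" is updated here, outside the model.
        self.history.append(("User", user_message))
        self.history.append(("Assistant", reply))
        return reply


session = ChatSession(fake_model)
session.send("My name is Ada.")
print(session.build_prompt("What is my name?"))
```

The printed second prompt contains the first message in full. A new `ChatSession` wrapped around the same `fake_model` would start with an empty history, which is why the model itself cannot be said to remember anything.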

This architecture has a practical constraint: the context window. The model can only see a limited amount of text at once. When conversations grow long, earlier content gets compressed or removed, and once it leaves the context window, the model cannot access it. For large documents and knowledge bases too big to fit in the prompt, a different approach is used: retrieval-augmented generation (RAG). A retrieval system searches for the relevant pieces and adds only those to the prompt at the moment they're needed. The model answers using that context, without retaining any of it.
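Both mechanisms in this paragraph can be sketched in Python. The code below is illustrative only and rests on loud assumptions: tokens are approximated by whitespace-separated words, and retrieval is done by simple keyword overlap, where production RAG systems typically use embedding-based search. All function names (`count_tokens`, `trim_history`, `retrieve`, `build_rag_prompt`) are hypothetical.

```python
import re
from typing import List, Set


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word (real tokenizers differ).
    return len(text.split())


def trim_history(turns: List[str], budget: int) -> List[str]:
    """Keep the most recent turns that fit in the budget; older ones are dropped."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # this turn and everything older falls out of the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Rank chunks by shared words with the question (a stand-in for real search)."""
    q = _words(question)
    scored = [(len(q & _words(chunk)), i, chunk) for i, chunk in enumerate(chunks)]
    scored = [item for item in scored if item[0] > 0]  # drop chunks with no overlap
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_k]]


def build_rag_prompt(question: str, chunks: List[str], top_k: int = 2) -> str:
    # Only the retrieved pieces enter the prompt, not the whole knowledge base.
    context = "\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


knowledge_base = [
    "Refunds are issued within 14 days.",
    "Our office is in Lisbon.",
    "Shipping takes 3 to 5 business days.",
]
print(build_rag_prompt("How long do refunds take?", knowledge_base, top_k=1))
```

In this sketch the prompt for the refund question includes the refund sentence and nothing about the office or shipping. The retrieved text exists only inside that one prompt; the next question triggers a new search and a new prompt.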

Why Long Chats Get Confused

Full Explanation

Key Concepts