Skip to content
AI for Knowledge Workers
21 / 24
E020

Why Long Chats Get Confused

Introduce RAG as a response to context and memory limits.

The LLM inside a chatbot has no memory — every response is generated fresh from the same frozen model. What feels like memory is the surrounding system assembling a prompt from conversation history and re-sending the whole thing each time; the model re-reads, it doesn't remember. When conversations grow too long for the context window, a different approach is used: retrieval-augmented generation (RAG) finds only the relevant pieces and adds them to the prompt at the moment they're needed.

Full Explanation

The LLM inside a chatbot has no memory. Every response is generated fresh from the same frozen trained model -- it doesn't retain anything between responses and doesn't become smarter as you chat. What feels like memory is actually the surrounding system assembling a prompt that includes the conversation history and re-sending the whole thing to the model each time. The model re-reads; it doesn't remember.

This architecture has a practical constraint: the context window. The model can only see a limited amount of text at once. When conversations grow long, earlier content gets compressed or removed -- and once it leaves the context window, the model cannot access it. For large documents and knowledge bases too big to fit in the prompt, a different approach is used: retrieval augmented generation (RAG). A retrieval system searches for the relevant pieces and adds only those to the prompt at the moment they're needed. The model answers using that context, without retaining any of it.