Notes on AIE020 · Act 2 — Behavior & Limits

Why Long Chats Get Confused

This episode introduces RAG as a response to context and memory limits.


Full Explanation

The LLM inside a chatbot has no memory. Every response is generated fresh from the same frozen trained model -- it doesn't retain anything between responses and doesn't become smarter as you chat. What feels like memory is actually the surrounding system assembling a prompt that includes the conversation history and re-sending the whole thing to the model each time. The model re-reads; it doesn't remember.
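The "re-read, not remember" loop can be sketched in a few lines. This is a minimal illustration, not any particular chatbot's implementation; the function name, turn format, and character budget are all assumptions made up for the example.

```python
# Sketch of how a chat application simulates memory (all names hypothetical).
# The model itself is stateless: the app re-sends the whole history every turn.

def build_prompt(history, user_message, max_chars=2000):
    """Assemble the full prompt the stateless model actually sees this turn."""
    turns = history + [("user", user_message)]
    # Naive context-window handling: drop the oldest turns until the text fits.
    while sum(len(text) for _, text in turns) > max_chars and len(turns) > 1:
        turns = turns[1:]  # earliest content is removed first
    return "\n".join(f"{role}: {text}" for role, text in turns)

history = [
    ("user", "My name is Dana."),
    ("assistant", "Nice to meet you, Dana!"),
]
prompt = build_prompt(history, "What is my name?")
print(prompt)
```

Because the earlier "Dana" turn is included in the assembled prompt, the model can answer the follow-up question; shrink `max_chars` far enough and that turn falls out of the window, and the "memory" disappears with it.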

This architecture has a practical constraint: the context window. The model can only see a limited amount of text at once. When conversations grow long, earlier content gets compressed or removed -- and once it leaves the context window, the model cannot access it. For large documents and knowledge bases too big to fit in the prompt, a different approach is used: retrieval-augmented generation (RAG). A retrieval system searches for the relevant pieces and adds only those to the prompt at the moment they're needed. The model answers using that context, without retaining any of it.
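The RAG flow above can be sketched with a toy retriever. Real systems use embedding-based vector search; here, purely for illustration, relevance is scored by word overlap. The knowledge base, question, and function names are invented for the example.

```python
# Toy RAG sketch (illustrative only): find the most relevant chunk by word
# overlap with the question, then place only that chunk into the prompt.

def retrieve(question, chunks, k=1):
    """Return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

knowledge_base = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping to Berlin takes three to five business days.",
]
question = "How long do refund returns take?"
context = retrieve(question, knowledge_base)[0]
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The key point survives even in this toy version: the full knowledge base never enters the prompt, only the retrieved piece, and the model keeps nothing after it answers.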

Resources

No dedicated resources for this episode yet.

Browse the resource library →

Alexey Makarov

AI Enablement Strategist and Educator. Leading the AI Center of Excellence at SEFE. Creator of the Unreasonable AI YouTube channel. Based in Berlin.

About Alexey →