You tell an AI your situation. It nails the answer. Then 20 messages later it acts like it never heard half of it.
That’s not mood swings. That’s context.
LLMs don’t have “memory” the way people imagine. They have something closer to a temporary workspace: whatever text is currently included in the request (the chat history + your latest message + any pasted docs). If important details fall out of that workspace, the model can’t use them.
The desk metaphor
Imagine you’re working with a super-smart assistant, but they can only see what’s on their desk:
- Your current message
- Some amount of chat history
- Any documents you pasted in
That desk has a fixed size. If you keep adding paper, older pages slide off. The assistant isn’t refusing to remember — the paper is simply not there anymore.
That desk size is the context window.
What is a context window?
A context window is the maximum amount of text the model can consider at once.
Important detail: it’s not “how much you can paste in.” It’s the combined total of:
- input (your text + conversation history + tool results / docs)
- output (what the model generates)
So if you ask for a huge answer, you’re spending the same budget that could have kept more history visible.
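The arithmetic is simple enough to sketch. The numbers below are illustrative only (real window sizes vary by model), but they show why asking for a huge answer shrinks the history the model can still see:

```python
# Illustrative only: real context window sizes vary by model.
CONTEXT_WINDOW = 8_000     # total token budget for input + output
SYSTEM_PROMPT = 400        # tokens reserved for instructions
REQUESTED_OUTPUT = 3_000   # tokens you asked the model to generate

def history_budget(window: int, system: int, output: int) -> int:
    """Tokens left over for conversation history and pasted docs."""
    return window - system - output

# The bigger the requested output, the less history fits on the desk.
print(history_budget(CONTEXT_WINDOW, SYSTEM_PROMPT, REQUESTED_OUTPUT))
```

Halve the requested output and you free up the same number of tokens for history.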
Tokens: the units AI counts (not words)
Models don’t count “words”. They count tokens.
A token is a chunk of text produced by a tokenizer. It can be:
- a whole word ("house")
- part of a word ("extra" + "ordinary")
- punctuation (".", "{", "}")
- a space + word (often " hello" is one token)
- parts of code identifiers ("CustomerOrderID" might split)
Quick intuition (rough but useful)
- English prose: ~1 token ≈ 4 characters on average
- Code / JSON / logs: often more tokens per visible character (lots of symbols + long identifiers)
- Languages with diacritics: tokenization can be slightly less efficient depending on the tokenizer
This is why pasting “just a few logs” can destroy your context budget.
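You can turn the chars/4 rule of thumb into a one-line estimator. This is a crude heuristic, not a real tokenizer — actual tokenizers split differently, and code or logs usually cost more than this suggests:

```python
def rough_token_estimate(text: str) -> int:
    """Crude heuristic: ~1 token per 4 characters of English prose.
    Real tokenizers differ; code/JSON/logs usually come out higher."""
    return max(1, len(text) // 4)

prose = "The quick brown fox jumps over the lazy dog."
log_line = '{"ts":"2024-01-01T00:00:00Z","level":"ERROR","msg":"conn refused"}'

print(rough_token_estimate(prose))     # a short sentence is cheap
print(rough_token_estimate(log_line))  # one structured log line costs more
```

Multiply that by 3,000 log lines and you can see where the budget goes.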
Why context limits create the “AI got dumb” effect
1) It loses your constraints
You say early:
- “Don’t use OFFSET pagination”
- “We must keep transaction boundaries”
- “Naming convention is fixed”
Later, those rules may no longer be in the active context. The model switches back to generic defaults.
2) It becomes inconsistent
If it can’t see your earlier decisions, it may confidently suggest the opposite approach — not because it’s lying, but because it’s now optimizing for a different (incomplete) picture.
3) It starts “filling gaps”
When a key detail is missing, the model predicts the most likely continuation. That can look like confident facts, but it’s basically a high-quality guess.
That’s the origin of a lot of hallucinations in long threads: missing context + plausible completion.
Does the AI remember anything long-term?
Most of the time, no — not in the way people think.
There are two separate things:
- Context (short-term): what’s inside the current conversation window
- Training (long-term): knowledge learned during training, not your personal chat
Some products add extra features like “memory”, summaries, profiles, or saved instructions. But that’s external state handled by the app, not the model magically learning your life.
How serious AI apps “cheat” context limits (the right way)
If an app wants the model to behave like it remembers a lot, it usually uses one (or more) of these patterns:
1) Summaries (a.k.a. compaction)
Older parts of the conversation are compressed into a short summary like:
- Project uses Postgres
- No OFFSET
- Batch size 500k
- Data quality logging is mandatory
So instead of 50 messages, the model keeps a small “rules + decisions” page on the desk.
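A minimal sketch of that compaction loop: when history exceeds the budget, fold the oldest messages into a summary. Here `summarize` is a stand-in — a real app would call the model itself to write the summary — and the cost function reuses the rough chars/4 estimate:

```python
def summarize(messages: list[str]) -> str:
    # Placeholder: a real app would ask the LLM to compress these.
    return "Summary of earlier discussion: " + "; ".join(m[:40] for m in messages)

def compact(history: list[str], budget: int) -> list[str]:
    """Fold the oldest messages into one summary until history fits the budget."""
    cost = lambda m: max(1, len(m) // 4)   # rough token estimate
    old = []
    while sum(cost(m) for m in history) > budget and len(history) > 1:
        old.append(history.pop(0))         # drop the oldest full message...
    if old:
        history.insert(0, summarize(old))  # ...but keep its gist on the desk
    return history
```

The recent messages stay verbatim; only the old ones are collapsed.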
2) Retrieval (RAG)
Instead of pasting your whole knowledge base every time, the system:
- searches your docs/code/logs
- picks the most relevant chunks
- injects only those chunks into the model’s context
- answers grounded on what it retrieved
This is how you scale from “chat toy” to “enterprise assistant that doesn’t drift.”
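The retrieval step can be sketched in a few lines. Real systems use embeddings and a vector index, but the shape is the same: score chunks against the question, keep only the top few, and build a grounded prompt. The word-overlap scoring below is a deliberately naive stand-in:

```python
def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Naive retrieval: rank chunks by word overlap with the question."""
    q = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )[:k]

docs = [
    "Batch size for the ETL load is 500k rows.",
    "The cafeteria menu rotates weekly.",
    "Pagination must use keyset, not OFFSET.",
]

relevant = top_chunks("what batch size does the ETL use", docs)
prompt = "Answer using only these notes:\n" + "\n".join(relevant)
```

Only the relevant chunks spend context budget; the rest of the knowledge base stays on disk.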
3) Tools (real work is tool-driven)
Good coding assistants don’t rely on memory. They:
- read files
- search the repo
- run commands
- query databases
- fetch logs
That makes answers far more reliable, because the model is anchored in fresh evidence instead of vague recollection.
Practical tricks that make AI way more reliable
These are boring, but they work.
1) Keep a “Pinned Facts” block
At the top of the conversation (or in a doc you paste repeatedly), keep something like:
Pinned Facts:
- Goal: migrate MSSQL → Postgres ETL
- Constraints: no OFFSET, prefer batching, verbose logs
- Schemas: staging, infradb
- Output: Python + SQL snippets, production-safe
When the thread gets long, paste the latest version again. You've just reloaded the desk with the rules that matter.
2) Put constraints before details
Bad order: dump data → mention constraints at the end
Good order: constraints → input → example → expected output
Models follow structure. If constraints come late, they're more likely to be violated.
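One way to enforce the good order is to never write prompts by hand in long threads, but assemble them from parts. The section labels below are just a convention, not any official format:

```python
def build_prompt(constraints: list[str], data: str, example: str, task: str) -> str:
    """Assemble a prompt with constraints first, task last."""
    return "\n\n".join([
        "CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in constraints),
        "INPUT:\n" + data,
        "EXAMPLE OUTPUT:\n" + example,
        "TASK:\n" + task,
    ])

prompt = build_prompt(
    constraints=["No OFFSET pagination", "Keep transaction boundaries"],
    data="-- table DDL here --",
    example="SELECT ... WHERE id > :last_id LIMIT 500",
    task="Write the batched migration query.",
)
```

The function makes the ordering a habit instead of something you have to remember each time.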
3) Ask for assumptions, not guesses
One sentence that changes behavior:
“If something is unclear, list assumptions and ask questions instead of inventing details.”
4) Don’t paste huge dumps unless you must
Instead of 3000 log lines, paste:
- error lines
- ~20 lines around the error
- versions (driver, OS, DB)
- what changed last
If you truly need everything, use retrieval/search over the dump instead of stuffing it into the prompt.
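Trimming a dump to "error lines plus surrounding context" is easy to automate. A minimal sketch (the marker string and context size are just defaults you'd adjust):

```python
def trim_log(lines: list[str], needle: str = "ERROR", context: int = 10) -> list[str]:
    """Keep only lines containing `needle`, plus `context` lines around each hit."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]
```

Run your 3,000-line dump through this before pasting, and you spend tokens only where the problem actually is.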
5) For refactors: go file-by-file
Even with large context models, this is safer:
- ask for a plan + file list
- generate diffs for 1–3 files
- apply + run tests
- repeat
This prevents drift and keeps decisions stable.
Why bigger context isn’t the whole story
A bigger context window helps, but two models with the same token limit can behave very differently:
- some are better at pulling the right detail from earlier text
- some follow strict constraints better
- some produce cleaner code with fewer subtle bugs
- some are better at summarizing and staying consistent over long sessions
So “more tokens” is not the same as “more reliable.” It’s just more workspace.
The takeaway
- Tokens are the units LLMs count.
- The context window is the maximum tokens the model can “see” at once.
- When important info falls out of context, the model doesn’t remember it — and it starts acting inconsistent.
- Real solutions are: pinned facts, summaries, retrieval, and tool-based workflows.