LLM Context Windows Explained in 5 Minutes

By 10xdev team, August 03, 2025

In the world of large language models, what exactly is a "context window"? Think of it as the LLM's working memory. It dictates the length of a conversation the model can handle without losing track of details from earlier in the exchange.

A Simple Analogy

Let's illustrate this with a simple analogy. Imagine "blah" represents a prompt you send to an LLM chatbot. The chatbot then returns a response, "blah." The conversation continues back and forth:

  • You: Blah?
  • LLM: Blah.
  • You: Blah blah?
  • LLM: Blah blah.

A conceptual box, the context window, holds this entire conversation. When the LLM generates its latest response, it has full access to all the previous prompts and its own earlier responses within that window. Everything is clear and in memory.

The Problem with Long Conversations

Now, consider a much longer conversation that extends beyond the model's context window. The earliest parts of the conversation fall outside this window and are effectively forgotten. The model no longer has a memory of them when generating new responses.

While an LLM can try to infer what came before by analyzing the conversation still within its context window, it's essentially making educated guesses. This can lead to significant errors and hallucinations. Understanding how the context window operates is crucial for getting the most out of any LLM.

From Analogy to Technical Reality: Tokens

In reality, context window size isn't measured in abstract units but in tokens. To understand this, we need to explore tokenization, context length, and the challenges associated with long context windows.

What is a Token?

For humans, the smallest unit of language is a character—a letter, number, or punctuation mark. For AI models, the smallest unit is a token. A token can be a single character, but it can also represent part of a word, a whole word, or even a short phrase.

Let's look at a few examples to see how a tokenizer, the tool that converts language into tokens, works.

  • Example 1: "A person drove a car." In this sentence, the letter "a" is a complete word. It would be represented by its own distinct token.

  • Example 2: "That person is amoral." Here, "a" is not a standalone word but a prefix that dramatically changes the meaning of "moral." In this case, "amoral" would likely be split into two tokens: one for "a" and another for "moral."

  • Example 3: "The person loves their cat." The "a" in "cat" is just a letter within a word and has no separate meaning. It would not be a distinct token; the entire word "cat" would be a single token.

As a general rule of thumb, one hundred words in English translate to roughly 150 tokens.
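
To see this in practice, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer (an assumption on our part: the examples above don't name a specific tokenizer, and exact splits vary from model to model):

# pip install tiktoken
import tiktoken

# cl100k_base is one widely used tokenizer; other models (including
# IBM's Granite) ship their own, so the splits will differ.
enc = tiktoken.get_encoding("cl100k_base")

for sentence in [
    "A person drove a car.",
    "That person is amoral.",
    "The person loves their cat.",
]:
    token_ids = enc.encode(sentence)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{sentence!r} -> {len(token_ids)} tokens: {pieces}")

Running this shows, for example, whether this particular tokenizer splits "amoral" into "a" + "moral", "am" + "oral", or keeps it whole. The splitting scheme is a property of the tokenizer, not of English itself.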

Context Window Size and Self-Attention

So, how many tokens can a context window hold? To answer that, we need to understand how LLMs process them. Transformer models employ a self-attention mechanism to calculate the relationships and dependencies between different parts of an input, such as words at the beginning and end of a paragraph.

This mechanism computes attention weights, one for every pair of tokens, where each weight signifies how relevant one token is to another in the sequence. The size of the context window, therefore, determines the maximum number of tokens the model can "pay attention" to at once.
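
As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product attention (a simplified toy, not any production model's implementation; real models use many heads, masking, and learned weights):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # One score per pair of tokens: how relevant is token j to token i?
    scores = Q @ K.T / np.sqrt(d_k)            # shape (seq_len, seq_len)
    # Softmax each row so the weights for a given token sum to 1.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d = 6, 8                              # six tokens, toy embedding size
X = rng.normal(size=(seq_len, d))              # stand-in token embeddings
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                               # (6, 8)

The (seq_len, seq_len) score matrix is the key detail: it is why the context window caps how many tokens can attend to one another, and, as we'll see below, why cost grows so quickly with length.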

Context window sizes have been increasing rapidly. Early LLMs had context windows of around 2,000 tokens. Today, models like IBM's Granite can have context windows of 128,000 tokens, and some models feature even larger ones. This might seem like overkill, but numerous elements can occupy space within a model's context window.
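
If you want to check a given model's context window programmatically, most Hugging Face model configurations expose it. A sketch assuming the transformers library is installed (the Granite model ID below is illustrative; substitute any model you use):

# pip install transformers
from transformers import AutoConfig

# Illustrative model ID; any Hugging Face model ID works here.
config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

# Most transformer configs report the maximum sequence length here;
# that number is the context window, measured in tokens.
print(config.max_position_embeddings)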

What Fills a Context Window?

Several components can take up space in the context window, including:

  • User Input: The prompts you provide to the model.
  • Model Responses: The replies generated by the LLM.
  • System Prompt: A set of instructions, often hidden from the user, that conditions the model's behavior, defining its capabilities and restrictions.
  • Documents & Source Code: Users can attach documents or insert source code for the LLM to reference in its responses.
  • Retrieval-Augmented Generation (RAG): Supplementary information from external data sources can be stored in the context window during inference.

A few long documents and some code snippets can quickly fill up a context window.
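
A rough budgeting sketch makes the point (the component sizes and the helper below are hypothetical; the estimator applies the ~1.5 tokens-per-word rule of thumb from earlier):

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~1.5 tokens per English word.
    return int(len(text.split()) * 1.5)

CONTEXT_WINDOW = 128_000         # e.g., a 128K-token model

components = {
    "system prompt": 400,         # hidden behavioral instructions
    "conversation history": 6_500,
    "attached document": 90_000,  # one long report's worth of text
    "RAG passages": 12_000,       # retrieved supporting snippets
    "new user prompt": estimate_tokens("Summarize the attached document."),
}

used = sum(components.values())
print(f"{used:,} of {CONTEXT_WINDOW:,} tokens used; "
      f"{CONTEXT_WINDOW - used:,} left for the model's response")

Note that the model's own response must also fit in the window, so a nearly full context leaves little room for a long answer.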

The Challenges of Large Context Windows

So, a bigger context window is always better, right? Not necessarily. Larger context windows introduce several challenges.

1. Computational Cost

The computational requirements of self-attention scale quadratically with the length of the input sequence, meaning that doubling the number of input tokens requires roughly four times the processing power. As the model predicts the next token, it computes that token's relationship with every single preceding token, so as the context length grows, the total computation required increases quadratically, not just linearly.
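
A quick back-of-the-envelope calculation shows that quadratic growth (counting pairwise attention scores only; real-world cost also depends on model size, implementation, and hardware):

# Self-attention scores every token against every other token,
# so the number of comparisons grows with the square of the length.
for n_tokens in (2_000, 4_000, 32_000, 128_000):
    print(f"{n_tokens:>7,} tokens -> {n_tokens**2:>18,} pairwise scores")

Going from a 2,000-token context to a 128,000-token one multiplies the token count by 64, but the pairwise comparisons by 4,096.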

2. Performance Degradation

Just like people, LLMs can be overwhelmed by excessive detail. They can become "lazy" and resort to cognitive shortcuts. A 2023 study ("Lost in the Middle") found that models perform best when relevant information is located at the beginning or end of the input context, and that performance degrades when the model must use information buried in the middle of a long context.

3. Security Risks

Longer context windows can present a larger attack surface for adversarial prompts. A long context length can increase a model's vulnerability to "jailbreaking," where malicious content is embedded deep within the input, making it harder for the model's safety filters to detect and block harmful instructions.

Conclusion

Ultimately, selecting the right context window size involves a trade-off. It's a balance between providing enough information for the model's self-attention mechanism to work effectively and managing the escalating computational demands and performance issues that additional tokens can introduce.
