One-Line Summary: Google DeepMind's Gemini 1.5, released in February 2024, introduced a Mixture of Experts architecture with an unprecedented 1 million token context window — later extended to 2 million — fundamentally redefining what it means to give a model "enough context."

Prerequisites: 01-attention-is-all-you-need.md, 07-gpt-4.md, 08-palm-2-and-gemini-evolution.md

What Is Gemini 1.5?

Imagine that instead of handing a librarian a single book and asking a question, you could hand them an entire shelf of a thousand books at once, and they could still recall any detail from any page instantly. That is what Gemini 1.5 achieved with its million-token context window. While other frontier models topped out at the equivalent of a single long book, Gemini 1.5 could ingest an hour of video, eleven hours of audio, or an entire codebase in a single prompt.

Google had been playing catch-up in the LLM race since ChatGPT's November 2022 launch. The original Gemini 1.0 (December 2023) was Google's answer to GPT-4, but its reception was mixed: benchmark controversies and a botched demo video had damaged credibility. Google DeepMind needed a clear, undeniable technical advantage. They found it in context length. While OpenAI and Anthropic were working with 128K-200K context windows, Google leapfrogged to 1 million tokens, a 5-8x advantage that no competitor could quickly replicate.

The announcement came on February 15, 2024, as a research paper and limited developer preview. Gemini 1.5 Pro was the first model released, positioned as matching the quality of Gemini 1.0 Ultra (the previous flagship) while being significantly more efficient. The secret behind both the efficiency and the long context was a fundamental architectural shift: moving from a dense Transformer to a Mixture of Experts (MoE) architecture.

How It Works

Mixture of Experts Architecture

Unlike dense models, where every parameter activates for every token, Gemini 1.5 uses a Mixture of Experts design. The model contains many "expert" sub-networks, but for any given input only a fraction of them activate. This means the total parameter count can be very large (providing capacity) while the active compute per token remains manageable. Google DeepMind did not disclose exact parameter counts, but the MoE approach is believed to give Gemini 1.5 Pro a total parameter count comparable to the largest dense models while activating only a fraction of those parameters on each forward pass. This is the same fundamental approach Mixtral demonstrated (see 04-mistral-7b.md), but scaled to frontier capability.
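
To make the routing concrete, here is a minimal sketch of top-k expert routing in Python. The expert count, hidden size, and k are illustrative assumptions; Google did not disclose Gemini 1.5's configuration, and production MoE layers add load-balancing losses and batched expert dispatch that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, TOP_K = 512, 8, 2

# Each "expert" is a feed-forward sub-network; a single matrix suffices here.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02  # learned gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                # one routing score per expert
    top = np.argsort(logits)[-TOP_K:]  # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()           # softmax over the selected experts only
    # Only k of the N experts execute: per-token compute scales with k,
    # while total model capacity scales with N.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

print(moe_forward(rng.standard_normal(D_MODEL)).shape)  # (512,)
```

The last line of `moe_forward` is the whole trick: compute per token grows with k, while capacity grows with the number of experts.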

Context window comparison, Gemini 1.5 vs. competitors (first half of 2024):

Context Window Size (tokens):
 
Gemini 1.5 Pro   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  1,000,000
                  (≈ 700K words / 1,400 pages / 1 hr video)
 
Claude 3         ▓▓▓▓▓▓▓▓  200,000
 
GPT-4 Turbo      ▓▓▓▓▓  128,000
 
GPT-4 (32K)      ▓  32,768
 
Mistral Large    ▓  32,000
 
LLaMA 3          ░  8,192
 
LLaMA 2          ░  4,096
 
                  └──────────────────────────────────────────┘
                  Gemini 1.5: 5-250x larger than competitors
                  Needle-in-haystack recall: 99.7% at 1M tokens

The Million-Token Context Window

The headline achievement was the 1 million token context window, later extended to 2 million tokens for select users. In research settings, Google tested context lengths up to 10 million tokens. To put this in perspective: 1 million tokens is approximately 700,000 words, or about 1,400 pages of dense text. This is enough to hold the entire Lord of the Rings trilogy, multiple lengthy codebases, or hours of multimodal content.

Achieving this required innovations in how attention is computed and cached at extreme lengths. The quadratic cost of standard attention makes million-token contexts computationally infeasible to process naively. While Google did not disclose all architectural details, the model likely employs efficient attention variants, improved positional encoding schemes, and optimized KV-cache management to make these extreme context lengths tractable.
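
A back-of-envelope calculation shows why. Every model dimension below is an illustrative assumption rather than a disclosed Gemini 1.5 value; the point is the scaling: the attention score matrix grows quadratically with sequence length, while even the linearly-growing KV cache reaches hundreds of gigabytes.

```python
# Cost of naive full attention at 1M tokens, with assumed dimensions.
SEQ_LEN = 1_000_000   # context length in tokens
N_LAYERS = 64         # assumed transformer depth
N_KV_HEADS = 16       # assumed number of key/value heads
HEAD_DIM = 128        # assumed dimension per head
BYTES_FP16 = 2

# The attention score matrix is seq_len x seq_len per head, per layer:
score_matrix = SEQ_LEN**2 * BYTES_FP16
print(f"score matrix, one head, one layer: {score_matrix / 1e12:.1f} TB")  # 2.0 TB

# The KV cache (keys + values) must persist across every decoding step:
kv_cache = 2 * SEQ_LEN * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache, whole model: {kv_cache / 1e9:.0f} GB")                   # ~524 GB
```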

Needle-in-a-Haystack Performance

The most impressive demonstration was near-perfect retrieval across the entire context. In "needle in a haystack" evaluations — where a specific piece of information is planted at a random position within a long document — Gemini 1.5 Pro achieved 99.7% recall accuracy at 1 million tokens. This was not just finding a keyword; the model could locate, retrieve, and reason about information buried anywhere in a massive context. It even demonstrated the ability to learn a new language (Kalamang, spoken by fewer than 200 people) from a grammar manual provided in its context window, correctly translating sentences it had never seen before.
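
The evaluation recipe itself is simple to reproduce. Below is a sketch of the standard needle-in-a-haystack setup; `query_model` is a placeholder for whatever model API is under test, not a real SDK call.

```python
import random

NEEDLE = "The magic city number is 74192."
QUESTION = "\n\nWhat is the magic city number?"

def build_haystack(filler: list[str], n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    body = [random.choice(filler) for _ in range(n_sentences)]
    body.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(body)

def recall(query_model, filler: list[str], n_sentences: int, depths) -> float:
    """Fraction of depths at which the model retrieves the planted fact."""
    hits = sum(
        "74192" in query_model(build_haystack(filler, n_sentences, d) + QUESTION)
        for d in depths
    )
    return hits / len(depths)
```

Sweeping `depths` from 0.0 to 1.0 and `n_sentences` up to the window limit produces the familiar recall-by-position heatmap from the Gemini 1.5 report.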

Multimodal Long Context

The context window was not limited to text. Gemini 1.5 could natively process interleaved text, images, audio, and video within the same prompt. Concrete capabilities included: processing 1 hour of video and answering questions about specific scenes, analyzing 11 hours of audio content, reasoning over 30,000+ lines of code in a single prompt, and handling over 700,000 words of text. This multimodal long context was unique — no other model at the time could combine such extreme context length with native multimodal understanding.
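
For a sense of what this looked like in practice, here is a minimal sketch using Google's `google-generativeai` Python SDK as documented around the 1.5 launch. The file path and question are placeholders, and real code should poll the uploaded file's state until processing completes before prompting.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# The File API handles large media such as hour-long videos.
video = genai.upload_file(path="lecture_recording.mp4")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "At what point in this lecture does the speaker first define "
    "mixture-of-experts routing, and what definition is given?",
])
print(response.text)
```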

Why It Matters

Long Context as a Strategic Differentiator

Gemini 1.5 established long context as Google's primary competitive moat. While other labs focused on reasoning benchmarks, Google argued that the ability to process entire documents, codebases, or media libraries in a single prompt unlocked fundamentally new capabilities. This was not just a quantitative improvement; it was qualitatively different. Tasks that previously required complex retrieval-augmented generation (RAG) pipelines could now be handled by simply providing all the information directly in the context window.

The MoE Efficiency Advantage

By demonstrating that a Mixture of Experts model could match or exceed dense model performance at lower inference cost, Gemini 1.5 validated the MoE approach for frontier models. This was an important data point in the architecture debate: DeepSeek (see 01-deepseek-v2-and-mla.md) and others would soon push MoE efficiency even further, but Gemini 1.5 showed a major lab successfully deploying MoE at the frontier scale.

Redefining What Fits in a Prompt

Before Gemini 1.5, prompt engineering was often about compression — how to squeeze enough context into a limited window. After Gemini 1.5, the question shifted to curation — what information is worth including when you can include nearly everything? This inversion changed how developers designed AI applications, reducing reliance on external retrieval systems and enabling "context stuffing" as a viable strategy.
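
A rough decision heuristic captures the new trade-off. The thresholds below are assumptions made for the sketch, not a published recipe:

```python
def choose_strategy(corpus_tokens: int, queries_per_day: int,
                    window: int = 1_000_000) -> str:
    if corpus_tokens > window:
        return "rag"  # corpus exceeds any window: retrieval is still required
    if queries_per_day > 100:
        # Everything fits, but re-sending a huge prompt on every call is
        # expensive: cache the context or fall back to retrieval.
        return "context caching or rag"
    return "stuff the context"  # fits and is queried rarely: skip retrieval

print(choose_strategy(corpus_tokens=400_000, queries_per_day=20))    # stuff the context
print(choose_strategy(corpus_tokens=50_000_000, queries_per_day=5))  # rag
```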

Key Technical Details

  • Announced: February 15, 2024; broadly available mid-2024
  • Architecture: Mixture of Experts (MoE) Transformer
  • Context window: 1M tokens standard, 2M tokens extended, 10M tested in research
  • Needle-in-a-haystack: 99.7% recall accuracy at 1M tokens
  • Multimodal capacity: 1 hour video, 11 hours audio, 30K+ lines of code, 700K+ words in a single prompt
  • Gemini 1.5 Pro: Matched Gemini 1.0 Ultra quality at significantly lower cost
  • Gemini 1.5 Flash: Lighter variant optimized for speed and cost, released mid-2024
  • In-context language learning: Successfully learned Kalamang (a low-resource language) from a grammar book in context

Common Misconceptions

  • "A million tokens means the model reads everything equally well." While retrieval accuracy was extremely high (99.7%), there can still be subtle degradation in reasoning quality over very long contexts compared to shorter ones. The model retrieves facts well but complex multi-hop reasoning across distant context sections remains harder.

  • "Gemini 1.5 made RAG obsolete." For many use cases, yes, you can skip RAG and just put everything in context. But for truly massive knowledge bases (millions of documents), context windows are still insufficient, and RAG remains necessary. The cost of processing million-token prompts also favors RAG for high-volume applications.

  • "MoE models are strictly better than dense models." MoE introduces complexity in training stability, expert load balancing, and inference infrastructure. Meta deliberately chose dense architecture for LLaMA 3 (see 05-llama-3-and-3-1.md) precisely because of these challenges. The trade-offs are real.

  • "Google was behind before Gemini 1.5." Google had massive internal AI capabilities and had been using large language models in Search and other products for years. They were behind in the consumer chatbot race, but not in fundamental research capability.

Connections to Other Concepts

  • 08-palm-2-and-gemini-evolution.md — The journey from PaLM to PaLM 2 to Gemini 1.0 to Gemini 1.5
  • 01-claude-3-family.md — Anthropic's competitive response with 200K context, released weeks later
  • 04-mistral-7b.md — Mixtral demonstrated MoE at smaller scale
  • 01-deepseek-v2-and-mla.md — Pushed MoE efficiency further with innovative attention and expert designs
  • 03-gpt-4o.md — OpenAI's response emphasized multimodal integration over context length

Further Reading

  • Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" (2024) — The technical report detailing architecture and evaluations.
  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017) — The foundational MoE paper from Google that Gemini 1.5 builds upon.
  • Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — Key analysis of how models handle information position in long contexts.
  • Google, "Gemini 1.5 Flash: Lightweight model for efficient performance" (2024) — The speed-optimized variant announcement.