Kimi K2
Moonshot AI’s second-generation Kimi model. Built on the line’s defining bet — extreme long context — and updated with the reasoning-model techniques the rest of the field developed in 2024–2025.
The five-bullet version
- Kimi (from Beijing’s Moonshot AI) was an early long-context model, launching at 200k tokens and eventually supporting context lengths in the multi-million-token range.
- K2 is the second-generation release, building on K1.5 with stronger reasoning capabilities.
- The long-context bet: rather than retrieve, just put the whole document (or repo, or book) in the prompt.
- K2 combines that with modern post-training: SFT + RL, reasoning mode, tool use.
- Part of the broader Chinese LLM ecosystem (Qwen, DeepSeek, Yi, GLM, Kimi) that became globally competitive in 2024–2026.
§ 00 · THE KIMI LINE
Moonshot AI’s contribution
Kimi is the LLM line from Moonshot AI, a Beijing-based company that spent its early years differentiating on long context. Where most 2023 models stopped at 4k–32k tokens, Kimi was shipping 200k context from the start — before frontier US labs reached the same milestone.
The release line:
- Kimi (2023). First public release. 200k context on a consumer-facing chatbot product.
- Kimi K1 / K1.5 (2024). Continued scaling. Strong on document analysis and code-repo tasks.
- Kimi K2 (2025). Reasoning-mode capable, agent-ready. Multi-million-token context support.
§ 01 · LONG CONTEXT AS THE ORIGINAL BET
Skipping RAG by brute force
The thesis behind Kimi’s long-context emphasis: many tasks that look like retrieval-augmented generation problems can be solved by just stuffing the whole source into the prompt. If you have 1M tokens of context:
- An entire codebase fits.
- A book — a thick one — fits.
- A year of meeting transcripts fits.
For these cases, retrieval is no longer a hard requirement. The engineering complexity of building a RAG system goes away. The trade-off is cost: long contexts are expensive to serve (see KV Cache lesson). For applications where the cost is acceptable, this is a real simplification.
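The decision reduces to arithmetic: estimate the document’s token count and compare it against the model’s window. A minimal sketch, assuming a rough 4-characters-per-token heuristic (a production system would count with the model’s own tokenizer):

```python
# A minimal sketch, assuming ~4 characters per token for English
# prose; a real system would use the model's own tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 chars per token)."""
    return len(text) // 4

def choose_strategy(document: str, context_window: int = 1_000_000,
                    reserve: int = 8_000) -> str:
    """Stuff the whole document into the prompt if it fits (leaving
    headroom for the question and the answer); otherwise fall back
    to a retrieval pipeline."""
    if estimate_tokens(document) <= context_window - reserve:
        return "stuff"      # one prompt, no RAG machinery
    return "retrieve"       # chunk + index + retrieve top-k

book = "word " * 400_000            # ~500k estimated tokens
print(choose_strategy(book))        # -> "stuff" at a 1M window
```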
§ 02 · WHAT K2 ADDED
Specific updates
The K2 release pushed on three directions:
- Reasoning mode. A toggle for long chain-of-thought reasoning, in the same family as OpenAI’s o-series and DeepSeek-R1. Trained with RL on verifiable rewards.
- Agent-readiness. Native tool-calling support with strong instruction-following inside loops. Useful for code-agent workloads; see the loop sketch after this list.
- Better long-context retention. Empirical retention at 1M+ tokens improved measurably over K1.5. Solving the long-context-utilization problem (not just supporting long context) is the harder half.
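What the agent-readiness point looks like in code: a standard tool-calling loop against an OpenAI-compatible chat endpoint. This is a hedged sketch; the base URL, model id, and the `run_tests` tool are illustrative assumptions, not confirmed K2 parameters:

```python
# Hedged sketch of a tool-calling loop against an OpenAI-compatible
# chat endpoint. Base URL, model id, and the run_tests tool are
# illustrative assumptions, not confirmed K2 parameters.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool
        "description": "Run the repo's test suite and return failures.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

messages = [{"role": "user", "content": "Fix the failing test."}]
while True:
    resp = client.chat.completions.create(
        model="kimi-k2",      # placeholder model id
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:    # no tool requested: final answer
        print(msg.content)
        break
    messages.append(msg)      # keep the assistant turn in history
    for call in msg.tool_calls:
        result = {"failures": []}   # stub; a real agent executes here
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```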
§ 03 · WHERE THIS MODEL IS COMPETITIVE
Practical positioning
Kimi K2 is in the bracket of strong open and semi-open Chinese models alongside Qwen 3, DeepSeek-V3/R1, and GLM-4. Comparative strengths:
- Long-context workloads. Document analysis, code-base understanding, long-conversation memory. Kimi has held this niche for the longest.
- Chinese-language tasks. Native-quality Chinese, with strong reasoning.
- Application use via API. Moonshot offers Kimi via API at competitive rates. Strong choice for non-self-hosted deployments with long-context requirements; a minimal request sketch follows this list.
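For concreteness, here is the long-context, no-RAG usage pattern as a single API request. The endpoint and model id are assumptions for illustration; check Moonshot’s API documentation for the real values:

```python
# A minimal long-context request sketch over an OpenAI-compatible
# endpoint; the base URL and model id are assumptions for
# illustration (check Moonshot's API docs for real values).
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="...")

# Concatenate an entire repo into one prompt: the "no RAG" pattern.
repo = "\n\n".join(
    f"# {path}\n{path.read_text(errors='ignore')}"
    for path in Path("my_project").rglob("*.py")
)

resp = client.chat.completions.create(
    model="kimi-k2",          # placeholder model id
    messages=[
        {"role": "system",
         "content": "You answer questions about this codebase."},
        {"role": "user", "content": repo + "\n\nWhere is auth handled?"},
    ],
)
print(resp.choices[0].message.content)
```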
§ 04 · THE LONG-CONTEXT ARMS RACE
How the field caught up
Long context was Kimi’s original differentiator, but by 2025 the field had caught up:
- Gemini 1.5 (Google) — 1M tokens, then 2M.
- Claude — 200k stable, with research demos at much longer lengths.
- GPT-4 / GPT-5 — 128k, then 1M.
- Open models (Llama 3, Qwen, DeepSeek) — 128k typical, longer in research variants.
The Kimi line’s response has been to push further (10M-token research configurations) and to combine long context with stronger reasoning. Whether the long-context-first strategy remains a differentiator long-term depends on whether 1M+ becomes table stakes across the field.
§ 05 · TAKING THIS FORWARD
Related context-length topics
For why long context is technically hard, see the KV Cache lesson — long contexts grow the cache linearly in tokens, which is the dominant inference-cost factor. For when to use long context vs RAG, see Context Engineering and Advanced RAG.
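To make the cost point concrete: for a standard multi-head transformer, the KV cache per sequence is roughly 2 × layers × KV heads × head dim × tokens × bytes-per-element (the 2 covers keys and values). A quick calculation with an illustrative dense-model shape (not K2’s actual configuration, which is not assumed here):

```python
# Back-of-envelope KV-cache size per sequence. The shape below is an
# illustrative dense-model configuration, not K2's actual one.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2):  # fp16/bf16 -> 2 bytes per element
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

gib = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                     seq_len=1_000_000) / 2**30
print(f"{gib:.0f} GiB per 1M-token sequence")   # ~229 GiB
```

Numbers like this are why long-context serving leans on grouped-query or latent attention, cache quantization, and paged allocation.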
§ · GOING DEEPER
Long context and reasoning in the Kimi line
Moonshot AI’s Kimi models established themselves on long-context document understanding — the original Kimi Chat offered 200k-token contexts when frontier models were still at 32k. The Kimi k1.5 technical report (2025) documented the training recipe: a multimodal architecture, long-context RL, and the engineering required to make million-token inference economical.
Kimi K2 (2025) brought the reasoning-model recipe to the long-context regime: RL on verifiable rewards combined with retention of multi-million-token context handling. The long-context piece depends on a constellation of infrastructure work — position-encoding extensions like YaRN (Peng et al. 2023), serving optimizations for sparse attention, and training-time exposure to long sequences. The line is worth following for anyone interested in retrieval-free long-document workloads.
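For a feel of what a position-encoding extension does, here is a simplified sketch of YaRN-style frequency scaling (after Peng et al. 2023). It implements the “NTK-by-parts” idea: leave high-frequency RoPE dimensions untouched, fully interpolate low-frequency ones, and ramp linearly in between; the paper’s attention-temperature term is omitted for brevity:

```python
# Simplified sketch of YaRN-style "NTK-by-parts" frequency scaling
# (after Peng et al. 2023). High-frequency RoPE dimensions are kept,
# low-frequency ones are fully interpolated, with a linear ramp in
# between; the paper's attention-temperature term is omitted.
import numpy as np

def yarn_frequencies(dim=128, base=10000.0, orig_ctx=4096,
                     scale=8.0, beta_fast=32, beta_slow=1):
    i = np.arange(0, dim, 2)
    theta = base ** (-i / dim)            # standard RoPE frequencies
    # Rotations each dimension completes over the original context:
    rotations = orig_ctx * theta / (2 * np.pi)
    # ramp = 1 -> keep theta; ramp = 0 -> interpolate (theta / scale)
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0, 1)
    return theta * (ramp + (1 - ramp) / scale)
```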
§ · FURTHER READING
References & deeper sources
- Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs · arXiv
- Chen et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation · arXiv
- Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models · ICLR
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding · arXiv
- Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.