Kimi K2
Moonshot AI’s second-generation Kimi model. Built on the line’s defining bet — extreme long context — and updated with the reasoning-model techniques the rest of the field developed in 2024–2025.
The five-bullet version
- Kimi (from Beijing’s Moonshot AI) was an early long-context model, launching at 200k tokens and eventually supporting context lengths in the multi-million-token range.
- K2 is the second-generation release, building on K1.5 with stronger reasoning capabilities.
- The long-context bet: rather than retrieve, just put the whole document (or repo, or book) in the prompt.
- K2 combines that with modern post-training: SFT + RL, reasoning mode, tool use.
- Part of the broader Chinese LLM ecosystem (Qwen, DeepSeek, Yi, GLM, Kimi) that became globally competitive in 2024–2026.
§ 00 · THE KIMI LINE
Moonshot AI’s contribution
Kimi is the LLM line from Moonshot AI, a Beijing-based company that spent its early years differentiating on long context. Where most 2023 models stopped at 4k–32k tokens, Kimi was shipping 200k context from the start — before frontier US labs reached the same milestone.
The release line:
- Kimi (2023). First public release. 200k context on a consumer-facing chatbot product.
- Kimi K1 / K1.5 (2024). Continued scaling. Strong on document analysis and code-repo tasks.
- Kimi K2 (2025). Reasoning-mode capable, agent-ready. Multi-million-token context support.
§ 01 · LONG CONTEXT AS THE ORIGINAL BET
Skipping RAG by brute force
The thesis behind Kimi’s long-context emphasis: many tasks that look like retrieval-augmented generation problems can be solved by just stuffing the whole source into the prompt. If you have 1M tokens of context:
- An entire codebase fits.
- A book — a thick one — fits.
- A year of meeting transcripts fits.
For these cases, retrieval is no longer a hard requirement. The engineering complexity of building a RAG system goes away. The trade-off is cost: long contexts are expensive to serve (see KV Cache lesson). For applications where the cost is acceptable, this is a real simplification.
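The decision reduces to arithmetic: estimate the document’s token count and compare it against the model’s window. A minimal sketch, assuming a rough 4-characters-per-token heuristic (a production system would count with the model’s own tokenizer):

```python
# A minimal sketch, assuming ~4 characters per token for English
# prose; a real system would use the model's own tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 chars per token)."""
    return len(text) // 4

def choose_strategy(document: str, context_window: int = 1_000_000,
                    reserve: int = 8_000) -> str:
    """Stuff the whole document into the prompt if it fits (leaving
    headroom for the question and the answer); otherwise fall back
    to a retrieval pipeline."""
    if estimate_tokens(document) <= context_window - reserve:
        return "stuff"      # one prompt, no RAG machinery
    return "retrieve"       # chunk + index + retrieve top-k

book = "word " * 400_000            # ~500k estimated tokens
print(choose_strategy(book))        # -> "stuff" at a 1M window
```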
§ 02 · WHAT K2 ADDED
Specific updates
The K2 release pushed on three directions:
- Reasoning mode. A toggle for long chain-of-thought reasoning, in the same family as OpenAI’s o-series and DeepSeek-R1. Trained with RL on verifiable rewards.
- Agent-readiness. Native tool-calling support with strong instruction-following inside loops. Useful for code-agent workloads; see the loop sketch after this list.
- Better long-context retention. Empirical retention at 1M+ tokens improved measurably over K1.5. Solving the long-context-utilization problem (not just supporting long context) is the harder half.
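What the agent-readiness point looks like in code: a standard tool-calling loop against an OpenAI-compatible chat endpoint. This is a hedged sketch; the base URL, model id, and the `run_tests` tool are illustrative assumptions, not confirmed K2 parameters:

```python
# Hedged sketch of a tool-calling loop against an OpenAI-compatible
# chat endpoint. Base URL, model id, and the run_tests tool are
# illustrative assumptions, not confirmed K2 parameters.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool
        "description": "Run the repo's test suite and return failures.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

messages = [{"role": "user", "content": "Fix the failing test."}]
while True:
    resp = client.chat.completions.create(
        model="kimi-k2",      # placeholder model id
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:    # no tool requested: final answer
        print(msg.content)
        break
    messages.append(msg)      # keep the assistant turn in history
    for call in msg.tool_calls:
        result = {"failures": []}   # stub; a real agent executes here
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```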
§ 03 · WHERE THIS MODEL IS COMPETITIVE
Practical positioning
Kimi K2 is in the bracket of strong open and semi-open Chinese models alongside Qwen 3, DeepSeek-V3/R1, and GLM-4. Comparative strengths:
- Long-context workloads. Document analysis, code-base understanding, long-conversation memory. Kimi has held this niche for the longest.
- Chinese-language tasks. Native-quality Chinese, with strong reasoning.
- Application use via API. Moonshot offers Kimi via API at competitive rates. Strong choice for non-self-hosted deployments with long-context requirements; a minimal request sketch follows this list.
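For concreteness, here is the long-context, no-RAG usage pattern as a single API request. The endpoint and model id are assumptions for illustration; check Moonshot’s API documentation for the real values:

```python
# A minimal long-context request sketch over an OpenAI-compatible
# endpoint; the base URL and model id are assumptions for
# illustration (check Moonshot's API docs for real values).
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="...")

# Concatenate an entire repo into one prompt: the "no RAG" pattern.
repo = "\n\n".join(
    f"# {path}\n{path.read_text(errors='ignore')}"
    for path in Path("my_project").rglob("*.py")
)

resp = client.chat.completions.create(
    model="kimi-k2",          # placeholder model id
    messages=[
        {"role": "system",
         "content": "You answer questions about this codebase."},
        {"role": "user", "content": repo + "\n\nWhere is auth handled?"},
    ],
)
print(resp.choices[0].message.content)
```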
§ 04 · THE LONG-CONTEXT ARMS RACE
How the field caught up
Long context was Kimi’s original differentiator, but by 2025 the field had caught up:
- Gemini 1.5 (Google) — 1M tokens, then 2M.
- Claude — 200k stable, with research demos at much longer lengths.
- GPT-4 / GPT-5 — 128k, then 1M.
- Open models (Llama 3, Qwen, DeepSeek) — 128k typical, longer in research variants.
The Kimi line’s response has been to push further (10M-token research configurations) and to combine long context with stronger reasoning. Whether the long-context-first strategy remains a differentiator long-term depends on whether 1M+ becomes table stakes across the field.
§ 05 · TAKING THIS FORWARD
Related context-length topics
For why long context is technically hard, see the KV Cache lesson — long contexts grow the cache linearly in tokens, which is the dominant inference-cost factor. For when to use long context vs RAG, see Context Engineering and Advanced RAG.
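To make the cost point concrete: for a standard multi-head transformer, the KV cache per sequence is roughly 2 × layers × KV heads × head dim × tokens × bytes-per-element (the 2 covers keys and values). A quick calculation with an illustrative dense-model shape (not K2’s actual configuration, which is not assumed here):

```python
# Back-of-envelope KV-cache size per sequence. The shape below is an
# illustrative dense-model configuration, not K2's actual one.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2):  # fp16/bf16 -> 2 bytes per element
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

gib = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                     seq_len=1_000_000) / 2**30
print(f"{gib:.0f} GiB per 1M-token sequence")   # ~229 GiB
```

Numbers like this are why long-context serving leans on grouped-query or latent attention, cache quantization, and paged allocation.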
§ · GOING DEEPER
Long context and reasoning in the Kimi line
Moonshot AI’s Kimi models established themselves on long-context document understanding — the original Kimi Chat offered 200k-token contexts when frontier models were still at 32k. The Kimi k1.5 technical report (2025) documented the training recipe: a multimodal architecture, long-context RL, and the engineering required to make million-token inference economical.
Kimi K2 (2025) brought the reasoning-model recipe to the long-context regime: RL on verifiable rewards combined with retention of multi-million-token context handling. The long-context piece depends on a constellation of infrastructure work — position-encoding extensions like YaRN (Peng et al. 2023), serving optimizations for sparse attention, and training-time exposure to long sequences. The line is worth following for anyone interested in retrieval-free long-document workloads.
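For a feel of what a position-encoding extension does, here is a simplified sketch of YaRN-style frequency scaling (after Peng et al. 2023). It implements the “NTK-by-parts” idea: leave high-frequency RoPE dimensions untouched, fully interpolate low-frequency ones, and ramp linearly in between; the paper’s attention-temperature term is omitted for brevity:

```python
# Simplified sketch of YaRN-style "NTK-by-parts" frequency scaling
# (after Peng et al. 2023). High-frequency RoPE dimensions are kept,
# low-frequency ones are fully interpolated, with a linear ramp in
# between; the paper's attention-temperature term is omitted.
import numpy as np

def yarn_frequencies(dim=128, base=10000.0, orig_ctx=4096,
                     scale=8.0, beta_fast=32, beta_slow=1):
    i = np.arange(0, dim, 2)
    theta = base ** (-i / dim)            # standard RoPE frequencies
    # Rotations each dimension completes over the original context:
    rotations = orig_ctx * theta / (2 * np.pi)
    # ramp = 1 -> keep theta; ramp = 0 -> interpolate (theta / scale)
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0, 1)
    return theta * (ramp + (1 - ramp) / scale)
```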
§ · FURTHER READING
References & deeper sources
- Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs · arXiv
- Chen et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation · arXiv
- Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models · ICLR
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding · arXiv
- Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.