Qwen 3
Alibaba’s third-generation open model family. By 2025–2026, the Qwen line had become one of the strongest open-weights options across the size spectrum — from sub-1B SLMs to massive MoE checkpoints.
The five-bullet version
- Qwen is Alibaba’s open-source LLM line; Qwen 3 is the third major release, building on Qwen 1 (2023) and Qwen 2 / 2.5 (2024).
- Mixed dense + MoE family. Sizes from sub-1B dense models to multi-hundred-billion-parameter mixture-of-experts.
- Strong multilingual coverage (Chinese-first, but globally competitive).
- Long context support; decoder-only transformer architecture with GQA and RoPE (a minimal attention sketch follows this list).
- Significant for the open ecosystem: a high-quality alternative to Llama with permissive licensing.
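Since the architecture bullet above names GQA, here is a minimal grouped-query attention sketch in PyTorch. The head counts and dimensions are illustrative assumptions, not Qwen 3's actual configuration; the point is that groups of query heads share K/V heads, shrinking the KV cache.

```python
import torch
import torch.nn.functional as F

# Assumed toy sizes, not Qwen 3's real config.
n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 64, 16
group = n_q_heads // n_kv_heads  # query heads per shared K/V head

q = torch.randn(1, n_q_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# GQA: replicate each K/V head across its query-head group, so the KV cache
# stores n_kv_heads heads instead of n_q_heads.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```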
§ 00 · THE QWEN LINE
Three generations
Qwen is Alibaba's open-weight LLM family, first released in 2023 out of Alibaba's DAMO Academy. By Qwen 2.5 (2024) the line had reached broad parity with Llama 3 on benchmarks; Qwen 3 (2025) extended it further, especially in long context and reasoning. The lineage:
- Qwen 1 (2023). Decoder-only transformer, sizes from 1.8B to 72B. First major open Chinese-developed LLM with competitive English benchmarks.
- Qwen 2 / Qwen 2.5 (2024). Improved tokenizer, extended context (up to 128k), introduction of MoE variants. Strong adoption in production deployments.
- Qwen 3 (2025). Continued scaling, deeper RL post-training, native reasoning variants. Strong across small (sub-1B to 8B dense) and large (32B dense, MoE) tiers alike.
§ 01 · WHAT QWEN 3 ADDED
Specific advances
The Qwen 3 release emphasizes several specific improvements over Qwen 2.5:
- Reasoning-mode toggles. Several Qwen 3 variants can switch between “direct answer” and “long chain-of-thought” reasoning depending on the request, mirroring the o-series / Claude extended-thinking approach (see the sketch after this list).
- Tighter post-training. RL on verifiable rewards for math and code (parallel to DeepSeek-R1’s approach).
- Improved tool use. Native function-calling support tuned for agent loops.
- Better quantization recipes. Q4 quantization with minimal quality loss, important for laptop / phone deployment.
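A sketch of the reasoning-mode toggle in practice. The `enable_thinking` flag follows the usage shown on Qwen 3 model cards for Hugging Face transformers; the checkpoint name and generation settings here are assumptions to verify against the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Is 9.11 larger than 9.9?"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False requests a direct answer, no chain-of-thought
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```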
§ 02 · THE DENSE + MOE FAMILY
Size spectrum
Qwen 3 ships across a wide size range:
- Small dense: 0.6B / 1.7B / 4B / 8B / 14B. Runs on laptops, phones, and edge accelerators.
- Mid dense: 32B. For serious self-hosted deployments.
- MoE variants: Qwen3-30B-A3B (~3B active per token) and Qwen3-235B-A22B (235B total, ~22B active). Frontier-class capability at mid-class inference cost.
The MoE strategy follows the same playbook as Mixtral, DeepSeek-V3, and others: many specialized expert subnetworks, with a routing mechanism that picks a few experts per token. Capacity scales with the number of experts; per-token compute scales with the number active.
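To make the routing playbook concrete, here is a toy top-k router in PyTorch. It is an illustrative sketch, not Qwen 3's actual MoE code; expert count, k, and dimensions are made up.

```python
import torch
import torch.nn.functional as F

n_experts, top_k, d = 8, 2, 32  # assumed toy sizes
x = torch.randn(4, d)           # 4 tokens

gate = torch.nn.Linear(d, n_experts, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

logits = gate(x)                           # router score per (token, expert)
weights, idx = logits.topk(top_k, dim=-1)  # choose k experts per token
weights = F.softmax(weights, dim=-1)       # normalize over the chosen k

# Only the selected experts run for each token: capacity grows with n_experts,
# per-token compute only with top_k.
out = torch.zeros_like(x)
for t in range(x.shape[0]):
    for w, e in zip(weights[t], idx[t]):
        out[t] += w * experts[int(e)](x[t])
print(out.shape)  # torch.Size([4, 32])
```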
§ 03 · MULTILINGUAL AND LONG-CONTEXT
Differentiators
Two strengths of the Qwen line that have stayed visible in independent evaluations:
- Multilingual. Strong performance on Chinese benchmarks (where Western-built models often lag), competitive on most other major languages. Useful when the target audience or data is non-English.
- Long context. 128k and (in some variants) up to 1M tokens, with measurable retention beyond the first few thousand tokens. Especially relevant for document AI and code repositories (a RoPE-scaling sketch follows this list).
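The RoPE-scaling sketch referenced above: Qwen model cards describe YaRN-style scaling for contexts beyond the native window. The checkpoint name, factor, and exact key names below are assumptions; check the model card and your transformers version (older configs spell the key "type" rather than "rope_type").

```python
from transformers import AutoModelForCausalLM

# Hedged sketch: override rope_scaling at load time to stretch the window.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",   # assumed checkpoint
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",  # assumed key; some versions use "type"
        "factor": 4.0,        # roughly 4x the native training window
        "original_max_position_embeddings": 32768,  # assumed native window
    },
)
```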
§ 04 · WHAT THIS MODEL FAMILY REPRESENTS
Ecosystem implications
Qwen 3 — alongside DeepSeek, Meta Llama, Google Gemma, Mistral — is part of the open-weights wave that has kept frontier-adjacent capabilities outside any single company’s control. Three practical implications for application teams:
- Self-hostable competitive models. You can run a model close to frontier quality on your own hardware. Useful for regulated industries, data residency, cost control.
- Fine-tuning is in reach. An 8B Qwen 3 LoRA fine-tune fits on a consumer GPU, so you can customize behavior on your domain without renting frontier-class compute (see the PEFT sketch after this list).
- Vendor flexibility. Production stacks designed to swap between Llama, Qwen, DeepSeek, and proprietary models keep leverage at the negotiating table.
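The PEFT sketch referenced in the fine-tuning bullet: attaching LoRA adapters to a Qwen 3 checkpoint takes a few lines with Hugging Face PEFT. Rank, targets, and the checkpoint name are illustrative assumptions, not an official Qwen recipe.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

cfg = LoraConfig(
    r=16,                       # assumed rank; tune for your task
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, cfg)
model.print_trainable_parameters()  # typically well under 1% of total params
```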
§ 05 · TAKING THIS FORWARD
Related model families
The other big open-weights model families covered in this drip series include DeepSeek (see DeepSeek-Math and DeepSeek-OCR) and the SLM section (Phi, Gemma, Llama-3.2 small). The open ecosystem is multi-polar in 2026 in a way it wasn’t in 2023.
§ · GOING DEEPER
What Qwen 3 actually changed
The Qwen line from Alibaba has been one of the most consistent open-weights families. Qwen 2 (Yang et al. 2024) introduced GQA, sliding-window attention in long-context variants, and improved tokenization for non-English text. Qwen 2.5 expanded the SKU coverage: dense and MoE variants, math and code specialists, multilingual capabilities. Qwen 3 (2025) continued the pattern: improved post-training, better long-context utilization, and reasoning-mode variants competitive with closed frontier models on math benchmarks.
The practical takeaway for builders: Qwen offers some of the best performance per dollar for multilingual workloads and is permissively licensed for commercial use. It is especially strong in Chinese and other Asian languages, where Llama-family models have historically been weaker. The ecosystem of Qwen sibling models (Qwen2-VL for vision, Qwen2-Audio for speech, Qwen2.5-Coder for code) gives you building blocks for multimodal applications without needing to roll your own.
§ · FURTHER READING
References & deeper sources
- Bai et al. (2023). Qwen Technical Report · arXiv
- Yang et al. (2024). Qwen2 Technical Report · arXiv
- Bai et al. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv
- Qwen Team (2024). Qwen2.5: A Party of Foundation Models · Qwen Blog
- Qwen Team (2025). Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens · Qwen Blog
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.