DeepSeek-OCR
Optical character recognition, redone with modern vision-language models. DeepSeek’s OCR variant — and the wave of similar models — pushed document understanding from clean printed text to messy real-world scans, layouts, tables, and handwriting.
The five-bullet version
- OCR — optical character recognition — used to be a specialized vision pipeline (preprocess, detect, recognize, post-process).
- Vision-language models (VLMs) can do OCR end-to-end. The image goes in; the text comes out.
- DeepSeek-OCR is one of several VLMs purpose-tuned for document understanding — extracting text, tables, and structure from images.
- Beats classical OCR on messy / multi-language / handwritten / layout-heavy documents.
- Now part of a broader category: document AI, where models extract structure, not just text.
§ 00 · OCR, THEN AND NOW · From specialized pipelines to VLMs
Classical OCR (optical character recognition, the task of converting an image of text into digital text) was a multi-stage pipeline: binarize the image, find text regions, recognize characters in each region, fix common errors. Each stage was a separate model or hand-tuned algorithm. Tesseract, ABBYY, and the cloud providers’ OCR APIs all worked this way; a minimal sketch of this pipeline follows the list below.
Classical OCR works well on clean printed text. It struggles on:
- Skewed scans, photos taken at angles.
- Handwriting.
- Tables (where layout matters as much as characters).
- Multi-language pages.
- Stylized fonts, low contrast, watermarks, stamps.
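For reference, here is a minimal sketch of the classical pipeline using OpenCV and Tesseract (via pytesseract). The file path is a placeholder, and real systems add deskewing, region detection, and post-correction stages on top of this skeleton.

```python
import cv2
import pytesseract

# Stage 1: preprocessing. Load the scan and binarize it.
img = cv2.imread("scan.png")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu thresholding, a common binarization choice for printed pages
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Stages 2-3: Tesseract runs text detection and character recognition
# internally on the binarized image.
text = pytesseract.image_to_string(binary)

# Stage 4: post-processing, here just collapsing stray whitespace.
print(" ".join(text.split()))
```

Each call above is a stage boundary; an error in binarization propagates into detection and recognition, which is exactly the compounding the end-to-end approach in the next section avoids.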
§ 01 · A VISION-LANGUAGE MODEL APPROACH · End-to-end document understanding
A vision-language model (see the ViT and BERT lessons for the building blocks) takes an image and emits text. With the right training data, that text can include not just the document’s words but also its structure — “here’s the table, here’s the heading, this paragraph is a footnote.”
The end-to-end VLM approach has structural advantages (a sketch of the call pattern follows this list):
- No staged error compounding. Classical OCR’s accuracy is roughly the product of each stage’s accuracy; an end-to-end model avoids that compounding.
- Context-aware recognition. The model can use surrounding words to disambiguate uncertain characters, a feature classical OCR has only in weak forms (dictionary-based post-correction, for example).
- Layout-aware. The model sees the layout directly and can emit structured output (markdown, HTML, JSON) representing it.
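To make that shape concrete, here is a minimal sketch of an end-to-end call against an OpenAI-compatible vision endpoint. The URL, model name, and prompt are placeholder assumptions, not DeepSeek-OCR’s documented serving interface; the point is the structure: one image in, structured text out.

```python
import base64
import requests

# Encode the document image for the multimodal message payload.
with open("invoice.png", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "deepseek-ocr",  # placeholder model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Transcribe this document as markdown, preserving "
                     "headings, tables, and footnotes."},
        ],
    }],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])  # markdown transcript
```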
§ 02 · WHAT DEEPSEEK-OCR CONTRIBUTES · One implementation, broader category
DeepSeek-OCR, a 2025 model tuned specifically for document understanding (converting images of documents, including scans, photos, mixed languages, and complex layouts, into structured text output), is DeepSeek’s entry in the document-AI space. Specific emphases:
- Multilingual. Strong on Chinese and major European languages, mixed-script documents.
- Layout-preserving. Outputs structured text (markdown, sometimes HTML) that reflects document structure.
- Table extraction. Specifically trained on tabular data, outputting markdown tables or JSON rather than stripped flat text (a parsing sketch follows below).
- Math and formula support. LaTeX output for mathematical content.
Related models in the same category: Qwen-VL-OCR, GPT-4o-vision, Claude with vision input, Microsoft’s Florence-2, open models like Idefics, MiniCPM-V, InternVL.
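Downstream code can consume the structured output these models emit directly. Below is a small self-contained sketch that parses a markdown table of the kind they produce into a list of records; the sample table is invented for illustration.

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a simple pipe-delimited markdown table into row dicts."""
    rows = [line.strip().strip("|") for line in md.strip().splitlines()]
    header = [cell.strip() for cell in rows[0].split("|")]
    records = []
    for line in rows[2:]:  # rows[1] is the |---|---| separator
        cells = [cell.strip() for cell in line.split("|")]
        records.append(dict(zip(header, cells)))
    return records

# Invented sample of the kind of table a document model might emit
table = """
| Item   | Amount |
|--------|--------|
| Coffee | 3.50   |
| Bagel  | 2.25   |
"""
print(parse_markdown_table(table))
# [{'Item': 'Coffee', 'Amount': '3.50'}, {'Item': 'Bagel', 'Amount': '2.25'}]
```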
§ 03 · USE CASES THE MODERN STACK UNLOCKS · What you can build with this
- Invoice / receipt extraction. Parse a photo of a receipt into structured (item, amount) pairs. Classical OCR can do this; modern models do it well across formats without specific tuning (see the sketch after this list).
- Contract digitization. Convert scanned legal documents to structured markdown for downstream search and RAG.
- Forms processing. Read filled-in paper forms with checkboxes, signatures, handwritten fields.
- Academic paper ingestion. Convert PDFs (including equations, tables, figures with captions) to markdown for LLM consumption. ArXiv-to-text pipelines that worked poorly with classical OCR work well now.
- Historical archives. Old print, mixed scripts, smudged copies. Modern VLMs are noticeably better than classical OCR on these.
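For the receipt case, here is a sketch reusing the call pattern from § 01. The helper below and its endpoint are assumptions, not a documented interface, and the JSON schema in the prompt is one reasonable choice among many.

```python
import base64
import json
import requests

def ask_vlm(image_path: str, instruction: str,
            url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Send one image plus an instruction to an OpenAI-compatible
    endpoint (hypothetical helper; same pattern as the § 01 sketch)."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    payload = {
        "model": "deepseek-ocr",  # placeholder model name
        "messages": [{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": instruction},
        ]}],
    }
    resp = requests.post(url, json=payload)
    return resp.json()["choices"][0]["message"]["content"]

raw = ask_vlm("receipt.jpg",  # placeholder path
              'Extract line items as JSON: [{"item": str, "amount": float}, ...]. '
              "Return only the JSON array.")
for entry in json.loads(raw):
    print(f'{entry["item"]}: {float(entry["amount"]):.2f}')
```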
§ 04 · LIMITS AND ADVERSARIAL CASES · Where this still struggles
- Latency. A VLM call is slower than a classical OCR pipeline. For high-volume batch processing, this matters.
- Hallucination. The same generative behavior that makes VLM OCR robust to fuzzy input can also produce confidently wrong text where classical OCR would have failed openly. Confidence calibration is harder.
- Reproducibility. Two runs of the same document can produce slightly different output; the model is non-deterministic at default sampling settings (a mitigation sketch follows this list).
- Adversarial / unusual documents. Highly stylized designs, ASCII art, watermarked PDFs can still trip the model up.
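For the reproducibility point, the usual mitigation is greedy decoding with a fixed seed. The request fields below are a sketch: `temperature: 0` removes sampling randomness on any stack, while `seed` is best-effort and only honored by some serving stacks (vLLM and the OpenAI API expose it, for example).

```python
payload = {
    "model": "deepseek-ocr",  # placeholder model name
    "temperature": 0,         # greedy decoding: no sampling randomness
    "seed": 42,               # best-effort determinism where supported
    "messages": [],           # image + instruction as in the earlier sketches
}
```

Even then, bit-for-bit determinism is not guaranteed: batching effects and floating-point nondeterminism on GPUs can still shift an occasional token.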
§ 05 · TAKING THIS FORWARD · Where document AI is heading
Three trends worth following:
- Vision encoders in general LLMs. The line between “OCR model” and “LLM with vision” is blurring; the expectation is that by 2026 most production document workloads will simply use the same multimodal model they use for everything else.
- Long-context VLMs. Send the whole multi-page PDF as images and get a structured answer back. No chunking, no per-page OCR (a sketch follows this list).
- Smaller, faster variants. High-volume economics push toward SLM-class document models; compact VLMs like Florence-2 are the deployable end of the spectrum.
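A sketch of the long-context pattern, assuming pdf2image (a wrapper around poppler) for rasterization and the same placeholder endpoint as the earlier sketches:

```python
import base64
import io

import requests
from pdf2image import convert_from_path  # pip install pdf2image; needs poppler

pages = convert_from_path("paper.pdf", dpi=150)  # placeholder path; PIL images

# One message containing every page, followed by a single instruction.
content = []
for page in pages:
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"}})
content.append({"type": "text",
                "text": "Extract every table in this document as markdown."})

payload = {"model": "deepseek-ocr",  # placeholder model name
           "messages": [{"role": "user", "content": content}]}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```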
§ · GOING DEEPER · VLM-based OCR and the move beyond reading text
Classical OCR engines (Tesseract, ABBYY) read characters and produce text. Vision-language models do something broader: read characters in context, recover layout, parse tables, extract structured fields, and answer questions about the document. DeepSeek-VL2 (2024) and contemporaries like Qwen2-VL (Bai et al. 2024) were trained on document-heavy data and significantly outperform classical pipelines on complex layouts.
Two threads worth knowing. GOT-OCR2.0 (Wei et al. 2024) is purpose-built for general-purpose OCR with a unified end-to-end model; it handles math, sheet music, charts, and formulas as well as plain text. Microsoft’s Florence-2 (2024) is a smaller VLM that handles OCR, detection, segmentation, and captioning with a single token-prefix conditioning scheme. The frontier is converging: extracting structure from documents is no longer a separate subfield; it’s a capability of general multimodal models.
§ · FURTHER READING · References & deeper sources
- (2024). DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding · arXiv
- (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution · arXiv
- (2024). General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (GOT-OCR2.0) · arXiv
- (2024). LLaVA-OneVision: Easy Visual Task Transfer · arXiv
- (2024). Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks · arXiv