Overview
- DeepSeek published a multimodal large model and a detailed technical report on GitHub for public review.
- The report introduces Thinking with Visual Primitives, which embeds points and boxes into the model’s reasoning steps to tie each step to exact image coordinates.
- The authors say this approach addresses a referential gap in spatial tasks where vague language fails to pinpoint locations, rather than only chasing higher image resolution.
- DeepSeek reports parity on counting and spatial benchmarks with GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash, though the results still need independent validation.
- The company has already offered an image-understanding mode in its product, signaling that these research ideas are moving into user-facing features.