Particle: DeepSeek Releases Multimodal Model, Pitches 'Visual Primitives' for Spatial Reasoning

Overview

DeepSeek published a multimodal large model and a detailed technical report on GitHub for public review.
The report introduces Thinking with Visual Primitives, which embeds points and boxes into the model’s reasoning steps to tie each step to exact image coordinates.
The authors say this approach addresses a referential gap in spatial tasks where vague language fails to pinpoint locations, rather than only chasing higher image resolution.
DeepSeek reports parity on counting and spatial benchmarks with GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash, though the results still need independent validation.
The company has already offered an image-understanding mode in its product, signaling that these research ideas are moving into user-facing features.