Particle.news

DeepSeek Releases Multimodal Model, Pitches 'Visual Primitives' for Spatial Reasoning

The company says its compact system matches top rivals on key vision tests using far fewer image annotations.

Overview

  • DeepSeek published a multimodal large model and a detailed technical report on GitHub for public review.
  • The report introduces Thinking with Visual Primitives, which embeds points and boxes into the model’s reasoning steps to tie each step to exact image coordinates.
  • The authors say this approach addresses a referential gap in spatial tasks where vague language fails to pinpoint locations, rather than only chasing higher image resolution.
  • DeepSeek reports parity on counting and spatial benchmarks with GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash, though the results still need independent validation.
  • The company has already offered an image-understanding mode in its product, signaling that these research ideas are moving into user-facing features.