Overview
- Ollama is presented as the preferred local runner, exposing an OpenAI‑compatible API at localhost:11434 so most IDE plugins can connect with no code changes.
- DeepSeek‑Coder‑V2 16B is the default pick for balanced speed and skill on machines with 16–32GB memory or a mid‑range NVIDIA GPU, with Apple Silicon laptops running even larger models via unified memory.
- A step‑by‑step checklist covers installing Ollama, pulling models like qwen2.5‑coder:7b or deepseek‑coder‑v2:16b, and running a quick local prompt to confirm everything works.
- Continue.dev in VS Code is the main interface for chat and inline autocomplete, with optional Open WebUI for longer architecture sessions and Aider for commit‑aware, multi‑file edits.
- Speed tips include using quantized weights for smaller loads, keeping the context window tight, pre‑warming the model to avoid first‑response delays, and enabling GPU offloading or flash attention where supported.