ds4 is a local inference engine for DeepSeek models. It includes a distributed runtime that splits a model's transformer layers across networked computers. It combines a reasoning controller, a local weight streamer, and an API server that streams chat completions through industry-standard endpoints.
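As a rough illustration of splitting layers across machines, the sketch below assigns contiguous, near-equal blocks of layer indices to each node. It is not ds4's actual scheduling logic: the function name `partition_layers` and the even-split policy are assumptions made for this example, and a real runtime would likely also weigh each node's memory and link speed.

```python
def partition_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal ranges.

    Hypothetical helper for illustration only; not part of ds4.
    """
    if num_layers < 0 or num_nodes < 1:
        raise ValueError("need num_layers >= 0 and num_nodes >= 1")

    # Every node gets `base` layers; the first `extra` nodes get one more.
    base, extra = divmod(num_layers, num_nodes)

    ranges = []
    start = 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for node, layers in enumerate(partition_layers(10, 3)):
        print(f"node {node}: layers {layers.start}..{layers.stop - 1}")
```

Keeping each node's layers contiguous means activations cross the network only once per node boundary during a forward pass.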
Its memory management model loads model experts from disk on demand, which lets it run models that exceed available system RAM. It also provides controls for reasoning effort and for steering model behavior, so that response characteristics can be modified through activation directions.
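The on-demand loading idea can be sketched as a bounded cache that reads an expert's weights only when they are first needed and evicts the least recently used expert once the memory budget is full. This is a minimal sketch under stated assumptions: the `ExpertCache` class, the `load_fn` callback, and the least-recently-used eviction policy are all hypothetical and chosen for illustration; the source does not say which policy ds4 uses.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keep at most `capacity` experts in memory (hypothetical, not ds4's API)."""

    def __init__(self, capacity: int, load_fn: Callable[[int], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn        # stands in for a read from disk
        self._cache = OrderedDict()   # expert_id -> weights, oldest first
        self.loads = 0                # number of simulated disk reads

    def get(self, expert_id: int) -> bytes:
        # Cache hit: mark the expert as most recently used and return it.
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]

        # Cache miss: load from "disk", then evict the oldest entry if
        # the cache has grown past its capacity.
        weights = self.load_fn(expert_id)
        self.loads += 1
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return weights

    def resident(self) -> list[int]:
        """Expert ids currently in memory, least recently used first."""
        return list(self._cache)


if __name__ == "__main__":
    cache = ExpertCache(capacity=2, load_fn=lambda i: bytes([i]))
    for expert_id in (1, 2, 1, 3):
        cache.get(expert_id)
    print(cache.resident(), cache.loads)
```

In a mixture-of-experts model only a few experts are active per token, which is why a cache much smaller than the full weight set can still serve most requests from memory.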
The project covers a broad capability surface, including hardware acceleration on Metal, CUDA, and ROCm, and disk persistence for prompt states and agent sessions. It also includes tools for benchmarking inference throughput, evaluating model capabilities, and limiting power consumption to manage hardware heat.