ds4

ds4 is a local inference engine for DeepSeek models. It includes a distributed runtime that splits transformer layers across networked computers, a weight streamer that reads model weights from local disk, controls for the model's reasoning mode, and an API server that streams chat completions through OpenAI-compatible endpoints.
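As a rough illustration of splitting transformer layers across machines, the sketch below assigns each node a contiguous range of layers in proportion to its memory. The function name `split_layers` and the proportional-to-memory policy are assumptions made for this example; the source does not describe how ds4 actually chooses the split.

```python
def split_layers(num_layers, node_memory_gb):
    """Assign contiguous layer ranges to nodes, proportional to node memory.

    Hypothetical helper for illustration only; not ds4's actual policy.
    """
    if num_layers < 0 or not node_memory_gb:
        raise ValueError("need a non-negative layer count and at least one node")
    total = sum(node_memory_gb)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(node_memory_gb):
        cumulative += mem
        if i == len(node_memory_gb) - 1:
            end = num_layers  # last node takes whatever remains
        else:
            end = round(num_layers * cumulative / total)
        ranges.append(range(start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # Two nodes with 64 GB and 128 GB sharing a 61-layer model.
    for node, layers in enumerate(split_layers(61, [64, 128])):
        print(f"node {node}: layers {layers.start}..{layers.stop - 1}")
```

Keeping each node's layers contiguous means activations cross the network only once per node boundary, which is the usual reason for splitting by layer rather than by individual tensor.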

To run models larger than available system RAM, ds4 loads model experts from disk on demand. It also provides controls for reasoning effort and for steering model behavior, where direction vectors applied to internal activations change the characteristics of the response.
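One common way to load experts on demand is a bounded cache that reads an expert from disk on a miss and evicts the least recently used one when full. The sketch below shows that idea; `ExpertCache`, `load_fn`, and the LRU eviction policy are all assumptions for illustration, since the source does not say which policy ds4 uses.

```python
from collections import OrderedDict


class ExpertCache:
    """Bounded LRU cache of expert weights (hypothetical sketch)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max experts resident in memory
        self.load_fn = load_fn        # assumed callable: (layer, expert) -> weights
        self._cache = OrderedDict()
        self.loads = 0                # number of loads that went to disk

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self._cache:
            # Hit: mark as most recently used.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Miss: load from disk, then evict the oldest entry if over capacity.
        weights = self.load_fn(layer, expert)
        self.loads += 1
        self._cache[key] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return weights


if __name__ == "__main__":
    # Stand-in loader; a real one would read tensors from the model file.
    cache = ExpertCache(2, lambda layer, expert: f"weights[{layer}][{expert}]")
    for expert in (1, 2, 1, 3, 1):
        cache.get(0, expert)
    print("disk loads:", cache.loads)
```

Because a mixture-of-experts model activates only a few experts per token, a cache like this can keep the working set small even when the full set of experts does not fit in RAM.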

The project also supports hardware acceleration on Metal, CUDA, and ROCm, and persists prompt states and agent sessions to disk. It includes tools for benchmarking inference throughput, evaluating model capability, and limiting power consumption to reduce hardware heat.
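To make the power-limiting idea concrete, the sketch below computes how long to pause after each unit of work so that average power approaches a target. It assumes a simple two-level power model (a fixed active draw and a fixed idle draw); the function name and the model are assumptions for illustration, not ds4's calibration method.

```python
def pause_for_target(work_s, active_w, idle_w, target_w):
    """Seconds to idle after `work_s` seconds of compute so that the
    time-averaged power is about `target_w`.

    Assumes power is `active_w` while computing and `idle_w` while paused.
    """
    if target_w >= active_w:
        return 0.0  # target is not limiting; no pause needed
    if target_w <= idle_w:
        raise ValueError("target must be above idle power")
    # Fraction of time spent computing that yields the target average.
    duty = (target_w - idle_w) / (active_w - idle_w)
    return work_s * (1.0 - duty) / duty


if __name__ == "__main__":
    # 100 ms of work at 100 W, 20 W idle, aiming for a 60 W average.
    print(f"pause {pause_for_target(0.1, 100.0, 20.0, 60.0):.3f} s")
```

Under this model the average power is `duty * active_w + (1 - duty) * idle_w`, so solving for `duty` and converting it to a pause length gives the formula used above. Throughput falls by roughly the same duty fraction.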

Features

Local AI Inference - Runs DeepSeek models on local Metal, CUDA, or ROCm hardware for private generative AI tasks.

Activation Steering Vectors - Modifies model output characteristics by applying direction vectors to internal activations during inference.

Distributed Model Execution - Splits transformer layers across networked computers to execute models that exceed the memory of a single device.

Local Inference Runtimes - Executes deep learning models on local Metal, CUDA, or ROCm hardware for offline inference.

Local Inference Engines - Provides an optimized local execution environment for DeepSeek models with multi-GPU acceleration.

Model Steering Tools - Controls verbosity, safety, and reasoning depth using activation directions and thinking mode toggles.

Runtime Weight Loading - Loads specific model experts from disk into memory on demand to run models larger than available RAM.

Reasoning Mode Controllers - Adjusts internal thinking depth and modifies output characteristics through activation steering.

Unified GPU Backend Abstractions - Unifies hardware calls for Metal, CUDA, and ROCm to execute operations across different GPU architectures.

Model Weight Offloading - Manages memory by loading model experts from disk on demand, so that models larger than available system RAM can run.

Distributed Runtimes - Orchestrates transformer layer execution across multiple networked computers to handle massive models.

Model Sharding - Splits transformer layers across multiple networked machines to run models exceeding single-device memory.

Expert Weight Streaming - Loads specific model experts from disk on demand to execute models that exceed available system RAM.

Agent Session Management - Saves and resumes interactive coding or chat sessions by caching prompt states and agent data.

Prefix Caching - Stores processed prompt prefixes on disk to avoid recalculating the initial input tokens upon session resumption.

Disk Caching Systems - Stores processed prompt prefixes on disk to allow session resumption without recalculating input tokens.

Compute Throttling - Inserts calibrated pauses between compute units to target specific power consumption levels and reduce hardware heat.

Agentic Session Persistence - Saves interactive session data to disk to allow resuming tasks without repeating the initial prompt prefill.

OpenAI-Compatible Servers - Implements the OpenAI API specification to stream completions via industry-standard endpoints.

OpenAI-Compatible API Servers - Provides a local server with standard endpoints to stream chat completions to external applications.

Headless Server Hosting - Provides a headless API server to stream chat completions and messages via server-sent events.

AI Agents - Local inference engine for running models on a MacBook.
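The activation steering entries above describe applying direction vectors to internal activations. In its most common form this is an addition of a scaled direction to a hidden state, as in the sketch below. The function name `steer`, the normalization of the direction, and the use of plain Python lists are assumptions for illustration; the source does not specify ds4's exact formulation.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * unit(direction).

    `hidden` and `direction` are equal-length lists of floats. A positive
    `strength` pushes the activation along the direction, a negative one
    pushes against it.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        return list(hidden)  # a zero direction leaves activations unchanged
    return [h + strength * d / norm for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    print(steer([1.0, 2.0], [3.0, 4.0], 5.0))
```

Normalizing the direction makes `strength` comparable across different vectors, so the same value moves the activation by the same distance regardless of how the direction was extracted.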

antirezds4
