30 open-source projects similar to cocktailpeanut/dalai, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Dalai alternative.
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
Mistral Inference is a library for running Mistral large language models on a GPU, generating text from prompts with token streaming. It loads pretrained model weights from local disk or a remote registry into GPU memory, then produces output tokens one by one for real-time display in interactive applications. The library supports multimodal prompts that accept image URLs alongside text, enabling visual description and reasoning. It includes content safety guardrails that scan generated text against predefined policies to block or flag policy violations. For structured interactions, it provid
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVINO. It features a specialized ability to translate natural lang
quant-wiki is a comprehensive knowledge base and structured reference for quantitative finance, financial engineering, and algorithmic trading. It serves as a centralized library of documentation covering mathematical models, financial instruments, and systematic trading strategies. The project integrates AI-driven capabilities through a modular retrieval-augmented generation framework that extracts structured data from research papers and news. It features a multi-agent workflow engine designed to discover and validate predictive alpha factors, alongside tools for local large language model
llamafile is a model bundler and local runtime that packages large language models and their execution logic into single, portable executable files. It provides a distribution format for zero-installation local execution, allowing users to run models on various operating systems without managing external library dependencies or environment configurations. The project differentiates itself by bundling model weights and the runtime into one self-extracting binary. This approach simplifies the distribution of AI models, as the combined file contains everything necessary to run the model immediat
omlx is a local inference server designed to run large language models, vision models, and embedding models on Apple Silicon. It provides a private alternative to industry-standard AI endpoints by hosting a local API gateway that mirrors OpenAI and Anthropic specifications. The system distinguishes itself through specialized hardware optimizations, including continuous batching for high throughput and a tiered caching system that offloads memory blocks to SSD. It also functions as a Model Context Protocol host, enabling the integration of local models with external tools, agents, and structur
Llama-GPT is a self-hosted generative AI model runner that provides a private web interface for interacting with large language models. By executing these models directly on local hardware, it ensures that all intelligent assistance remains offline and independent of external cloud service providers. The project functions as a private assistant that maintains complete data ownership by storing all application state and model interactions on local storage volumes. It is designed to operate within a broader self-hosted computing environment, allowing users to maintain control over their persona
Page Assist is a browser-based AI integration tool that provides a sidebar interface for interacting with AI models while browsing the web. It focuses on private chat and web content analysis, letting users extract and query information from active webpages and receive context-aware responses. The project distinguishes itself through local AI integration, enabling connections to locally hosted models or private API endpoints to process data without relying on cloud services. It also supports collaborative AI conversations via public sharing links or self-hosted sharing infrastr
ChatGLM-6B is an open-source bilingual large language model designed for natural dialogue and text generation in both English and Chinese. It is structured as a dialogue model capable of tasks such as role-playing and information extraction. The project provides implementations for quantized language models, using low-precision weights to reduce GPU memory requirements for local inference. It also supports parameter-efficient fine-tuning, allowing model behavior to be optimized for specific tasks without requiring full retraining. The model includes capabilities for local execution on GPUs a
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
This project is an on-device AI SDK providing a framework for running large language models, vision models, and speech models locally. It serves as an orchestration layer for local LLM execution, ensuring data privacy and offline availability by utilizing hardware acceleration on the device. The SDK is distinguished by its comprehensive voice and multimodal capabilities, including a coordinated voice pipeline for activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation and tools for processing co
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
WebLLM is a library for executing large language models directly within web browsers. It provides a framework for building conversational artificial intelligence applications that perform inference locally, ensuring user data privacy by eliminating the need for external server dependencies. The project distinguishes itself by leveraging browser-native graphics APIs to perform intensive machine learning computations on the client side. It maintains application responsiveness by offloading heavy model tasks to background threads and ensures continuous operation through service workers that func
This is an open-source Python SDK for building and orchestrating production-grade AI agents. It provides a unified framework for creating conversational agents that can use tools, maintain state, and coordinate across multiple language model providers including OpenAI, Anthropic, Google, Amazon Bedrock, and locally-hosted models. The SDK supports multi-agent orchestration through graphs, teams, and swarms, allowing several specialized agents to collaborate on complex tasks. Agents can be composed as callable tools that other agents invoke, and the framework includes policy handlers that inspe
Kilocode is an autonomous engineering platform designed to orchestrate AI agents for complex software development tasks. It functions as a comprehensive system for automating coding, testing, and repository management by integrating directly with your codebase and terminal. The platform provides a unified gateway for model orchestration, allowing for the management of agentic workflows, event-driven automation, and persistent session state across distributed development environments. The platform distinguishes itself through its federated task management and policy-based access control, which
localGPT is a private AI knowledge base and retrieval-augmented generation application. It provides a local document indexer, a hybrid search engine, and an inference interface to enable chatting with private documents and managing a self-hosted information repository without sending data to external servers. The system distinguishes itself through a dual-pass verification pipeline that ensures generated answers are grounded in retrieved sources, accompanied by explicit source attribution. It employs a hybrid retrieval approach combining semantic vector search with keyword matching and rerank
WhisperLiveKit is a real-time speech-to-text server that transcribes streaming audio into text with ultra-low latency using Whisper models. It serves transcription capabilities through REST endpoints and WebSocket connections, enabling external applications to send audio and receive transcriptions as words are spoken, making it suitable for live captioning or voice interfaces. The project distinguishes itself by combining real-time transcription with speaker diarization, assigning transcribed words to individual speakers during live audio streams for meeting or interview transcripts. It also
VirtualWife is a framework for creating interactive 3D digital companions powered by large language models. It integrates a browser-based rendering engine that synchronizes 3D model animations and facial expressions with AI-generated dialogue in real time, supported by a voice interaction system that converts text into synthesized speech. The system features a persona manager for defining role-play prompts, visual identities, and long-term conversational memory. It also includes a bridge for live streaming integration, allowing an AI avatar to interact with live audiences by monitoring commen
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Shell GPT is an AI-powered command-line interface that generates shell commands and source code from natural language prompts. It serves as a terminal-based tool for automating technical tasks, producing executable commands, and generating code snippets directly within the shell. The tool distinguishes itself through a read-eval-print loop for interactive chatting and the ability to maintain stateful conversational history via named sessions. It supports flexible backend routing, allowing users to connect to cloud-based APIs or local language model hosts for offline operation and data privacy
node-DeepResearch is an autonomous web research engine that uses large language models to iteratively search, read, and reason over web content to answer complex questions. It provides a chat-based interface that displays real-time reasoning steps and final answers, and can be configured to focus exclusively on academic papers by limiting searches to academic repositories. The research engine runs an agentic search-read-reason loop that repeats until a stopping condition is satisfied. It enforces a token budget to cap total consumption and failed at
This project is a containerized development stack and application framework for building retrieval-augmented generation systems. It provides a dockerized AI sandbox that integrates local model runtimes, knowledge graphs, and vector stores to enable the creation of contextual chatbots. The stack is distinguished by its graph-based vector store, which combines structured knowledge graphs with vector indices for both semantic and structural data retrieval. It allows for local model hosting with CPU or GPU acceleration, enabling generative tasks without reliance on external cloud APIs. The frame
picoGPT is a lightweight, low-level runtime environment and inference engine designed to load pre-trained checkpoints and execute generative transformer model inference. It provides a minimal implementation of the generative pre-trained transformer architecture to facilitate local language model execution. The project includes a C++ machine learning library for converting model parameters and executing greedy token generation without heavy external dependencies. It handles remote asset synchronization by downloading pre-trained weights, hyperparameters, and vocabulary files from remote server
Jan is a local language model desktop application and AI assistant orchestrator. It provides a unified interface for interacting with both resident models and remote cloud AI providers. The project functions as a host for the Model Context Protocol, connecting AI models to external tools and data sources. It also operates as an OpenAI-compatible API server, exposing local models through a standardized server endpoint for other applications to query. The system supports the creation of specialized AI personas with custom instructions and allows for the management of hybrid model environments,
ollama-python is a Python client for interacting with large language models. It provides an interface for sending prompts to receive text and chat completions, as well as a dedicated client for generating numerical vector embeddings from text. The project includes a wrapper that emulates the OpenAI API, allowing applications built for that standard to interact with local models. It also provides a non-blocking asynchronous client for executing concurrent requests. The library covers the full model lifecycle, including the ability to pull, create, list, and delete models within a local enviro
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
The free AI already on your Mac. CLI tool, OpenAI-compatible server, and interactive chat — all on-device via Apple Intelligence. No API keys, no cloud, no downloads.