30 open-source projects similar to algorithmicsuperintelligence/optillm, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best OptiLLM alternative.
OptiLLM is an inference proxy and gateway router that directs prompts to specific language models based on cost, performance, and provider health. It functions as a middleware layer designed to optimize requests through intelligent routing, load balancing, and context management. The project provides specialized capabilities for data protection by anonymizing personally identifiable information before requests reach a model. It also acts as a reasoning orchestrator and tool integration layer, using inference-time loops and self-reflection to improve accuracy while connecting models to externa
omlx is a local inference server designed to run large language models, vision models, and embedding models on Apple Silicon. It provides a private alternative to industry-standard AI endpoints by hosting a local API gateway that mirrors OpenAI and Anthropic specifications. The system distinguishes itself through specialized hardware optimizations, including continuous batching for high throughput and a tiered caching system that offloads memory blocks to SSD. It also functions as a Model Context Protocol host, enabling the integration of local models with external tools, agents, and structur
Langroid is a multi-agent orchestration framework and tool integration suite designed for building complex AI applications. It serves as a multi-modal integration layer that connects diverse local and remote language models with an agentic retrieval-augmented generation system. The project distinguishes itself through a collaborative message-exchange paradigm, allowing specialized agents to delegate tasks hierarchically and coordinate via structured communication. It features an advanced state management system for conversational AI, including the ability to rewind and prune conversation hist
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
LiteRT-LM is a high-performance inference framework designed to execute large language models locally on mobile, desktop, and IoT hardware. It serves as an on-device model runtime that utilizes CPU, GPU, and NPU acceleration to provide low-latency processing. The framework is distinguished by its ability to process text, vision, and audio inputs through a single multi-modal inference engine. It features a local HTTP server that emulates OpenAI-compatible API endpoints and a WebGPU-based runtime for executing models directly within a web browser. To ensure output reliability, it includes a con
gpt4free-ts is a TypeScript-based LLM API proxy and gateway that provides a unified interface for accessing large language models without paid subscriptions or official API keys. It functions as a containerized AI bridge that routes requests to various free third-party providers to retrieve chat completions. The project acts as an OpenAI API wrapper, translating requests and responses into the standard OpenAI chat completions format to ensure compatibility with existing AI tools. It utilizes a provider-based routing system to distribute request loads across available endpoints. The gateway s
gpt-load is a transparent proxy gateway that routes API requests to multiple AI providers—including OpenAI, Google Gemini, and Anthropic Claude—through a single endpoint while preserving each provider's native format and authentication. It acts as a centralized routing layer, allowing applications to switch between AI services by changing only the base URL without modifying any client code or business logic. The proxy distinguishes itself through intelligent traffic management across pools of API keys, offering automatic key rotation, weighted or round-robin load balancing, and failover that
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
Presidio is a PII detection and anonymization framework designed to identify and mask personally identifiable information in text. It functions as a PII recognition pipeline and a data masking engine, using a combination of machine learning, regular expressions, and rule-based logic to locate sensitive entities. The system acts as an NER model orchestrator, allowing for the integration of external named entity recognition models and PII detectors to support multi-language privacy scrubbing. It employs a plugin-based recognizer architecture that can be extended with custom recognizers, deny-li
PrivateGPT is a private AI document assistant and local knowledge base manager designed for querying private files and documents using retrieval-augmented generation. It functions as a local language model application and API gateway, allowing users to obtain cited answers from unstructured data without sending information to external servers. The system differentiates itself by acting as a tool integrator that connects language models to external functions, including web search, tabular data analysis, and custom action extensions. It provides a standardized API layer that allows local infere
KTransformers is a framework for running, fine-tuning, and serving large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
POML is a prompt management framework and templating engine designed for authoring, versioning, and rendering structured prompts for large language models. It uses a semantic markup language to organize prompts into reusable templates, combining them with dynamic context and data to generate formatted inputs. The system distinguishes itself by decoupling core prompt logic from final presentation through a stylesheet-based approach. It provides a dedicated JSON schema output generator to enforce strict, machine-parsable model responses and a configuration interface for managing function tool s
This project is a framework for developing multimodal AI agents that function as programmable participants in real-time communication rooms. It enables the construction of agents that can see, hear, and speak by integrating speech-to-text, large language models, and text-to-speech pipelines to facilitate low-latency, natural conversations. The system is distinguished by its advanced orchestration of real-time media and conversational flow, including support for full-duplex speech, preemptive response generation, and sophisticated interruption management. It further differentiates itself throu
The agent-framework is an LLM agent orchestration framework and multi-agent workflow engine designed for building autonomous AI agents. It provides a tool integration layer for binding external functions, APIs, and sandboxed code as executable tools for language models. The framework distinguishes itself through a graph-based system for designing sequential and parallel task flows, featuring state management and checkpointing for long-running processes. It implements comprehensive conversational state management and an observability suite that uses telemetry to trace execution flows and monit
lollms-webui is a web-based user interface and local AI model orchestrator designed for interacting with and managing large language models and multimodal AI on local hardware. It functions as a generative AI multimedia suite that enables the creation of text, images, video, and music through integrated diffusion and language models. The project features a dedicated persona manager to configure behavioral profiles and distinct personalities, controlling the style and tone of model responses. It includes a local memory system for maintaining long-term conversation context and chat history via
AG2 is a multi-agent large language model orchestration framework, agentic workflow automation tool, and RAG-enabled agent platform. It functions as a communication protocol and framework for coordinating multiple AI agents to solve complex tasks through shared state and standardized messaging. The project distinguishes itself through flexible coordination strategies, including hierarchical agent organization, hub-and-spoke models, and dynamic routing that analyzes conversation context to distribute work. It implements multi-stage feedback loops for iterative refinement and uses schema-constr
node-DeepResearch is an autonomous web research engine that uses large language models to iteratively search, read, and reason over web content to answer complex questions. It provides a chat-based interface that displays real-time reasoning steps and final answers, and can be configured to focus exclusively on academic papers by limiting searches to academic repositories. The research engine runs an agentic search-read-reason loop that repeats until a stopping condition is satisfied. It enforces a token budget to cap total consumption and failed at
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
This project is a Python framework for building autonomous, event-driven agent systems. It provides a unified runtime for orchestrating multi-agent workflows, managing persistent conversation state, and executing code within secure, isolated sandbox environments. The framework is designed to handle complex task delegation, allowing agents to invoke other agents as tools while maintaining context across multi-turn interactions. The framework distinguishes itself through its deep integration with the Model Context Protocol, enabling agents to connect to external data sources and remote services
LocalAI is a local generative AI platform and inference engine designed to host large language, vision, and audio models on private hardware. It functions as an API-compatible gateway that mimics proprietary service endpoints, allowing existing third-party software to integrate with a self-hosted backend. The platform distinguishes itself as a distributed AI model orchestrator, capable of scaling inference across machine clusters using VRAM-aware routing and hardware coordination. It provides a unified interface for diverse open-source backends and supports self-hosted RAG infrastructure thro
Headroom is an AI gateway proxy and token optimizer designed to reduce the cost and latency of large language model interactions. It functions as an intermediary that intercepts traffic between clients and providers to apply context compression, request routing, and format translation. The system differentiates itself through a Model Context Protocol server implementation that delivers compression and retrieval tools to compatible AI hosts. It employs a content-aware compression pipeline and tiered importance scoring to trim redundant data from logs and tool outputs while preserving essential
PydanticAI is a Python framework designed for building production-grade autonomous agents. It provides a unified interface for interacting with diverse language models, enabling developers to construct agents that perform complex tasks through structured data validation, tool execution, and multi-turn conversation management. The library centers on type-safe schema enforcement, ensuring that model inputs and outputs remain consistent and reliable throughout the agent's lifecycle. The framework distinguishes itself through a robust architecture that emphasizes modularity and testability. It ut
AIOS is an LLM agent operating system and orchestration kernel designed to manage memory, resource scheduling, and tool execution for multiple autonomous AI agents. It serves as a comprehensive framework for developing and deploying agents, featuring a dedicated resource manager that coordinates model backends, GPU memory, and isolated kernel instances. The system distinguishes itself through a semantic memory engine that uses vector search and autonomous clustering for long-term knowledge management, and a semantic file system that allows users to control computer files and system operations
This repository is a collection of guides, notebooks, and recipes for implementing advanced prompting techniques and workflow patterns with large language models. It serves as a prompt engineering guide, an evaluation suite for scoring prompt quality, and a framework for orchestrating agents and integrating external tools. The project provides implementation patterns for building applications with Claude, specifically focusing on coordinating multiple models to split complex tasks between high-reasoning and high-efficiency agents. It includes technical demonstrations for multimodal data proce
PraisonAI is an autonomous AI agent platform that coordinates multiple LLM-powered agents for research, planning, and execution of complex workflows. It functions as a multi-agent orchestration framework, a workflow builder, and a Model Context Protocol server, while also providing retrieval-augmented generation through vector knowledge bases. Agents can interact via CLI, web, or standardized protocols with sandboxed code execution. The platform distinguishes itself with a rich set of agent communication protocols, including A2A, REST, WebSocket, voice and telephony integration, and MCP, allo