30 open-source projects similar to mistralai/mistral-inference, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Mistral Inference alternative.
The official PyTorch implementation of Google's Gemma models
GLM-4 is a large language model and fine-tuning framework designed for human-like text production, complex reasoning, and multilingual conversation. It functions as a multimodal system capable of processing high-resolution visual content and as a long-context model designed to analyze documents with a context window of up to one million tokens. The project differentiates itself through a function calling interface that enables AI agent development by connecting the model to external APIs and real-time web browsing. It includes specialized capabilities for generating functional programming cod
This project is a large language model inference library and framework designed to run models for text generation, problem solving, and coding assistance. It includes a multimodal framework for processing combined image and text inputs and a tool-use implementation that enables the execution of external functions based on model reasoning. The system features a distributed GPU inference engine that spreads large model workloads across multiple graphics processors to increase processing speed and meet memory requirements. It also provides containerized model deployment through pre-packaged imag
This is an asynchronous Swift client library for calling OpenAI’s API across Apple platforms. It provides native access to chat completions, image generation and editing, speech synthesis and transcription, text embeddings, and content moderation through a single interface built on Swift’s async-await concurrency model. The client supports structured output generation by constraining model responses to a provided JSON schema, and enables real-time consumption of generated text through streaming responses delivered as an AsyncSequence. It includes a thread-based conversation model for managing
The simplest way to run LLaMA on your local machine
This is an open-source Python SDK for building and orchestrating production-grade AI agents. It provides a unified framework for creating conversational agents that can use tools, maintain state, and coordinate across multiple language model providers including OpenAI, Anthropic, Google, Amazon Bedrock, and locally-hosted models. The SDK supports multi-agent orchestration through graphs, teams, and swarms, allowing several specialized agents to collaborate on complex tasks. Agents can be composed as callable tools that other agents invoke, and the framework includes policy handlers that inspe
picoGPT is a lightweight, low-level runtime environment and inference engine designed to load pre-trained checkpoints and execute generative transformer model inference. It provides a minimal implementation of the generative pre-trained transformer architecture to facilitate local language model execution. The project includes a C++ machine learning library for converting model parameters and executing greedy token generation without heavy external dependencies. It handles remote asset synchronization by downloading pre-trained weights, hyperparameters, and vocabulary files from remote server
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
gpt-neox is a distributed training system and framework for building large-scale autoregressive language models. It implements the transformer architecture and provides a toolkit for training models with billions of parameters by distributing weights across compute clusters. The framework distinguishes itself through extensive support for distributed model parallelism, including pipeline and sequence parallelism, to overcome single-device memory limits. It further supports sparse model architectures using a mixture of experts system with Sinkhorn-based routing. The project covers a broad ran
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
Llama is a large language model runtime and inference engine designed to load and execute autoregressive transformer models. It enables the generation of natural language text completions from prompts using pretrained weights. The system features multi-GPU model parallelism, which distributes model weights and workloads across multiple graphics processors to support larger parameter counts. It also incorporates a content safety filter that uses classifiers to intercept and block unsafe inputs or outputs during the inference process. The project covers broad capabilities in distributed model
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Guardrails is a Python SDK that wraps calls to large language models with configurable validation pipelines, corrective actions, and structured output generation. It provides a unified API layer that connects to over 100 language models, applying consistent validation, streaming, and error-handling across providers. The framework validates and corrects model responses against safety and quality rules, detecting and mitigating risks in both inputs and outputs using pre-built and custom validators. The project distinguishes itself through a validator-pipeline architecture that sequentially appl
LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments. The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
Archgw is a gateway proxy and data plane designed for agentic applications, providing a centralized layer for routing, safety, and orchestration between application logic and multiple large language model providers. It functions as an AI agent orchestrator that automates the execution of agent workflows to remove repetitive plumbing from the core codebase. The system features a provider-agnostic interface layer that standardizes disparate model APIs into a single format and a transparent proxy data plane to intercept traffic. It employs rule-based routing to decouple application logic from sp
This is a Backend as a Service SDK for Apple platforms, providing a collection of libraries that connect iOS and macOS applications to cloud databases, authentication services, and serverless infrastructure. It serves as a developer kit for integrating real-time data synchronization, file storage, and push notifications into native apps. The SDK is distinguished by its generative AI integration, which routes text and multimodal prompts between on-device models and cloud-hosted large language models. It further differentiates itself with a specialized app distribution tool for managing pre-rel
Claude Code is a command-line interface and multi-agent orchestration framework designed for autonomous software engineering. It enables AI agents to perform codebase modifications, debugging, and Git workflow management while coordinating multiple specialized agents to decompose and execute complex engineering tasks in parallel. The system distinguishes itself through a high degree of isolation and safety, utilizing Git worktrees to create independent working directories for concurrent agents and implementing a tiered permission system that combines user rules, project policies, and OS-level
Baichuan-7B is an open-source 7 billion parameter bilingual Transformer model designed for text generation and few-shot learning across Chinese and English. It is built on a large Transformer architecture trained on a bilingual corpus, enabling it to produce coherent text in both languages from a single model. The model incorporates several optimization techniques that distinguish it from standard large language models. It uses rotary position embeddings that can extrapolate to longer sequences than seen during training, allowing context extension beyond the original 4096-token training lengt
Llama 3 is a collection of pretrained, autoregressive transformer-based models designed for natural language generation, reasoning, and complex instruction following. It functions as a generative AI framework that provides the infrastructure for managing model weights, executing neural network inference, and handling computational workloads across diverse knowledge domains. The project distinguishes itself through an integrated AI safety toolkit that employs secondary classification filtering to inspect inputs and outputs, ensuring adherence to usage compliance and safety standards. It suppor
This project is a manual reconstruction of the Llama 3 transformer architecture implemented as a PyTorch neural network. It serves as a reference for the internal mathematical structure and tensor flow of a transformer-based language model designed for next token prediction. The implementation focuses on building the model from scratch using basic matrix operations and tensor manipulations. It demonstrates the manual construction of core components, including rotary positional embeddings, multi-head self-attention, and root mean square normalization. The codebase covers the full inference pi
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-perfor
InternLM is a large language model and a comprehensive suite of weights designed for text generation and complex reasoning. It functions as an inference engine for serving responses, a fine-tuning framework for adjusting model weights, and a platform for building autonomous AI agents. The system is capable of processing long-context input sequences up to one million tokens for document analysis. It employs chain-of-thought reasoning to solve knowledge-intensive tasks by generating intermediate logic steps before producing a final answer. The project covers model weight optimization through s
Qwen3 is a transformer-based large language model designed as a generative AI foundation for understanding, reasoning, and generating human language. It functions as a comprehensive ecosystem for model training, fine-tuning, and production-ready inference, providing the underlying architecture and weights necessary to build diverse artificial intelligence applications. The project distinguishes itself through extensive support for model quantization and distributed inference, enabling efficient execution across a wide range of hardware from consumer-grade devices to scalable cloud infrastruct
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It provides a platform for executing automated evaluation tasks to measure model accuracy and performance, ensuring consistent assessment through a structured configuration schema. The framework distinguishes itself by incorporating a dedicated utility for data decontamination, which identifies and removes overlapping training samples from evaluation sets to prevent data leakage. It also features a flexible task builder that allows users to define custom benc
Llama2.c is a minimal inference engine designed to execute transformer-based language models using only standard C code. By implementing neural network forward passes without external dependencies or complex runtime environments, it provides a lightweight execution environment for running pre-trained models. The project distinguishes itself through a focus on portability and resource efficiency. It utilizes static memory allocation to avoid dynamic heap management and maps model parameter files directly into the process address space to minimize memory overhead. The implementation relies on s