30 open-source projects similar to kyutai-labs/moshi, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Moshi alternative.
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
The nexa-sdk is an on-device AI SDK and multimodal inference engine designed to run large language, vision, and audio models locally on mobile and desktop hardware. It functions as a local LLM runtime and NPU acceleration framework, enabling the execution of generative and discriminative models without reliance on cloud services. The project distinguishes itself through a dedicated NPU acceleration framework that optimizes model execution on Neural Processing Units to reduce latency and power consumption. It employs hardware-agnostic backend routing to dynamically distribute computations acro
LitGPT is a training and deployment framework for large language models, providing a suite of tools for pretraining, finetuning, quantizing, evaluating, and serving models within a production environment. It includes a dedicated training pipeline for adapting pretrained models to specific tasks, a quantization tool for reducing weight precision, and an inference server for hosting models via web interfaces. The framework supports high-performance model development through custom architecture implementation and the use of predefined recipes to standardize pretraining and finetuning. It enables
This project is a framework for developing multimodal AI agents that function as programmable participants in real-time communication rooms. It enables the construction of agents that can see, hear, and speak by integrating speech-to-text, large language models, and text-to-speech pipelines to facilitate low-latency, natural conversations. The system is distinguished by its advanced orchestration of real-time media and conversational flow, including support for full-duplex speech, preemptive response generation, and sophisticated interruption management. It further differentiates itself throu
Personaplex is an LLM speech-to-speech framework and conversational AI persona engine designed for real-time voice interfaces. It provides a system for defining AI identities and vocal characteristics through a combination of text-based role prompts and audio reference files. The project features a real-time AI voice interface that supports full-duplex human-AI dialogue, enabling multiple parties to speak and listen simultaneously via bidirectional audio streaming. It includes a GPU-accelerated audio processor and a speech-to-speech pipeline to facilitate low-latency conversations. The frame
ChatGLM2-6B is a bilingual chat large language model designed for natural conversation and text generation in both English and Chinese. It functions as a fine-tunable language model that supports updating weights via specialized scripts to adapt to specific datasets and tasks. The project serves as a quantized inference engine and multi-GPU model orchestrator, enabling the execution of large models on consumer-grade hardware. It is capable of processing long context sequences up to 32K tokens to maintain understanding across extended documents. The system covers capabilities for multilingual
Tengine is a suite of tooling and a lightweight execution engine designed for running deep learning models on constrained embedded hardware. It provides an infrastructure for converting neural network models, quantizing weights, optimizing operator kernels, and benchmarking inference performance across CPU, GPU, and NPU units. The project features an automated operator kernel optimizer to generate high-efficiency kernels and a model quantization tool that reduces precision to integer formats to lower memory usage. It includes a dedicated hardware benchmarking tool to evaluate the execution sp
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization,
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model executi
alpaca.cpp is a high-performance local inference engine implemented in C++ for executing instruction-tuned large language models. It serves as a quantized model runtime designed to load and run model tensors on local hardware with minimal dependencies, removing the requirement for a full Python environment. The project focuses on on-device text generation and the deployment of private AI chatbots. It utilizes model weight quantization to reduce memory requirements and increase inference speed on consumer-grade devices. The system covers hardware-optimized model execution through thread-pool
gpt-oss is an open-weight large language model and reasoning engine designed for complex reasoning and agentic workflows. It functions as an AI agent framework and model serving API, allowing for local deployment and the hosting of standardized interfaces to expose model completions and internal reasoning processes. The project distinguishes itself as a quantized inference engine, utilizing tensor parallelism and weight quantization to run high-parameter models on limited hardware. It features a reasoning model that employs chain-of-thought processing to solve multi-step logical tasks. The s
MOSS is a conversational AI platform, fine-tuning toolkit, and quantized model runtime. It provides a framework for deploying large language models capable of multi-turn dialogue, general-purpose response generation, and following complex instructions. The system functions as a tool-augmented framework that extends model knowledge through external plugins and tool-call loops. This allows the model to execute tasks via search engines and calculators to augment responses with external data. The project covers model training through supervised conversational fine-tuning and optimizes deployment
This project is a multimodal translation framework and large language model capable of speech-to-speech, speech-to-text, and text-to-text translation across nearly 100 languages. It provides a real-time speech translation engine and a comprehensive toolkit for converting spoken audio between languages. The system is distinguished by its ability to preserve the original speaker's tone, pace, and prosody during translation. It utilizes a specialized on-device inference toolkit that converts model checkpoints into C-based libraries, enabling low-latency execution on mobile and edge hardware with
Ten Framework is a multimodal large language model agent framework designed for building low-latency conversational agents. It integrates voice, text, and visual inputs in real time to facilitate human interaction. The project includes a real-time speech processing pipeline for streaming transcription, voice activity detection, and speaker diarization. It also features an avatar synchronization engine that coordinates character lip animations and visual outputs with synthesized speech. The framework covers edge AI deployment through containerized packaging and direct integration with embedde
OpenLLM is a framework for deploying, managing, and scaling open-source large language models.
Ktransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
MOSS is a conversational AI API server and framework designed to manage stateful multi-turn dialogues via session identifiers for remote interaction. It functions as a tool-augmented language model framework and a quantized inference engine. The project integrates external plugins, such as search engines and calculators, to provide factual and computed data within model responses. It also includes a supervised fine-tuning toolkit for adapting base language models to specific conversational datasets and behavioral instructions. The system supports inference optimization through 4-bit and 8-bi
This project is a framework for running Stable Diffusion image generation models on Apple Silicon using Core ML hardware acceleration. It provides a local generative AI pipeline for producing images from text prompts using Swift and Python without relying on external cloud APIs. The system includes a model converter to transform deep learning checkpoints into Core ML formats and a model optimizer to quantize weights and activations. It features a ControlNet integration layer to guide image generation using external signals such as edge and depth maps. Capabilities cover text-to-image generat
BELLE is a specialized implementation of Chinese conversational large language models, encompassing a full instruction tuning framework. It provides a pipeline for training, evaluating, and deploying models optimized for natural language understanding and dialogue tasks in the Chinese language. The project is distinguished by its integrated approach to model refinement, combining the curation of multi-million entry instruction datasets with a distributed training pipeline. This pipeline supports both full fine-tuning and low-rank adaptation to optimize conversational performance. The system
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
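As a rough illustration of the speculative decoding idea mentioned above (a small draft model proposes tokens, a larger model verifies them), here is a minimal greedy sketch. It is not gpt-fast's code: the `Model` type, `speculative_decode`, and the two toy models are hypothetical stand-ins, and a real engine would verify all proposed positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function that maps a token
# sequence to its greedy next token. Real systems use neural networks.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches plain greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each proposed position in order.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != guess:
                break  # first mismatch: discard the rest of the proposal
        tokens.extend(accepted)
    return tokens


# Toy models (assumptions for the demo): the target counts upward mod 10,
# and the draft agrees except that it guesses wrong right after a 4.
def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = speculative_decode(target_model, draft_model, [0], max_new_tokens=8)
```

Because every kept token is the target model's own choice, a weak draft model only costs speed, never output quality, in this greedy variant.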
This project is a comprehensive Node.js software development kit designed for integrating large language models into applications. It serves as a foundational client for interacting with REST and WebSocket services, enabling developers to implement chat functionality, multimodal content generation, and autonomous agent orchestration. The library provides a structured framework for defining executable tools and enforcing JSON schemas, ensuring that model outputs remain programmatically compatible with downstream systems. The SDK distinguishes itself through its robust request orchestration and
LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The project distinguishes itself by offering a low-code visual dashboard that enables users to configure experiments and monitor performance metrics in real time without writing extensive custom scripts. It also features a configuration-driven orchestration system that decouples experim
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
llama-rs is a local large language model inference engine implemented in Rust. It enables the execution of model computations on local hardware to generate text responses from user prompts. The project utilizes Rust-based tensor operations and direct-memory model mapping to handle high-performance linear algebra and efficient weight loading. It incorporates weight quantization to reduce the memory footprint of models by converting high-precision weights into smaller formats. The system includes a command-line interface for interactive chat sessions and one-off prompts, along with file-backed
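To make the weight quantization idea above concrete, here is a minimal sketch of symmetric 8-bit quantization with a single shared scale. It is an illustration only, not llama-rs's actual format: the function names are hypothetical, and production runtimes typically quantize in small blocks with per-block scales and often use 4-bit formats.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero (or empty) tensor to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is then stored as one byte plus a shared scale instead of a 4-byte float, which is where the memory saving comes from; the price is the bounded rounding error measured by `max_error`.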
This repository provides a collection of reference implementations and code examples for training and deploying machine learning models using the MLX framework. It serves as a practical guide for executing distributed training, fine-tuning large language models, converting model weights, and implementing multimodal generative workflows. The project distinguishes itself through specialized examples for local hardware execution, featuring weight quantization to reduce memory usage and low-rank adaptation for parameter-efficient fine-tuning. It also includes scripts for transforming external mod
Nexent is an enterprise AI control plane and LLM agent orchestration platform. It provides a zero-code environment for designing, deploying, and managing production AI agents through a multi-agent collaboration framework that coordinates specialized autonomous agents using standardized messaging protocols. The platform integrates the Model Context Protocol to connect agents with external tools, plugins, and services via a universal communication interface. It further distinguishes itself with a dedicated RAG knowledge base manager that imports unstructured documents and utilizes hybrid search
Lingbot-world is an interactive world simulator and framework for generating high-fidelity video environments from text and image prompts. It functions as a video generation system designed to create controllable simulations for applications such as robotics learning and gaming. The project includes a video motion controller that directs camera and object movement using transformation matrices and action strings. It utilizes a quantized inference engine to reduce memory usage and accelerate the generation of video sequences. The system covers a range of optimization techniques, including fou
This project is a collection of implementation guides, recipes, and developer resources for building applications with Llama models. It serves as a comprehensive kit for developing autonomous agents, establishing retrieval-augmented generation systems, and executing model fine-tuning. The resource provides specific patterns for multimodal workflows that process text, images, and audio. It includes specialized guidance on adapting pre-trained model weights for targeted tasks and implementing tool-calling orchestration to connect models with external APIs and functions. The codebase covers a b
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Qwen-Image is a text-to-image model and large language model image generation framework. It functions as an AI image editing suite and a personalized image trainer, capable of producing high-fidelity visuals and accurate typography from natural language descriptions. The system is distinguished by its precision text rendering engine, which integrates multi-script calligraphy and layout-coherent alphabetic text into images. It provides specialized capabilities for subject identity preservation and consistent subject generation across different poses and viewpoints, alongside a training pipelin