The visitor wants a software framework or application that enables running Large Language Models locally on macOS using Apple Silicon's GPU via the Metal Performance Shaders framework.

mlc-ai/mlc-llm is the closest match — MLC LLM is a high-performance inference engine that natively supports Metal acceleration on Apple Silicon, handles GGUF-compatible model quantization, and provides a local API server for model serving.. Other strong matches: ggerganov/llama.cpp, nomic-ai/gpt4all, menloresearch/jan, abetlen/llama-cpp-python.

Why does mlc-ai/mlc-llm match “an inference engine optimized for Apple Silicon”?

MLC LLM is a high-performance inference engine that natively supports Metal acceleration on Apple Silicon, handles GGUF-compatible model quantization, and provides a local API server for model serving.

Why does ggerganov/llama.cpp match “an inference engine optimized for Apple Silicon”?

This is the industry-standard inference engine for running local LLMs, providing native Metal acceleration for Apple Silicon, full GGUF support, and an OpenAI-compatible API server.

Why does nomic-ai/gpt4all match “an inference engine optimized for Apple Silicon”?

GPT4All is a comprehensive local inference engine that provides full Metal acceleration for Apple Silicon, supports GGUF model formats, and includes an OpenAI-compatible API for local integration.

Why does menloresearch/jan match “an inference engine optimized for Apple Silicon”?

Jan is a desktop application that provides a user-friendly interface for running local LLMs on Apple Silicon, supporting GGUF models and providing an OpenAI-compatible API server for local inference.

Why does abetlen/llama-cpp-python match “an inference engine optimized for Apple Silicon”?

This library provides a Python interface for running local LLMs with Metal acceleration and GGUF support, offering an OpenAI-compatible API that makes it a functional engine for local inference.

Apple Silicon LLM Inference Engines

High-performance tools for running large language models locally using Metal acceleration on Apple hardware.

Find the best repos with AI.We'll search the best matching repositories with AI.

mlc-ai/mlc-llm
mlc-ai/mlc-llm
22,057View on GitHub
MLC LLM is a machine learning compiler and inference engine designed to execute large language models locally across diverse hardware platforms, including desktop, mobile, and web environments. By utilizing machine learning compilation, the project transforms high-level model definitions into specialized, hardware-specific binary libraries. This process optimizes model weights and generates compute kernels tailored to the unique memory and processing characteristics of target graphics and mobile hardware. The engine distinguishes itself by providing a unified runtime abstraction that enables native execution on consumer hardware while maintaining compatibility with standard development workflows. It includes a local server architecture that exposes inference endpoints compatible with common chat completion patterns, allowing developers to integrate private, offline language models into external applications. The toolchain supports the entire lifecycle of model deployment, from the conversion and quantization of weights to the generation of standalone binary libraries. These capabilities ensure that models run efficiently with minimal runtime dependencies, regardless of the underlying hardware backend. The project provides both a command-line interface for direct interaction and programmatic interfaces for embedding model execution into custom application logic.
MLC LLM is a high-performance inference engine that natively supports Metal acceleration on Apple Silicon, handles GGUF-compatible model quantization, and provides a local API server for model serving.
PythonModel QuantizationOpenAI-Compatible APIsLocal API Servers
View on GitHub22,057
ggerganov/llama.cpp
ggerganov/llama.cpp
116,912View on GitHub
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal grammars to force model outputs to adhere to specific JSON schemas or patterns, and it implements speculative decoding to increase inference speed. Broad capabilities include hardware acceleration for GPUs, tools for converting models between different data formats, and utilities for measuring model quality via perplexity and divergence metrics. The engine can be wrapped in an HTTP server that provides an OpenAI-compatible API for integration with external tools.
This is the industry-standard inference engine for running local LLMs, providing native Metal acceleration for Apple Silicon, full GGUF support, and an OpenAI-compatible API server.
C++Model QuantizationOpenAI-Compatible APIsOpenAI-Compatible Inference Servers
View on GitHub116,912
nomic-ai/gpt4all
nomic-ai/gpt4all
77,375View on GitHub
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vector spaces. This capability enables context-aware chat sessions where the model can reference private files, notes, and spreadsheets to provide grounded, relevant responses. The system also features a local HTTP server that exposes an OpenAI-compatible API, allowing developers to integrate these private, self-hosted models into existing applications and workflows. Beyond its core inference and retrieval capabilities, the project includes a graphical desktop interface for end-user interaction and a Python software development kit for programmatic access. These tools support advanced configuration of model parameters, performance monitoring, and the management of local embedding pipelines for custom semantic search tasks. The software is distributed as a unified application package, with documentation available to guide users through installation and local environment setup.
GPT4All is a comprehensive local inference engine that provides full Metal acceleration for Apple Silicon, supports GGUF model formats, and includes an OpenAI-compatible API for local integration.
C++OpenAI-CompatibleOpenAI-Compatible APIsLocal API Servers
View on GitHub77,375
menloresearch/jan
menloresearch/jan
43,052View on GitHub
Jan is a local language model desktop application and AI assistant orchestrator. It provides a unified interface for interacting with both resident models and remote cloud AI providers. The project functions as a host for the Model Context Protocol, connecting AI models to external tools and data sources. It also operates as an OpenAI compatible API server, exposing local models through a standardized server endpoint for other applications to query. The system supports the creation of specialized AI personas with custom instructions and allows for the management of hybrid model environments, switching between offline local execution and external cloud APIs.
Jan is a desktop application that provides a user-friendly interface for running local LLMs on Apple Silicon, supporting GGUF models and providing an OpenAI-compatible API server for local inference.
TypeScriptLocal API ServersOpenAI-Compatible Servers
View on GitHub43,052
abetlen/llama-cpp-python
abetlen/llama-cpp-python
9,993View on GitHub
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory across system RAM and VRAM. The library covers a broad range of AI capabilities, including text completion, embedding generation, and the enforcement of structured outputs via JSON schemas or formal grammars. It also provides infrastructure for tool use through external function calling and manages model extensions via LoRA adapter injection. Users can fetch model files directly from Hugging Face and maintain model state persistence for resuming generation.
This library provides a Python interface for running local LLMs with Metal acceleration and GGUF support, offering an OpenAI-compatible API that makes it a functional engine for local inference.
PythonModel QuantizationOpenAI-Compatible Inference ServersOpenAI-Compatible Servers
View on GitHub9,993
janhq/jan
janhq/jan
43,043View on GitHub
Jan is a desktop application that functions as a local artificial intelligence model runtime and an open-standard API server. It enables the execution of large language models directly on local hardware, ensuring that data remains private and accessible offline while providing a unified interface for managing model weights and inference runtimes. The platform distinguishes itself by offering a modular inference backend that allows users to swap execution engines based on hardware compatibility and performance needs. It acts as a cross-platform orchestrator, providing the ability to switch between local model files and remote cloud-based AI providers through a single interface. By exposing these capabilities via an open-standard server layer, the application supports the integration of local AI into external software and development tools. Beyond its core runtime capabilities, the software provides an environment for configuring agentic workflows and autonomous task automation. It includes tools for managing server behaviors, such as network access, authentication, and remote tool execution, while maintaining state persistence through a local file-based database. The application is distributed as a cross-platform container to ensure consistent access to local files and system resources across different operating systems.
Jan is a desktop application that provides a local runtime for LLMs with native support for Apple Silicon via its underlying inference engines, offering the requested GGUF compatibility, Metal acceleration, and an OpenAI-compatible API server.
TypeScriptOpenAI-Compatible Servers
View on GitHub43,043
tiiny-ai/powerinfer
Tiiny-AI/PowerInfer
8,714View on GitHub
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for integrating local models with existing chat clients. The project covers broad capability areas including distributed model inference across multiple nodes, GPU hardware acceleration for Apple Metal and other processors, and structured text generation using formal grammars to constrain outputs. It also implements memory management techniques such as hybrid memory offloading, weight quantization, and CPU core affinity binding.
PowerInfer is a high-performance local inference engine that natively supports Apple Metal acceleration, GGUF format conversion, and provides an OpenAI-compatible API, making it a comprehensive solution for running LLMs on Apple Silicon.
C++OpenAI-Compatible APIsOpenAI-Compatible Inference Servers
View on GitHub8,714
josstorer/rwkv-runner
josStorer/RWKV-Runner
6,219View on GitHub
This application provides a user-friendly interface for running various local models, including support for RWKV and other architectures, with built-in GPU acceleration and OpenAI-compatible API endpoints.
TypeScriptOpenAI-CompatibleOpenAI-Compatible Servers
View on GitHub6,219
ericlbuehler/mistral.rs
EricLBuehler/mistral.rs
6,597View on GitHub
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool execution framework that runs server-side tools like code execution, shell commands, and web search in an automated loop during model generation, with session state persistence. It provides an in-process inference engine that can be embedded directly into Rust or Python applications without a separate server process, and includes an in-situ quantization engine that converts model weights to lower precision at load time with per-layer tuning. The system supports structured output constraints, forcing model output to conform to JSON Schema or grammar specifications during decoding, and offers automatic architecture detection that identifies model type, quantization format, and chat template from a Hugging Face model ID. The platform includes capabilities for managing LoRA adapters, composing models as mixture-of-experts configurations, and running distributed inference across multiple GPUs or nodes using tensor parallelism and ring transport. It provides a built-in web chat interface, supports speculative decoding with a smaller assistant model, and offers benchmarking, logging, and Prometheus metrics for monitoring. The project can be run from a configuration file, with options for customizing build processes, tuning hardware settings automatically, and managing model caches.
This inference engine supports local model execution with quantization and API compatibility, and while it is built in Rust for high performance, it provides the necessary GPU acceleration and model support to run LLMs on Apple Silicon.
RustModel QuantizationOpenAI-CompatibleOpenAI-Compatible APIs
View on GitHub6,597
sakurallm/sakurallm
SakuraLLM/SakuraLLM
4,618View on GitHub
SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs. The project distinguishes itself by supporting multiple high-performance inference backends including llama.cpp, vLLM, and Ollama, enabling flexible deployment across consumer CPU and GPU hardware. It features a format-preserving translation pipeline that extracts, translates, and reassembles text from structured formats like ebooks and subtitles while retaining timestamps, line breaks, and markup. The system also supports CPU-GPU hybrid inference for memory-constrained setups, tensor parallel multi-GPU distribution for larger models, and token probability filtering to refine translation precision. SakuraLLM provides translation capabilities for ebooks, subtitles, visual novels, galgames, RPG Maker games, manga, and plain-text novels. It processes documents by dividing long texts into manageable segments, translating each segment through the language model, and reassembling the output with original formatting intact. The system includes glossary management for maintaining terminology consistency, degeneration detection that monitors token generation and retries with adjusted parameters when output quality degrades, and multi-threaded inference for improved throughput. The project offers a Docker-based deployment with API authentication and supports running on consumer NVIDIA and AMD GPUs.
This is an inference server that leverages backends like llama.cpp and Ollama to run models locally, providing the requested API compatibility and hardware acceleration for your translation tasks.
PythonOpenAI-Compatible APIsllama.cpp Backend Runnersllama.cpp Backend Servers
View on GitHub4,618
imartinez/privategpt
imartinez/privateGPT
57,281View on GitHub
PrivateGPT is a private AI document assistant and local knowledge base manager designed for querying private files and documents using retrieval-augmented generation. It functions as a local language model application and API gateway, allowing users to obtain cited answers from unstructured data without sending information to external servers. The system differentiates itself by acting as a tool integrator that connects language models to external functions, including web search, tabular data analysis, and custom action extensions. It provides a standardized API layer that allows local inference servers to communicate with third-party applications and execute multi-step agentic workflows. The platform covers a broad capability surface including document-to-embedding pipelines, vector database indexing, and the processing of tabular data from CSV files. It also supports asynchronous request handling, response streaming, and API interaction debugging for troubleshooting model exchanges.
PrivateGPT is a local AI application that leverages local inference engines to power its document-based RAG workflows, though it functions primarily as an end-user assistant rather than a dedicated low-level inference engine.
PythonOpenAI-Compatible APIsLocal API Servers
View on GitHub57,281
ggml-org/llama.cpp
ggml-org/llama.cpp
116,799View on GitHub
Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory usage and improves performance, alongside a command-line interface for managing chat templates and inference parameters. The ecosystem further supports structured data generation through grammar-based output constraints and provides diagnostic utilities for visualizing computational graphs. Comprehensive documentation is available, including a reference matrix that details the compatibility of computational operations across supported hardware backends.
This is the primary inference engine for running LLMs locally, offering native Metal acceleration for Apple Silicon, full GGUF support, and a built-in API server for model deployment.
C++Hardware Abstraction LayersText-Only Inference EnginesMultimodal Inference Engines
View on GitHub116,799
city96/comfyui-gguf
city96/ComfyUI-GGUF
3,291View on GitHub
ComfyUI-GGUF is a memory optimizer and model loader for ComfyUI that enables the execution of large transformer-based generative models using quantized weights. It provides a system for loading GGUF formatted weights within a node-based diffusion interface to reduce GPU memory consumption. The project includes a quantization tool for converting standard model checkpoints into compressed binary formats and a tensor fixer to restore missing keys and correct architectures in binary model files. These utilities ensure that compressed models remain functional during inference on hardware with limited VRAM. The framework covers model weight optimization and low-memory inference by supporting the loading of quantized diffusion models and text encoders. It manages the process of on-the-fly precision recovery and weight mapping to maintain performance while reducing the total memory footprint.
This repository is a custom node suite for the ComfyUI diffusion interface rather than a standalone local LLM inference engine, serving as a specialized tool for loading quantized weights within a generative image pipeline.
PythonGGUF ExecutionModel QuantizationGGUF Weight Quantization
View on GitHub3,291
openbmb/minicpm
OpenBMB/MiniCPM
9,464View on GitHub
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughput. Capability areas cover the full model lifecycle, including supervised fine-tuning and preference optimization via parameter-efficient LoRA adapters. The system supports structured tool calling for external agent integration and provides various serving options, including OpenAI-compatible APIs, REST endpoints, and a command-line interface. The implementation includes tools for converting model checkpoints between formats and distributing training workloads across multiple GPUs.
This repository provides a collection of optimized language models and the necessary tools for local inference, including GGUF support and API compatibility for Apple Silicon, though it functions primarily as a model suite rather than a standalone inference engine application.
Jupyter NotebookOpenAI-Compatible APIsGGUF Weight QuantizationOpenAI-Compatible Servers
View on GitHub9,464
aaswordman/operit
AAswordman/Operit
3,373View on GitHub
Operit is a private, voice-enabled AI agent designed to run quantized large language models offline within mobile Linux environments. It functions as a plugin-based agent that combines local inference with a hands-free interaction pipeline. The system distinguishes itself through the use of role cards to manage distinct AI personas and conversation histories. It integrates a voice-driven interface utilizing speech-to-text and text-to-speech modules, and it enables device automation by dispatching shell commands and accessibility services to navigate user interfaces. The project further covers remote workspace synchronization via standard file transfer protocols and a memory management system that summarizes conversation history. It also provides a framework for connecting external services and custom scripts to the AI agent through standardized tool interfaces.
This project is a voice-enabled AI agent designed for mobile Linux and Android environments rather than a general-purpose local inference engine for macOS, making it a specialized application rather than the framework you are seeking.
KotlinGGUF AssistantsGGUF Execution
View on GitHub3,373
opennmt/ctranslate2
OpenNMT/CTranslate2
4,319View on GitHub
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model distribution across multiple GPUs, static prompt state caching to avoid re-encoding repeated inputs, and CPU instruction set dispatch that selects the optimal code path for the hardware. An asynchronous inference queue allows overlapping computation with other work, while the OpenAI-compatible REST API enables drop-in integration with existing applications. CTranslate2 provides model conversion tools for frameworks including Fairseq, Hugging Face Transformers, Marian, OpenNMT-py, OpenNMT-tf, and OPUS-MT, transforming trained models into an optimized binary format. It supports a range of quantization types such as INT8, FP16, and BF16, with automatic compute type selection based on the available hardware. The engine handles text translation, text generation with configurable decoding strategies like beam search and sampling, sequence scoring, text encoding, and speech transcription, all with streaming input and output capabilities.
CTranslate2 is a high-performance inference engine that supports model quantization and OpenAI-compatible APIs, though it relies on its own optimized binary format rather than native GGUF support and lacks explicit mention of Metal Performance Shaders for Apple Silicon acceleration.
C++Model QuantizationOpenAI-Compatible APIsOpenAI-Compatible API Servers
View on GitHub4,319
mudler/localai
mudler/LocalAI
46,889View on GitHub
LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services. The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and dependencies, ensuring consistent deployment across diverse hardware configurations. To optimize system performance, the server employs an on-demand orchestration layer that dynamically loads and unloads models based on active requests, minimizing memory usage during periods of inactivity. The system supports a wide range of model architectures through a flexible backend abstraction that allows for driver switching at runtime. Users can manage their models and interact with the service through a web interface or via standard web requests, which the proxy translates into model-specific execution commands. The software is distributed as a containerized application to facilitate deployment across various server and cloud environments.
LocalAI is a comprehensive inference server that supports GGUF models and provides an OpenAI-compatible API, allowing you to run LLMs locally on Apple Silicon through its backend abstraction layers.
GoLocal Model Serving
View on GitHub46,889
zai-org/chatglm-6b
zai-org/ChatGLM-6B
41,039View on GitHub
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as weight quantization and parameter-efficient fine-tuning via low-rank adaptation, which significantly reduce memory requirements and computational overhead. These features enable the deployment of large models on consumer-grade hardware while maintaining high throughput and performance. Beyond core inference, the toolkit includes a suite of utilities for programmatic integration, allowing developers to embed model capabilities into custom software workflows via standard interfaces. It also provides multiple interactive interfaces, including web-based graphical environments for text and vision tasks and a command-line interface for rapid prototyping and evaluation. The software is distributed as a Python-based package, requiring standard environment configuration to manage dependencies and hardware resource allocation.
This is a local LLM inference engine that supports Apple Silicon and model quantization, though it is specifically tailored for the ChatGLM model architecture rather than the GGUF format requested.
PythonModel Quantization
View on GitHub41,039
vllm-project/vllm
vllm-project/vllm
83,048View on GitHub
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cache blocks to eliminate fragmentation and its ability to dynamically insert new sequences into batches as they arrive. It provides a hardware-agnostic abstraction layer that maps complex mathematical operations to diverse accelerators, including specialized GPUs and consumer-grade silicon like Apple hardware. This is further supported by custom kernel fusion and a flexible quantization framework that allows for the compression of neural networks to fit resource-constrained environments. Beyond its core runtime, the framework offers extensive support for custom
vLLM is a high-throughput inference engine that supports Apple Silicon via Metal acceleration and provides the API compatibility and quantization features required for local model serving.
PythonModel Quantization
View on GitHub83,048
fauxpilot/fauxpilot
fauxpilot/fauxpilot
14,732View on GitHub
Fauxpilot is a self-hosted AI coding assistant and local inference server. It functions as a proxy and API gateway that redirects traffic from IDE plugins to a local large language model, allowing for AI-assisted programming without external cloud dependencies. The project provides a specialized API emulation layer that mimics coding assistant protocols and a standardized OpenAI-compatible interface. This enables supported code editors to use local models for completions and suggestions by overriding default proxy URLs. The system includes capabilities for downloading and deploying local models, as well as a format-conversion pipeline to transform model files into optimized versions for specific inference engines. A model-agnostic backend allows for switching between different inference engines while maintaining the same API interfaces.
This project is an API proxy and coding assistant server designed to interface with existing IDE plugins, rather than being an inference engine that provides the Metal-accelerated model execution itself.
PythonOpenAI-CompatibleOpenAI-Compatible APIs
View on GitHub14,732

Apple Silicon LLM Inference Engines

mlc-ai/mlc-llm

ggerganov/llama.cpp

nomic-ai/gpt4all

menloresearch/jan

abetlen/llama-cpp-python

janhq/jan

Tiiny-AI/PowerInfer

josStorer/RWKV-Runner

EricLBuehler/mistral.rs

SakuraLLM/SakuraLLM

imartinez/privateGPT

ggml-org/llama.cpp

city96/ComfyUI-GGUF

OpenBMB/MiniCPM

AAswordman/Operit

OpenNMT/CTranslate2

mudler/LocalAI

zai-org/ChatGLM-6B

vllm-project/vllm

fauxpilot/fauxpilot