High-performance tools for running large language models locally using Metal acceleration on Apple hardware.
MLC LLM is a machine learning compiler and inference engine designed to execute large language models locally across diverse hardware platforms, including desktop, mobile, and web environments. By utilizing machine learning compilation, the project transforms high-level model definitions into specialized, hardware-specific binary libraries. This process optimizes model weights and generates compute kernels tailored to the unique memory and processing characteristics of target graphics and mobile hardware. The engine distinguishes itself by providing a unified runtime abstraction that enables native execution on consumer hardware while maintaining compatibility with standard development workflows. It includes a local server architecture that exposes inference endpoints compatible with common chat completion patterns, allowing developers to integrate private, offline language models into external applications. The toolchain supports the entire lifecycle of model deployment, from the conversion and quantization of weights to the generation of standalone binary libraries. These capabilities ensure that models run efficiently with minimal runtime dependencies, regardless of the underlying hardware backend. The project provides both a command-line interface for direct interaction and programmatic interfaces for embedding model execution into custom application logic.
MLC LLM is a high-performance inference engine that natively supports Metal acceleration on Apple Silicon, handles GGUF-compatible model quantization, and provides a local API server for model serving.
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal grammars to force model outputs to adhere to specific JSON schemas or patterns, and it implements speculative decoding to increase inference speed. Broad capabilities include hardware acceleration for GPUs, tools for converting models between different data formats, and utilities for measuring model quality via perplexity and divergence metrics. The engine can be wrapped in an HTTP server that provides an OpenAI-compatible API for integration with external tools.
This is the industry-standard inference engine for running local LLMs, providing native Metal acceleration for Apple Silicon, full GGUF support, and an OpenAI-compatible API server.
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vector spaces. This capability enables context-aware chat sessions where the model can reference private files, notes, and spreadsheets to provide grounded, relevant responses. The system also features a local HTTP server that exposes an OpenAI-compatible API, allowing developers to integrate these private, self-hosted models into existing applications and workflows. Beyond its core inference and retrieval capabilities, the project includes a graphical desktop interface for end-user interaction and a Python software development kit for programmatic access. These tools support advanced configuration of model parameters, performance monitoring, and the management of local embedding pipelines for custom semantic search tasks. The software is distributed as a unified application package, with documentation available to guide users through installation and local environment setup.
GPT4All is a comprehensive local inference engine that provides full Metal acceleration for Apple Silicon, supports GGUF model formats, and includes an OpenAI-compatible API for local integration.
Jan is a local language model desktop application and AI assistant orchestrator. It provides a unified interface for interacting with both resident models and remote cloud AI providers. The project functions as a host for the Model Context Protocol, connecting AI models to external tools and data sources. It also operates as an OpenAI compatible API server, exposing local models through a standardized server endpoint for other applications to query. The system supports the creation of specialized AI personas with custom instructions and allows for the management of hybrid model environments, switching between offline local execution and external cloud APIs.
Jan is a desktop application that provides a user-friendly interface for running local LLMs on Apple Silicon, supporting GGUF models and providing an OpenAI-compatible API server for local inference.
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory across system RAM and VRAM. The library covers a broad range of AI capabilities, including text completion, embedding generation, and the enforcement of structured outputs via JSON schemas or formal grammars. It also provides infrastructure for tool use through external function calling and manages model extensions via LoRA adapter injection. Users can fetch model files directly from Hugging Face and maintain model state persistence for resuming generation.
This library provides a Python interface for running local LLMs with Metal acceleration and GGUF support, offering an OpenAI-compatible API that makes it a functional engine for local inference.
Jan is a desktop application that functions as a local artificial intelligence model runtime and an open-standard API server. It enables the execution of large language models directly on local hardware, ensuring that data remains private and accessible offline while providing a unified interface for managing model weights and inference runtimes. The platform distinguishes itself by offering a modular inference backend that allows users to swap execution engines based on hardware compatibility and performance needs. It acts as a cross-platform orchestrator, providing the ability to switch between local model files and remote cloud-based AI providers through a single interface. By exposing these capabilities via an open-standard server layer, the application supports the integration of local AI into external software and development tools. Beyond its core runtime capabilities, the software provides an environment for configuring agentic workflows and autonomous task automation. It includes tools for managing server behaviors, such as network access, authentication, and remote tool execution, while maintaining state persistence through a local file-based database. The application is distributed as a cross-platform container to ensure consistent access to local files and system resources across different operating systems.
Jan is a desktop application that provides a local runtime for LLMs with native support for Apple Silicon via its underlying inference engines, offering the requested GGUF compatibility, Metal acceleration, and an OpenAI-compatible API server.
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for integrating local models with existing chat clients. The project covers broad capability areas including distributed model inference across multiple nodes, GPU hardware acceleration for Apple Metal and other processors, and structured text generation using formal grammars to constrain outputs. It also implements memory management techniques such as hybrid memory offloading, weight quantization, and CPU core affinity binding.
PowerInfer is a high-performance local inference engine that natively supports Apple Metal acceleration, GGUF format conversion, and provides an OpenAI-compatible API, making it a comprehensive solution for running LLMs on Apple Silicon.
This application provides a user-friendly interface for running various local models, including support for RWKV and other architectures, with built-in GPU acceleration and OpenAI-compatible API endpoints.
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool execution framework that runs server-side tools like code execution, shell commands, and web search in an automated loop during model generation, with session state persistence. It provides an in-process inference engine that can be embedded directly into Rust or Python applications without a separate server process, and includes an in-situ quantization engine that converts model weights to lower precision at load time with per-layer tuning. The system supports structured output constraints, forcing model output to conform to JSON Schema or grammar specifications during decoding, and offers automatic architecture detection that identifies model type, quantization format, and chat template from a Hugging Face model ID. The platform includes capabilities for managing LoRA adapters, composing models as mixture-of-experts configurations, and running distributed inference across multiple GPUs or nodes using tensor parallelism and ring transport. It provides a built-in web chat interface, supports speculative decoding with a smaller assistant model, and offers benchmarking, logging, and Prometheus metrics for monitoring. The project can be run from a configuration file, with options for customizing build processes, tuning hardware settings automatically, and managing model caches.
This inference engine supports local model execution with quantization and API compatibility, and while it is built in Rust for high performance, it provides the necessary GPU acceleration and model support to run LLMs on Apple Silicon.
SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs. The project distinguishes itself by supporting multiple high-performance inference backends including llama.cpp, vLLM, and Ollama, enabling flexible deployment across consumer CPU and GPU hardware. It features a format-preserving translation pipeline that extracts, translates, and reassembles text from structured formats like ebooks and subtitles while retaining timestamps, line breaks, and markup. The system also supports CPU-GPU hybrid inference for memory-constrained setups, tensor parallel multi-GPU distribution for larger models, and token probability filtering to refine translation precision. SakuraLLM provides translation capabilities for ebooks, subtitles, visual novels, galgames, RPG Maker games, manga, and plain-text novels. It processes documents by dividing long texts into manageable segments, translating each segment through the language model, and reassembling the output with original formatting intact. The system includes glossary management for maintaining terminology consistency, degeneration detection that monitors token generation and retries with adjusted parameters when output quality degrades, and multi-threaded inference for improved throughput. The project offers a Docker-based deployment with API authentication and supports running on consumer NVIDIA and AMD GPUs.
This is an inference server that leverages backends like llama.cpp and Ollama to run models locally, providing the requested API compatibility and hardware acceleration for your translation tasks.
PrivateGPT is a private AI document assistant and local knowledge base manager designed for querying private files and documents using retrieval-augmented generation. It functions as a local language model application and API gateway, allowing users to obtain cited answers from unstructured data without sending information to external servers. The system differentiates itself by acting as a tool integrator that connects language models to external functions, including web search, tabular data analysis, and custom action extensions. It provides a standardized API layer that allows local inference servers to communicate with third-party applications and execute multi-step agentic workflows. The platform covers a broad capability surface including document-to-embedding pipelines, vector database indexing, and the processing of tabular data from CSV files. It also supports asynchronous request handling, response streaming, and API interaction debugging for troubleshooting model exchanges.
PrivateGPT is a local AI application that leverages local inference engines to power its document-based RAG workflows, though it functions primarily as an end-user assistant rather than a dedicated low-level inference engine.
Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory usage and improves performance, alongside a command-line interface for managing chat templates and inference parameters. The ecosystem further supports structured data generation through grammar-based output constraints and provides diagnostic utilities for visualizing computational graphs. Comprehensive documentation is available, including a reference matrix that details the compatibility of computational operations across supported hardware backends.
This is the primary inference engine for running LLMs locally, offering native Metal acceleration for Apple Silicon, full GGUF support, and a built-in API server for model deployment.
ComfyUI-GGUF is a memory optimizer and model loader for ComfyUI that enables the execution of large transformer-based generative models using quantized weights. It provides a system for loading GGUF formatted weights within a node-based diffusion interface to reduce GPU memory consumption. The project includes a quantization tool for converting standard model checkpoints into compressed binary formats and a tensor fixer to restore missing keys and correct architectures in binary model files. These utilities ensure that compressed models remain functional during inference on hardware with limited VRAM. The framework covers model weight optimization and low-memory inference by supporting the loading of quantized diffusion models and text encoders. It manages the process of on-the-fly precision recovery and weight mapping to maintain performance while reducing the total memory footprint.
This repository is a custom node suite for the ComfyUI diffusion interface rather than a standalone local LLM inference engine, serving as a specialized tool for loading quantized weights within a generative image pipeline.
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughput. Capability areas cover the full model lifecycle, including supervised fine-tuning and preference optimization via parameter-efficient LoRA adapters. The system supports structured tool calling for external agent integration and provides various serving options, including OpenAI-compatible APIs, REST endpoints, and a command-line interface. The implementation includes tools for converting model checkpoints between formats and distributing training workloads across multiple GPUs.
This repository provides a collection of optimized language models and the necessary tools for local inference, including GGUF support and API compatibility for Apple Silicon, though it functions primarily as a model suite rather than a standalone inference engine application.
Operit is a private, voice-enabled AI agent designed to run quantized large language models offline within mobile Linux environments. It functions as a plugin-based agent that combines local inference with a hands-free interaction pipeline. The system distinguishes itself through the use of role cards to manage distinct AI personas and conversation histories. It integrates a voice-driven interface utilizing speech-to-text and text-to-speech modules, and it enables device automation by dispatching shell commands and accessibility services to navigate user interfaces. The project further covers remote workspace synchronization via standard file transfer protocols and a memory management system that summarizes conversation history. It also provides a framework for connecting external services and custom scripts to the AI agent through standardized tool interfaces.
This project is a voice-enabled AI agent designed for mobile Linux and Android environments rather than a general-purpose local inference engine for macOS, making it a specialized application rather than the framework you are seeking.
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model distribution across multiple GPUs, static prompt state caching to avoid re-encoding repeated inputs, and CPU instruction set dispatch that selects the optimal code path for the hardware. An asynchronous inference queue allows overlapping computation with other work, while the OpenAI-compatible REST API enables drop-in integration with existing applications. CTranslate2 provides model conversion tools for frameworks including Fairseq, Hugging Face Transformers, Marian, OpenNMT-py, OpenNMT-tf, and OPUS-MT, transforming trained models into an optimized binary format. It supports a range of quantization types such as INT8, FP16, and BF16, with automatic compute type selection based on the available hardware. The engine handles text translation, text generation with configurable decoding strategies like beam search and sampling, sequence scoring, text encoding, and speech transcription, all with streaming input and output capabilities.
CTranslate2 is a high-performance inference engine that supports model quantization and OpenAI-compatible APIs, though it relies on its own optimized binary format rather than native GGUF support and lacks explicit mention of Metal Performance Shaders for Apple Silicon acceleration.
LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services. The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and dependencies, ensuring consistent deployment across diverse hardware configurations. To optimize system performance, the server employs an on-demand orchestration layer that dynamically loads and unloads models based on active requests, minimizing memory usage during periods of inactivity. The system supports a wide range of model architectures through a flexible backend abstraction that allows for driver switching at runtime. Users can manage their models and interact with the service through a web interface or via standard web requests, which the proxy translates into model-specific execution commands. The software is distributed as a containerized application to facilitate deployment across various server and cloud environments.
LocalAI is a comprehensive inference server that supports GGUF models and provides an OpenAI-compatible API, allowing you to run LLMs locally on Apple Silicon through its backend abstraction layers.
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as weight quantization and parameter-efficient fine-tuning via low-rank adaptation, which significantly reduce memory requirements and computational overhead. These features enable the deployment of large models on consumer-grade hardware while maintaining high throughput and performance. Beyond core inference, the toolkit includes a suite of utilities for programmatic integration, allowing developers to embed model capabilities into custom software workflows via standard interfaces. It also provides multiple interactive interfaces, including web-based graphical environments for text and vision tasks and a command-line interface for rapid prototyping and evaluation. The software is distributed as a Python-based package, requiring standard environment configuration to manage dependencies and hardware resource allocation.
This is a local LLM inference engine that supports Apple Silicon and model quantization, though it is specifically tailored for the ChatGLM model architecture rather than the GGUF format requested.
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cache blocks to eliminate fragmentation and its ability to dynamically insert new sequences into batches as they arrive. It provides a hardware-agnostic abstraction layer that maps complex mathematical operations to diverse accelerators, including specialized GPUs and consumer-grade silicon like Apple hardware. This is further supported by custom kernel fusion and a flexible quantization framework that allows for the compression of neural networks to fit resource-constrained environments. Beyond its core runtime, the framework offers extensive support for custom
vLLM is a high-throughput inference engine that supports Apple Silicon via Metal acceleration and provides the API compatibility and quantization features required for local model serving.
Fauxpilot is a self-hosted AI coding assistant and local inference server. It functions as a proxy and API gateway that redirects traffic from IDE plugins to a local large language model, allowing for AI-assisted programming without external cloud dependencies. The project provides a specialized API emulation layer that mimics coding assistant protocols and a standardized OpenAI-compatible interface. This enables supported code editors to use local models for completions and suggestions by overriding default proxy URLs. The system includes capabilities for downloading and deploying local models, as well as a format-conversion pipeline to transform model files into optimized versions for specific inference engines. A model-agnostic backend allows for switching between different inference engines while maintaining the same API interfaces.
This project is an API proxy and coding assistant server designed to interface with existing IDE plugins, rather than being an inference engine that provides the Metal-accelerated model execution itself.