awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Local and On-Device Inference · Awesome GitHub Repositories

9 repos

Explore 9 awesome GitHub repositories matching Artificial Intelligence & ML · Local and On-Device Inference.

Awesome Local and On-Device Inference GitHub Repositories

  • ggml-org/llama.cpp

    95,400 · GitHub

    Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU archi

    Terminal-based utilities allow for direct interaction with models, including configuration of inference parameters and chat management.

    C++ · ggml
  • nomic-ai/gpt4all

    77,146 · GitHub

    GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a compreh

    Enables private, offline inference by running large language models directly on local hardware resources.

    C++ · ai-chat · llm-inference
  • zed-industries/zed

    75,634 · GitHub

    Zed is an AI-native, high-performance code editor designed for extreme responsiveness and keyboard-centric workflows. It functions as an extensible text processing workspace that integrates autonomous agents and predictive models directly into the development environment to automate complex engineering tasks, refactori

    Runs machine learning models on local hardware to ensure data privacy and reduce latency for AI-assisted coding tasks.

    Rust · gpui · rust-lang · text-editor
  • infiniflow/ragflow

    73,425 · GitHub

    This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin

    Configures local inference engines and external model providers through a unified interface for seamless deployment.

    Python · agent · agentic · agentic-ai
  • vllm-project/vllm

    70,745 · GitHub

    vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen

    Enables execution of advanced generative models directly on local hardware for private and low-latency inference.

    Python · amd · blackwell · cuda
  • hiyouga/LlamaFactory

    67,386 · GitHub

    LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The pro

    Hosts models locally to serve low-latency predictions through standard network APIs.

    Python · agent · ai · deepseek
  • meta-llama/llama

    59,157 · GitHub

    Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on

    Runs generative models directly on consumer hardware to maintain data privacy and eliminate dependency on cloud services.

    Python
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Runs generative language models directly on local hardware for private, offline processing tasks.

    Python
  • ultralytics/ultralytics

    53,426 · GitHub

    Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification

    Optimizes model weights and architectures for efficient inference on low-power embedded hardware.

    Python · cli · computer-vision · deep-learning

Explore sub-tags

  • Command Line Inference Interfaces (1 sub-tag): Terminal-based interfaces that allow users to interact with and manage model inference servers directly from the command line.
  • Edge AI Model Deployment: Technologies that optimize and deploy machine learning models to run efficiently on local hardware and edge devices.
  • Local AI Inference: Software that executes machine learning models directly on local hardware resources to ensure privacy and reduce latency.
  • Local API Servers: Implementations of standard AI API interfaces (e.g., OpenAI-compatible) running on local infrastructure.
  • Local Inference Engines: Software frameworks that enable the execution of generative artificial intelligence models directly on local computing hardware.
  • Local LLM Configurations: Settings and configuration files required to optimize and run large language models on local computing infrastructure.
  • Local Model Inference Servers: Components that host models locally to provide low-latency predictions via standard network APIs.
  • Privacy-First AI Backends: Infrastructure that ensures data privacy and security by processing AI requests locally rather than on external servers.