awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Inference Optimization · Awesome GitHub Repositories

7 repos

Techniques and configurations that enhance model execution speed, reduce memory usage, and improve computational efficiency during inference.

Explore 7 awesome GitHub repositories matching Artificial Intelligence & ML · Inference Optimization.

Awesome Inference Optimization GitHub Repositories

  • tensorflow/tensorflow

    193,864 · GitHub

    TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The syst…

    Improves execution performance through model sparsity: selected weights are set to zero via target-aware authoring, and specialized kernels exploit the resulting zeros.

    C++ · deep-learning · deep-neural-networks · distributed
  • PaddlePaddle/PaddleOCR

    70,931 · GitHub

    PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into indepen…

    Activates optimized execution paths through specific configuration parameters to boost performance in production environments.

    Python · ai4science · chineseocr · document-parsing
  • vllm-project/vllm

    70,745 · GitHub

    vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen…

    Dynamically inserts new sequences into active inference batches to maximize hardware utilization.

    Python · amd · blackwell · cuda
  • dair-ai/Prompt-Engineering-Guide

    70,526 · GitHub

    This project is a comprehensive educational resource and knowledge base dedicated to the development and application of large language models and autonomous agentic systems. It provides a structured framework for understanding prompt engineering, context management, and the architectural patterns required to build task…

    Reviews high-performance infrastructure solutions designed to minimize latency and maximize throughput for model inference.

    MDX · agent · agents · ai-agents
  • meta-llama/llama

    59,157 · GitHub

    Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on…

    Reduces numerical precision in model weights to lower memory footprint and accelerate inference on local devices.

    Python
  • ultralytics/yolov5

    56,830 · GitHub

    YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning…

    Decreases model size and improves execution speed by setting a specific percentage of weights to zero.

    Python · coreml · deep-learning · ios
  • unslothai/unsloth

    52,461 · GitHub

    Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade…

    Predicts multiple future tokens in parallel to accelerate the generation process and reduce total processing steps.

    Python · agent · deepseek · deepseek-r1
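
The vllm-project/vllm entry above mentions inserting new sequences into active inference batches. A minimal sketch of that scheduling idea follows, assuming each request simply needs a fixed number of decode steps. The function name `continuous_batching` and its parameters are hypothetical and chosen for illustration; this is not vLLM's actual scheduler or API.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate decode steps with continuous batching.

    `requests` maps a request id to the number of tokens it still has to
    generate (a hypothetical, simplified stand-in for real sequences).
    Returns a list holding the ids that were active at each step.
    """
    waiting = deque(requests.items())  # FIFO queue of (id, remaining tokens)
    active = {}                        # id -> tokens still to generate
    schedule = []

    while waiting or active:
        # Fill free slots right away instead of waiting for the batch to drain.
        while waiting and len(active) < max_batch_size:
            req_id, remaining = waiting.popleft()
            active[req_id] = remaining

        schedule.append(sorted(active))

        # One decode step: every active sequence emits one token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                del active[req_id]  # a finished sequence frees its slot

    return schedule


steps = continuous_batching({"a": 3, "b": 1, "c": 2}, max_batch_size=2)
for index, ids in enumerate(steps):
    print(index, ids)
```

In this toy run, request `c` takes over the slot freed by `b` after the first step, so all three requests finish in three steps. Running `a` and `b` as one fixed batch and `c` afterwards would take five.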

Explore sub-tags

  • Continuous Batching Strategies: Techniques that dynamically insert new requests into active inference batches to maintain high hardware utilization.
  • High-Performance Inference Modes: Configuration parameters that enable optimized execution paths for production workloads.
  • Inference Acceleration Techniques (1 sub-tag): Methods and strategies designed to increase the speed of text generation by optimizing token prediction processes.
  • Memory-Mapped Weight Loaders: Mechanisms that map model weight files directly into process memory to reduce RAM usage and improve load times.
  • Model Sparsity: Techniques that reduce model size and improve execution performance by setting a portion of weights to zero.
  • Quantization Strategies: Techniques for reducing the numerical precision of model weights and activations to optimize inference speed and memory usage.
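
As a rough illustration of the Quantization Strategies sub-tag, the sketch below applies symmetric per-tensor int8 quantization to a short list of weights using only the standard library. The helper names `quantize_int8` and `dequantize` are hypothetical, and the scheme shown is one common option, not the method used by any repository listed here.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [value * scale for value in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one small integer plus a shared float scale, which is where the memory saving comes from. The trade-off is the rounding error measured by `max_error`.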