7 repos
Techniques and configurations that enhance model execution speed, reduce memory usage, and improve computational efficiency during inference.
7 GitHub repositories in Artificial Intelligence & ML · Inference Optimization.
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The syst
Optimizes execution performance by setting specific model weights to zero through target-aware authoring and specialized kernels.
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into indepen
Activates optimized execution paths through specific configuration parameters to boost performance in production environments.
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen
Dynamically inserts new sequences into active inference batches to maximize hardware utilization.
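The batching behavior described above can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not vLLM's actual scheduler: the function name `continuous_batching`, the `(request_id, tokens_to_generate)` request format, and the fixed `max_batch_size` are hypothetical, and each decode step is assumed to produce exactly one token per active sequence.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Toy scheduler (hypothetical, for illustration only).

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Returns a list of steps; each step lists the request ids that were
    decoded together in that iteration.
    """
    waiting = deque(requests)
    active = {}  # request_id -> tokens still to generate
    steps = []
    while waiting or active:
        # Before every decode step, fill free slots with waiting requests
        # instead of waiting for the whole batch to finish.
        while waiting and len(active) < max_batch_size:
            request_id, n_tokens = waiting.popleft()
            active[request_id] = n_tokens
        steps.append(sorted(active))
        # One decode step: every active sequence produces one token.
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] <= 0:
                del active[request_id]  # the slot is free for the next step
    return steps


if __name__ == "__main__":
    for i, batch in enumerate(continuous_batching([("a", 3), ("b", 1), ("c", 2)], 2)):
        print(f"step {i}: {batch}")
```

In this example request `c` joins as soon as `b` finishes, so the three requests complete in 3 decode steps. A static batch of `a` and `b` followed by `c` alone would need 5.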
This project is a comprehensive educational resource and knowledge base dedicated to the development and application of large language models and autonomous agentic systems. It provides a structured framework for understanding prompt engineering, context management, and the architectural patterns required to build task
Reviews high-performance infrastructure solutions designed to minimize latency and maximize throughput for model inference.
Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on
Reduces numerical precision in model weights to lower memory footprint and accelerate inference on local devices.
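As a rough illustration of reducing numerical precision, the sketch below applies symmetric per-tensor 8-bit quantization to a plain list of floats. It is an assumption-laden example, not the scheme this project uses: the function names `quantize_int8` and `dequantize` are hypothetical, and real runtimes typically use block-wise formats with per-block scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Hypothetical helper for illustration; returns (quantized, scale).
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One float scale is stored alongside the 8-bit integers.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.2, 0.0]
    q, scale = quantize_int8(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

Each weight then occupies one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half the scale per weight.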
YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning
Decreases model size and improves execution speed by setting a specific percentage of weights to zero.
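One common way to zero a chosen percentage of weights is magnitude pruning, sketched below on a plain list of floats. The listing does not say which selection criterion the project uses, so the smallest-magnitude rule here is an assumption, and `magnitude_prune` is a hypothetical name.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration; returns a new list.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    print(magnitude_prune([0.9, -0.1, 0.05, -0.7], sparsity=0.5))
```

Zeroing weights only shrinks the stored model or speeds up execution when the result is saved in a sparse format or run with kernels that skip zeros.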
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade
Predicts multiple future tokens in parallel to accelerate the generation process and reduce total processing steps.
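The idea of producing several tokens per step can be sketched with a greedy draft-and-verify loop, a related technique usually called speculative decoding. This is not necessarily the method Unsloth implements. The names `speculative_generate`, `target_next`, and `draft_next` are hypothetical, and the toy "models" are plain functions that map a token list to the next token.

```python
def speculative_generate(target_next, draft_next, prompt, n_new, k=4):
    """Greedy draft-and-verify decoding (illustrative sketch).

    Returns (tokens, rounds), where `rounds` counts verification passes.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # A cheap draft model proposes k future tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verification: a real system scores all k positions in one
        # parallel forward pass; this toy checks them one by one.
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the agreeing prefix plus the target's correction.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens[:goal], rounds


if __name__ == "__main__":
    count_up = lambda t: (t[-1] + 1) % 10
    print(speculative_generate(count_up, count_up, [0], n_new=8))
```

When the draft agrees with the target, each round yields up to `k` tokens. When it disagrees, the output still matches plain greedy decoding, only with more rounds.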