Tensorrtx

tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API.

The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object detection to Vision Transformers and diverse image classification architectures. It enables high-performance deployment by compiling trained model parameters into hardware-specific binary engines and implementing custom C++ extensions for non-standard neural network layers.
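The weight conversion step can be pictured with a small sketch. It assumes a plain-text interchange layout: a count line, then one line per tensor holding its name, element count, and big-endian float32 values written as hex. That layout and the names `write_weights` and `read_weights` are illustrative assumptions, not something the description above specifies.

```python
import io
import struct
from typing import Dict, List, TextIO


def write_weights(weights: Dict[str, List[float]], out: TextIO) -> None:
    """Write tensors as text: a count line, then 'name length hex hex ...'."""
    out.write(f"{len(weights)}\n")
    for name, values in weights.items():
        # Each value is stored as the 4 bytes of a big-endian float32, in hex.
        hex_words = " ".join(struct.pack(">f", v).hex() for v in values)
        out.write(f"{name} {len(values)} {hex_words}\n")


def read_weights(src: TextIO) -> Dict[str, List[float]]:
    """Parse the layout produced by write_weights back into Python floats."""
    count = int(src.readline())
    weights: Dict[str, List[float]] = {}
    for _ in range(count):
        name, length, *hex_words = src.readline().split()
        if int(length) != len(hex_words):
            raise ValueError(f"{name}: expected {length} values, got {len(hex_words)}")
        weights[name] = [struct.unpack(">f", bytes.fromhex(w))[0] for w in hex_words]
    return weights


# Round trip through an in-memory file (hypothetical tensor name and values).
buffer = io.StringIO()
write_weights({"conv1.bias": [0.5, -1.25]}, buffer)
buffer.seek(0)
restored = read_weights(buffer)
```

A text layout like this keeps the exported weights independent of the training framework, so an engine builder written in another language only needs a small parser to consume them.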

The system covers a broad range of visual processing capabilities, including face and human pose analysis, image segmentation, and text recognition. It also supports road analysis for driving perception, such as lane detection and panoptic segmentation, as well as temporal analysis for action recognition in video.

Performance is managed through quantization calibration for integer precision, dynamic shape profiling for variable batch sizes, and multi-GPU scaling to distribute inference tasks across multiple hardware contexts.
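As a rough illustration of what quantization calibration computes, the sketch below derives a single symmetric INT8 scale from representative data and applies it. It uses simple max-abs calibration, which is a stand-in chosen for brevity and not the entropy-based calibrators that TensorRT offers. The function names and sample values are hypothetical.

```python
from typing import Iterable, List


def calibrate_scale(batches: Iterable[List[float]]) -> float:
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max((abs(v) for batch in batches for v in batch), default=0.0)
    # Fall back to 1.0 when the calibration data is empty or all zeros.
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Map floats to signed 8-bit integers, clamping to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


# Hypothetical calibration batches standing in for representative activations.
scale = calibrate_scale([[0.2, -0.7], [1.27, 0.05]])
codes = quantize([1.27, -0.5, 0.0, 9.0], scale)
```

Values outside the calibrated range, such as the `9.0` above, saturate at the clamp limit. This is why the calibration set needs to resemble the data seen at inference time.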

Features

GPU Accelerated Computer Vision - Provides a high-performance engine for executing computer vision tasks accelerated by graphics hardware.
TensorRT Model Implementations - Provides a comprehensive library of popular deep learning networks implemented via the TensorRT API for high-performance inference.
Object Detection - Identifies and locates objects within images using a wide variety of high-performance architectures on graphics processors.
Deep Learning Inference Engines - Implements common deep learning networks like vision transformers and object detectors for accelerated hardware inference.
GPU-Accelerated Inference - Optimizes inference latency on GPUs through quantization, dynamic shape profiling, and precision management.
Inference Network Definition APIs - Implements deep learning architectures using network definition APIs to run optimized inference on graphics processors.
Inference Deployment Engines - Builds convolutional neural network engines from weights for optimized execution on graphics processors.
Model Inference - Processes input data through defined network architectures to generate predictions and outputs.
ONNX-to-TensorRT Conversions - Transforms models into ONNX format or TensorRT engines for lower latency on NVIDIA GPUs.
Calibration-Driven Quantization - Analyzes representative image sets to determine optimal dynamic ranges for converting floating-point values to 8-bit integers.
Model Serializers - Saves compiled network engines to persistent storage to avoid repeating the build process.
Model Weight Converters - Transforms trained weights from TensorFlow and PyTorch into formats compatible with hardware-accelerated engines.
Runtime Weight Loading - Loads trained parameters from external files to initialize hardware-based network engines at runtime.
Neural Network Implementations - Utilizes pre-defined model implementations to run high-performance inference via network definition APIs.
Precision Configuration Tools - Provides utilities for configuring various numerical precision modes to balance throughput and accuracy.
Precision Quantization - Reduces model precision using calibration images to decrease memory consumption and increase inference throughput.
YOLO Object Detectors - Implements object detection architectures to perform high-performance visual recognition across various model scales.
Engine Serialization - Compiles trained model parameters into hardware-specific binary engines to avoid repeated optimization during deployment.
Model Weight Conversions - Converts PyTorch pretrained weights into specialized execution formats for GPU optimization.
Inference Engine Compilers - Compiles neural network architectures into optimized hardware-specific engines for high-performance execution.
Model Conversion - Transforms trained model checkpoints into optimized formats for accelerated hardware deployment.
TensorFlow - Converts TensorFlow model formats into optimized representations for C++ inference execution.
Custom Operator Plugins - Provides custom operator plugins to implement non-standard neural network layers not natively supported by the TensorRT API.
Action Recognition - Classifies human actions in video sequences using a GPU-optimized Temporal Shift Module.
Asynchronous Layer Execution - Executes specialized network layers using non-blocking streams to prevent processor synchronization and increase throughput.
Instance Segmentation Engines - Generates pixel-level masks to identify and isolate individual object instances using high-performance hardware.
Face Detection - Implements RetinaFace architectures to identify faces in images using hardware-accelerated inference.
Image Classification Models - Builds an AlexNet image classification model using convolutional and fully connected layers.
Panoptic Segmentation - Implements panoptic segmentation to identify and mask both individual objects and drivable road areas.
ResNet Variants - Executes image recognition using various ResNet architectures optimized for high-performance acceleration.
Dynamic Tensor Shapes - Provides capabilities for managing variable batch sizes and input dimensions through optimization profiles during model execution.
EfficientNet Implementations - Executes EfficientNet model predictions by converting weights into a serialized hardware engine.
DETR Implementations - Executes end-to-end object detection using transformer architectures optimized for high-performance hardware.
Face Detection - Identifies human faces within images using high-performance neural network architectures optimized for GPU acceleration.
Face Recognition - Executes ArcFace network architectures to calculate similarity scores for biometric identification and verification.
Feed-Forward Neural Networks - Builds a basic feed-forward neural network architecture to perform inference on graphics processors.
Image Classification - Executes high-resolution network architectures to classify images using optimized inference.
Inception Implementations - Implements InceptionV3 and InceptionV4 networks optimized for hardware acceleration via C++.
VGG Implementations - Executes the VGG convolutional neural network to perform image recognition using hardware acceleration.
YOLOv5 Implementations - Implements various versions of the YOLOv5 architecture optimized for graphics processor acceleration.
RepVGG Implementations - Executes VGG-style convolutional networks optimized for graphics processors via the TensorRT API.
SqueezeNet Implementations - Executes the SqueezeNet architecture on graphics processors to perform classification with reduced computation.
YOLO11 Implementations - Implements YOLO11 network architectures optimized for high-performance inference on graphics processors.
YOLOv7 Implementations - Implements YOLOv7 network architectures optimized for high-performance inference on graphics processors.
CenterNet Implementations - Runs an object detection network using a high-performance inference engine optimized for graphics processors.
RefineDet Implementations - Executes object detection networks using optimized hardware acceleration for high-performance predictions.
Scaled-YOLOv4 Implementations - Performs object detection by loading pre-trained weights and configurations into a hardware-accelerated environment.
Inference Stream Parallelism - Distributes inference tasks across multiple hardware contexts and non-blocking streams to increase total device throughput.
Lane Detection - Executes high-performance lane detection inference using optimized network architectures for real-time processing.
License Plate Recognition - Provides an optimized network architecture specifically for identifying and recognizing vehicle license plates.
Transformer-Based Architectures - Executes end-to-end object detection using transformer-based architectures optimized for hardware acceleration.
Swin Transformer Implementations - Executes semantic segmentation tasks using Swin Transformer architectures optimized for graphics processors.
Mixed-Precision Computing - Allows toggling between FP32, FP16, and INT8 compute modes to balance numerical accuracy and hardware throughput.
SENet Implementations - Implements the Squeeze-and-Excitation Network architecture for high-performance inference on graphics processors.
ShuffleNet Implementations - Runs the ShuffleNetV2 architecture using custom plugins for channel shuffling and chunking.
YOLO12 Implementations - Implements YOLO12 network architectures optimized for high-performance inference on graphics processors.
Multi-GPU Execution Scaling - Distributes model execution across multiple devices by creating independent contexts and streams for each processor.
LeNet-5 Implementations - Executes a lightweight convolutional neural network for basic image classification using hardware acceleration.
Normalization-Enhanced Architectures - Executes IBN-Net architectures to enhance learning through combined batch and instance normalization.
MobileNet Implementations - Executes MobileNet architectures on graphics processors to achieve low-latency image classification.
YOLOv1 Implementations - Implements the YOLOv1 network architecture optimized for graphics processor acceleration.
Optical Character Recognition - Executes convolutional recurrent neural networks to perform high-performance optical character recognition.
Text Boundary Detection - Provides hardware-accelerated execution of the PSENet text detection network.
Scene Text Parsing - Implements a full pipeline for segmenting text regions in natural scenes using the PSENet architecture.
Semantic Segmentation - Performs pixel-level image labeling using high-resolution networks to identify object boundaries.
Text Detection Algorithms - Implements the DBNet architecture for high-performance text region detection in images.
Layer Construction - Constructs transformer encoder blocks using primitives for normalization, attention, and feed-forward networks.
Vision Transformer Implementations - Constructs high-performance ViT architectures using hardware primitives for accelerated inference.
Image Classification Architectures - Implements the Inception v1 architecture for image classification and feature extraction on graphics processors.
MnasNet Implementations - Executes the MnasNet architecture using depth multipliers and group convolutions for image classification.
UNet Implementations - Executes image segmentation using the UNet architecture optimized for graphics processors.
Object Detection and Segmentation - Identifies and delineates objects within images using specialized deep learning architectures for detection and segmentation.
YOLOv3 Implementations - Provides a high-performance implementation of the YOLOv3 architecture for real-time object detection.
YOLOv3-SPP Implementations - Implements the YOLOv3-SPP architecture with support for fixed and dynamic input shapes.
YOLOv3-Tiny Implementations - Implements a lightweight YOLOv3-Tiny architecture for efficient, low-latency object detection.
YOLOv10 Implementations - Implements YOLOv10 network architectures optimized for high-performance inference on graphics processors.
YOLOv9 Implementations - Implements various YOLOv9 model scales and architectures optimized for real-time inference.
Text Recognition - Converts images of text into machine-readable strings using optimized CRNN architectures.
Custom Inference Operations - Provides plugin-based solutions for complex neural network operations not natively supported by the inference engine.
Dynamic Inference Batching - Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
Variable Input Shape Support - Processes input data with varying batch sizes and dimensions without requiring a static input size.
Lane-Based Road Representations - Provides a high-performance implementation of ultra-fast lane detection for graphics processors.
Human Pose Detections - Detects and tracks human body keypoints to reconstruct poses using specialized network configurations.
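The dynamic shape entries in the list above rely on optimization profiles, which bound each input dimension by a minimum, an optimum, and a maximum. The sketch below models only that bookkeeping in plain Python. `ShapeProfile` and its methods are hypothetical names for illustration and do not call the TensorRT API.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Per-dimension min/opt/max bounds for one network input."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def __post_init__(self) -> None:
        if not (len(self.min_shape) == len(self.opt_shape) == len(self.max_shape)):
            raise ValueError("min, opt, and max shapes must have the same rank")
        for lo, opt, hi in zip(self.min_shape, self.opt_shape, self.max_shape):
            if not lo <= opt <= hi:
                raise ValueError("each dimension must satisfy min <= opt <= max")

    def accepts(self, shape: Shape) -> bool:
        """Return True if every dimension of `shape` lies within the bounds."""
        return len(shape) == len(self.min_shape) and all(
            lo <= dim <= hi
            for lo, dim, hi in zip(self.min_shape, shape, self.max_shape)
        )


# A profile that varies only the batch dimension (hypothetical 640x640 input).
profile = ShapeProfile(
    min_shape=(1, 3, 640, 640),
    opt_shape=(8, 3, 640, 640),
    max_shape=(16, 3, 640, 640),
)
batch_of_four_ok = profile.accepts((4, 3, 640, 640))
batch_of_32_ok = profile.accepts((32, 3, 640, 640))
```

Checking a request against the profile before execution gives a clear error path for oversized batches, instead of a failure inside the inference call.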

dusty-nv/jetson-inference

8,734

jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti

BVLC/caffe

34,576

Caffe is a high-performance deep learning framework designed for training and deploying deep neural networks. It functions as a machine learning engine and a convolutional neural network library, providing a C++ backend to accelerate computations on both GPUs and CPUs. The system includes a specialized toolset for computer vision, enabling tasks such as object detection, semantic segmentation, and large-scale image retrieval. It supports the deployment of pre-trained models for image and scene recognition, as well as the ability to fine-tune neural network weights for specialized tasks. The

TingsongYu/PyTorch_Tutorial

8,018

This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene

openvinotoolkit/openvino

10,414

OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and

wang-xinyu/tensorrtx
