Open-source software frameworks and pre-trained models for identifying and localizing objects within digital images.
YOLOv7 is a PyTorch vision library and real-time inference engine designed for object detection, human pose estimation, and instance segmentation. It provides a framework for detecting and locating multiple objects within images or video streams using neural networks. The system includes tools for custom model training and fine-tuning, allowing pre-trained weights to be adapted to specialized datasets via transfer learning. It also supports model weight export and format conversion to facilitate deployment on production servers and embedded edge devices.
YOLOv7 provides pre-trained models, real-time inference, and GPU-accelerated object detection, which places it squarely within this category.
YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning to high-speed inference and deployment. The framework utilizes a modular neural architecture, allowing users to swap backbone and head components to tailor models for specific visual tasks. What distinguishes this project is its focus on production-ready deployment and model efficiency. It includes a robust model export engine that converts trained networks into standardized formats, enabling high-performance execution across diverse hardware, including edge devices and web browsers. To optimize models for resource-constrained environments, the framework offers advanced techniques such as neural network pruning, weight sparsity, and mixed-precision training, alongside tools for benchmarking performance and fine-tuning pruned models. The platform supports a highly configurable training pipeline that leverages parallel processing and dynamic data augmentation to improve model robustness. Users can manage complex training workflows through externalized configuration files, which decouple model logic from dataset structures. The system also provides sophisticated inference capabilities, including test-time augmentation and model ensembling, to balance detection accuracy with processing latency requirements.
YOLOv5 provides pre-trained models, real-time inference, and GPU-accelerated object detection, which places it squarely within this category.
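As a rough illustration of the pruning and weight sparsity mentioned in the YOLOv5 entry, the sketch below applies magnitude pruning to a flat list of weights: the smallest-magnitude fraction is set to zero. This is a generic technique shown under simplifying assumptions, not YOLOv5's own implementation; the function name `magnitude_prune` and its `sparsity` parameter are hypothetical names chosen for this example.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed.

    Hypothetical helper for illustration only. `sparsity` is the fraction
    of weights to zero out, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")

    # Number of weights to remove, rounded down.
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)

    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])

    return [0.0 if i in drop else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5)
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy, which matches the entry's mention of fine-tuning pruned models.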
PaddleDetection is an object detection framework designed for the end-to-end development, training, and deployment of computer vision models. It provides a comprehensive library of modular neural network architectures and pipelines that support object detection, instance segmentation, and multi-object tracking tasks. The project distinguishes itself through a configuration-driven approach that decouples model components like backbones and heads, allowing for the flexible assembly of custom vision workflows. It incorporates advanced techniques such as anchor-free detection logic, joint detection-embedding architectures for tracking, and knowledge distillation to improve student model efficiency. To ensure consistent performance in real-time scenarios, the framework includes temporal prediction smoothing and multi-scale feature aggregation. The toolkit covers a broad capability surface, including automated training schedules, distributed training support, and extensive data augmentation strategies. It provides specialized tools for analyzing human and vehicle activity, estimating poses, and monitoring traffic patterns. Users can optimize models for diverse environments through quantization, pruning, and export options for standardized inference runtimes. The repository includes a model zoo of pre-trained architectures and supports deployment across server, mobile, and edge hardware via C++ and hardware-accelerated runtimes.
PaddleDetection is a comprehensive object detection framework that provides a wide array of pre-trained models, supports real-time inference with GPU acceleration, and includes tools for both training and deploying vision pipelines.
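The temporal prediction smoothing mentioned in the PaddleDetection entry can be pictured as a filter over per-frame boxes. The sketch below uses an exponential moving average, which is one common choice and an assumption here; the entry does not say which filter PaddleDetection uses. The name `smooth_boxes` and the `alpha` parameter are hypothetical.

```python
def smooth_boxes(frames, alpha=0.5):
    """Exponentially smooth a sequence of (x1, y1, x2, y2) boxes.

    Hypothetical helper for illustration only. `alpha` weights the newest
    observation; lower values give smoother but laggier boxes.
    """
    smoothed = []
    prev = None
    for box in frames:
        if prev is None:
            # The first frame has no history, so it passes through unchanged.
            cur = tuple(float(v) for v in box)
        else:
            # Blend the new observation with the previous smoothed box.
            cur = tuple(alpha * v + (1 - alpha) * p for v, p in zip(box, prev))
        smoothed.append(cur)
        prev = cur
    return smoothed


track = smooth_boxes([(0, 0, 10, 10), (4, 0, 14, 10)], alpha=0.5)
```

The trade-off is the usual one for smoothing filters: less frame-to-frame jitter in exchange for a short delay when an object really does move.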
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification. By utilizing a modular architecture, the platform allows users to swap model components to balance inference speed and accuracy requirements for diverse applications. The framework distinguishes itself through its support for real-time processing and flexible deployment. It includes a streaming inference engine that manages memory usage for large-scale video analysis and a format-agnostic export pipeline that translates trained weights into standardized formats for edge and cloud environments. Beyond standard detection, it supports open-vocabulary segmentation, allowing users to identify objects using text or visual prompts, and provides robust multi-object tracking capabilities to maintain identity persistence across video frames. The platform covers the entire machine learning lifecycle, from dataset retrieval and dynamic data loading to performance benchmarking and experiment tracking. It includes specialized tools for annotating visual results and accessing structured output data, facilitating integration into automated inspection and monitoring workflows. Users can configure training hyperparameters, resume interrupted sessions, and profile model performance to ensure optimal deployment on hardware ranging from mobile devices to high-performance GPUs.
Ultralytics provides a suite of pre-trained YOLO models, real-time inference capabilities, and GPU-accelerated deployment tools, covering the core needs of object detection.
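To make the idea of identity persistence across video frames concrete, the sketch below matches each new detection to the previous-frame box it overlaps most, using intersection over union (IoU). Greedy IoU matching is an assumed, deliberately simple technique; it is not presented as the tracker Ultralytics ships. The names `iou` and `assign_ids` and the `threshold` value are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_ids(tracks, detections, next_id, threshold=0.3):
    """Carry track IDs from the previous frame over to new detections.

    Hypothetical helper for illustration only.
    `tracks` maps track ID -> box from the previous frame.
    Returns (new_tracks, next_id).
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        ((iou(tbox, dbox), tid, di)
         for tid, tbox in tracks.items()
         for di, dbox in enumerate(detections)),
        reverse=True,
    )
    new_tracks, used = {}, set()
    for score, tid, di in pairs:
        if score < threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in new_tracks or di in used:
            continue  # each track and each detection is matched at most once
        new_tracks[tid] = detections[di]
        used.add(di)

    # Detections that matched nothing start new tracks with fresh IDs.
    for di, dbox in enumerate(detections):
        if di not in used:
            new_tracks[next_id] = dbox
            next_id += 1
    return new_tracks, next_id


previous = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
current = [(51, 50, 61, 60), (1, 0, 11, 10), (100, 100, 110, 110)]
tracks, next_id = assign_ids(previous, current, next_id=3)
```

Here the two shifted boxes keep IDs 2 and 1, and the box with no overlap receives the new ID 3. Production trackers typically add motion models and handling for temporarily lost objects, which this sketch omits.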
OpenCV is an open-source computer vision library and visual analysis toolkit. It provides a framework for processing static images and dynamic video frames to analyze visual data and extract information using deep learning. The project functions as a real-time image processing framework, enabling the execution of vision algorithms on live video streams for immediate analysis and data processing. The toolkit covers a broad range of capabilities including image pattern recognition, real-time video analysis, and visual data extraction. It also supports automated visual inspection for detecting defects or changes through image analysis.
OpenCV is a comprehensive computer vision library that provides the foundational tools, pre-trained model support, and GPU-accelerated processing required for real-time object detection and image analysis.
This project provides a deep learning architecture designed to identify and isolate distinct objects within images by generating precise pixel-level masks. It functions as a browser-based inference engine, enabling the execution of complex machine learning models directly within web environments without requiring server-side processing. The system distinguishes itself by utilizing hardware-accelerated execution and parallel processing to achieve real-time segmentation speeds. It supports prompt-based mask decoding, allowing users to generate spatial masks by providing specific points or boxes as inputs. Additionally, the framework includes an image embedding pipeline that converts raw visual data into compact numerical representations, facilitating efficient analysis and downstream task performance. The toolkit encompasses a suite of model optimization utilities that convert and compress machine learning models into standardized, portable formats. These capabilities ensure consistent performance across diverse hardware environments while maintaining high-performance execution through multithreaded memory sharing.
This project is a specialized image segmentation framework focused on pixel-level masking rather than bounding-box object detection, making it a related but distinct computer vision tool.
GoCV is a computer vision library and Go language binding for OpenCV. It serves as an image processing toolkit and deep learning inference engine, providing programmatic access to a wide range of algorithms for image manipulation, object detection, and video analysis. The project differentiates itself through high-performance native bindings and hardware acceleration. It utilizes a foreign function interface to map Go calls to C++ functions and includes a hardware-agnostic backend dispatch to route neural network tasks to computation engines such as CUDA and OpenVINO. The library covers a broad surface of visual analysis capabilities, including camera calibration and correction, feature detection, and marker recognition for QR codes and ArUco markers. It provides tools for object tracking, human pose estimation, and geometric shape analysis. Additionally, it handles fundamental image processing tasks like color space conversion, noise reduction, and matrix operations, alongside GUI window management for interactive visualization. The project supports static binary linking and provides multi-architecture container images to simplify the installation of vision libraries and GPU-accelerated environments.
GoCV provides the necessary bindings to OpenCV's deep learning inference engine, enabling object detection with GPU acceleration and pre-trained model support within the Go ecosystem.
MMDetection3D is an open-source toolbox for 3D perception, providing a unified framework for detecting and segmenting objects in three-dimensional environments. It supports a range of core tasks including monocular 3D object detection from single camera images, LiDAR-based 3D object detection from raw point clouds, and multi-modal fusion that combines camera images with LiDAR data. The toolbox also covers point cloud semantic segmentation, assigning class labels to every point in a scan for scene understanding. The project distinguishes itself through a config-driven pipeline that orchestrates the entire training, evaluation, and inference workflow, with support for distributed training across multiple GPUs and machines. It includes a registry-based module composition system for assembling custom models from encoder, backbone, neck, head, and loss components, and provides built-in support for sparse convolution acceleration using libraries like spconv and MinkowskiEngine. The toolbox also offers a unified dataset format conversion system that transforms raw sensor data from benchmarks such as KITTI, Waymo, and nuScenes into a standardized internal structure, along with checkpoint-based training resumption and mixed precision training for fault-tolerant and efficient workflows. Beyond its core detection and segmentation capabilities, the project provides a comprehensive set of tools for data preparation, augmentation, and evaluation. It includes data structuring for LiDAR, multi-modal, and vision-based detection tasks, point cloud augmentation techniques, and dataset-specific evaluation protocols with metrics like mean Average Precision. The toolbox also supports model deployment, leaderboard submission for autonomous driving benchmarks, and integration with over 500 pre-trained 2D detection models from a shared codebase. Installation is available via pip or the MIM tool, and the project can be run in Docker containers or on Windows for cross-platform compatibility.
This is a comprehensive framework for 3D object detection and segmentation that provides pre-trained models, GPU acceleration, and robust tools for processing complex sensor data, though it focuses on 3D environments rather than standard 2D image object detection.
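The registry-based module composition described in the MMDetection3D entry can be sketched in a few lines: components register themselves under a name, and a config dictionary selects and parameterizes them. The `Registry` class, the `"type"` key convention as used here, and `ToyBackbone` are all hypothetical stand-ins written for this example; they are not the toolbox's actual API.

```python
class Registry:
    """Minimal name -> class registry (illustrative, hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register(self, cls):
        # Used as a decorator: stores the class under its own name.
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # Copy so the caller's config dictionary is left untouched.
        cfg = dict(cfg)
        type_name = cfg.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name!r} is not registered in {self.name!r}")
        # Remaining keys become constructor arguments.
        return self._modules[type_name](**cfg)


MODELS = Registry("models")


@MODELS.register
class ToyBackbone:
    """Placeholder component standing in for a real backbone."""

    def __init__(self, depth):
        self.depth = depth


backbone = MODELS.build({"type": "ToyBackbone", "depth": 18})
```

The appeal of this design is that swapping a backbone, neck, or head becomes a config change rather than a code change.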
Kornia is a differentiable computer vision library and cross-framework tensor vision toolset. It implements vision operations as differentiable tensors to enable integration into deep learning pipelines and supports the transpilation of operations across PyTorch, TensorFlow, JAX, and NumPy. The project provides specialized toolsets for geometric vision and stereo depth, including algorithms for 3D scene reconstruction, camera calibration, and pose estimation. It further distinguishes itself as a differentiable image augmentation framework, applying random geometric and color transformations while maintaining gradient flow. The library covers a broad range of capabilities including 3D spatial analysis, image registration and stitching, and visual feature analysis. It also includes tools for optical flow computation, image topology analysis, and the integration of multimodal vision-language frameworks. Vision pipelines and pre-trained models can be converted into ONNX format for cross-platform hardware inference.
Kornia is a comprehensive differentiable computer vision library that provides the foundational tensor operations and geometric tools necessary to build object detection pipelines, though it functions more as a low-level framework for vision tasks than a turnkey object detection application.
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computations to specialized hardware such as GPUs, NPUs, and dedicated chipsets. It also provides a comprehensive toolkit for model optimization, including quantization, precision conversion, and graph-level transformations, which allow for significant reductions in binary size and latency for both edge and cloud deployments. Beyond core inference, the project includes extensive support for generative AI, offering built-in capabilities for tokenization, chat template formatting, and streaming output generation. It supports complex model architectures through custom operator registration and modular adapter management, ensuring that developers can integrate specialized mathematical operations or fine-tuned model weights into their pipelines. The software is built primarily in C++ and provides language-specific bindings to facilitate integration into various programming environments. It includes robust diagnostic and profiling tools that allow for granular performance analysis, hardware utilization tracking, and debugging of tensor data during the inference process.
This is a high-performance inference engine for executing machine learning models, but it is a foundational runtime rather than a specialized library that provides pre-trained object detection models or specific computer vision tools.
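The quantization mentioned in the inference engine entry usually means storing values as small integers plus a scale and zero point. The sketch below shows asymmetric 8-bit quantization on a plain list of floats. It illustrates the general idea only and is not the runtime's own tooling; `quantize_uint8` and `dequantize` are hypothetical names.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] with a scale and zero point.

    Hypothetical helper for illustration only.
    """
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [(x - zero_point) * scale for x in q]


original = [-0.5, 0.0, 0.25, 1.5]
q, scale, zero_point = quantize_uint8(original)
restored = dequantize(q, scale, zero_point)
```

Each value now needs one byte instead of four (for 32-bit floats), at the cost of a rounding error bounded by roughly one quantization step.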
BiRefNet is a PyTorch image segmentation framework designed for high-precision binary mask generation. It functions as a bilateral image segmentation model used to isolate foreground objects from complex backgrounds, as well as a specialized tool for camouflaged object detection and industrial defect detection. The project is designed for export to the ONNX format, which facilitates cross-platform deployment and inference. It supports custom model fine-tuning on user-provided image and mask datasets to adapt the model for specialized professional use cases. The system covers high-resolution image processing for dichotomous segmentation and automated quality control for industrial inspection. It includes utilities for model accuracy evaluation using standard metrics across benchmark datasets.
This framework focuses on high-precision binary image segmentation and foreground isolation rather than bounding-box object detection.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on automotive-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabilities, including real-time video analytics, object detection and tracking, and image segmentation. It also integrates hardware-accelerated decoding and TensorRT-based inference to optimize model execution on embedded platforms. The project provides a TensorRT inference wrapper and an embedded vision SDK to facilitate the deployment of neural network primitives.
This library provides a comprehensive suite of tools for real-time object detection and inference specifically optimized for embedded GPU hardware, making it a strong choice for edge-based computer vision tasks.
waifu2x-ncnn-vulkan is an AI super-resolution tool and image processor that uses deep learning to increase image resolution and remove visual noise. It is an NCNN-based implementation designed for efficient neural network inference on local hardware. The project utilizes the Vulkan API to provide GPU-accelerated image scaling and noise reduction across diverse graphics hardware. It employs tiled image processing to prevent GPU memory overflow and multi-threaded model loading to reduce initial startup latency. The software covers functional domains including AI image upscaling for maintaining sharpness, image denoising to remove artifacts, and batch processing for converting multiple files across directories.
This tool is designed for image super-resolution and denoising rather than object detection, making it a related deep learning inference utility but not a library for identifying and locating objects.
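The tiled image processing mentioned in the waifu2x-ncnn-vulkan entry bounds memory use by handling one fixed-size region at a time. The sketch below only computes the tile rectangles, with an optional padding margin so neighbouring tiles share context at their borders. The function name `tile_regions` and its `tile` and `pad` parameters are hypothetical, and the tool's actual tiling logic may differ.

```python
def tile_regions(width, height, tile, pad=0):
    """Split an image into tiles of at most `tile` x `tile` pixels.

    Hypothetical helper for illustration only. Returns a list of
    (core, padded) pairs, each an (x0, y0, x1, y1) rectangle.
    `core` rectangles cover the image exactly once; `padded` rectangles
    extend each core by `pad` pixels, clamped to the image bounds.
    """
    regions = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Edge tiles are clipped to the image size.
            x1, y1 = min(x + tile, width), min(y + tile, height)
            padded = (
                max(x - pad, 0),
                max(y - pad, 0),
                min(x1 + pad, width),
                min(y1 + pad, height),
            )
            regions.append(((x, y, x1, y1), padded))
    return regions


regions = tile_regions(width=5, height=3, tile=4, pad=1)
```

Peak memory then depends on the tile size rather than the full image size, which is how tiling avoids GPU memory overflow on large inputs.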
This project is a collection of optional, community-contributed algorithms and specialized vision tools that extend the core OpenCV framework. It serves as a comprehensive library of extra modules for computer vision research, providing advanced toolsets for image processing, visual data analysis, and object detection. The library includes specialized frameworks for augmented reality tracking, biometric face recognition, and three-dimensional pose estimation. It provides distinct capabilities for identifying AR markers, tracking 3D object silhouettes, and performing neural network vulnerability analysis through adversarial input perturbations. The project covers a broad range of high-level capabilities, including camera calibration and spatial alignment, image quality enhancement and super-resolution upscaling, and temporal motion analysis via optical flow. It also provides utilities for data visualization, typography rendering, and the management of large-scale vision datasets.
This repository provides a vast collection of specialized computer vision modules that extend the core OpenCV framework, including support for model inference and various object detection algorithms.