YOLO-World

YOLO-World is a vision-language framework and open-vocabulary object detection model. It identifies objects in images and video based on free-form text prompts without requiring predefined category labels.

The system enables the identification of arbitrary objects by fusing image features with text embeddings. It includes a specialized tool for automated image labeling, which generates bounding box annotations for custom datasets using text-based prompts.
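The idea of matching image features against text embeddings can be illustrated with a minimal sketch. It is not the project's architecture: it assumes region features and prompt embeddings already live in a shared space, uses made-up 3-dimensional vectors, and the names `label_regions` and `threshold` are hypothetical. Each region is given the prompt with the highest cosine similarity, or no label when nothing scores high enough.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def label_regions(region_feats, prompt_embs, threshold=0.5):
    """Return (prompt or None, best score) for every region feature.

    Hypothetical helper: a region keeps its best-matching prompt only
    when the similarity reaches `threshold`.
    """
    results = []
    for feat in region_feats:
        scores = {prompt: cosine(feat, emb) for prompt, emb in prompt_embs.items()}
        best = max(scores, key=scores.get)
        label = best if scores[best] >= threshold else None
        results.append((label, scores[best]))
    return results


# Toy embeddings (assumed values, not produced by any real encoder).
prompt_embs = {"dog": [1.0, 0.0, 0.0], "red backpack": [0.0, 1.0, 0.0]}
region_feats = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

labels = label_regions(region_feats, prompt_embs)
for label, score in labels:
    print(label, round(score, 3))
```

Because the vocabulary is just the keys of `prompt_embs`, adding a new category means adding a new prompt embedding rather than retraining a fixed classification head.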

The project provides a deployment pipeline that converts models into quantized ONNX and TFLite formats for real-time inference on resource-constrained edge hardware. It also includes a fine-tuning framework that adapts pre-trained models to custom domains through prompt tuning or reparameterized fine-tuning.
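To make the quantization step concrete, here is a rough sketch of symmetric per-tensor INT8 quantization. It is an assumption chosen for illustration: the project's conversion tooling may use a different scheme (per-channel scales, calibration data, zero points), and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # An all-zero tensor has no range; any non-zero scale works in that case.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight then takes one byte instead of four, at the cost of a rounding error of at most half the scale per value.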

Features

Open-Vocabulary Object Detection - Implements an open-vocabulary object detection model that identifies arbitrary objects using free-form text prompts.

Open-Vocabulary Detection - Implements an open-vocabulary detection pipeline that identifies arbitrary objects using text embeddings instead of fixed labels.

Vision-Language Cross-Attention Fusion - Fuses visual features with text embeddings through cross-attention mechanisms to enable open-vocabulary object recognition.

Vision-Language Models - Uses a vision-language model architecture that fuses image features with text embeddings for object recognition.

Vision-Language Fine-Tuning - Adapts pre-trained vision-language models to custom domains using specialized fine-tuning methods.

YOLO Object Detectors - Employs a real-time object detection system based on the YOLO architecture optimized for low-latency inference.

2D Object Labeling - Ships a tool that automatically generates 2D bounding box annotations using text-based prompts.

Structural Reparameterization - Uses structural reparameterization to adapt pre-trained models to custom domains without sacrificing inference speed.

Edge Object Detection - Provides object detection and tracking optimized for deployment on resource-constrained edge hardware.

Edge AI Runtimes - Provides a runtime optimized for executing detection and tracking on personal and edge devices.

Image Inference Clients - Processes single images, directories, or video files to detect objects described by text prompts.

Edge AI Model Deployment - Optimizes and deploys real-time object detection to run efficiently on local hardware and edge devices.

ONNX Model Exporters - Provides utilities to export the detection model into the standardized ONNX format for cross-platform deployment.

TFLite Model Exporters - Converts trained detectors into ONNX and TFLite formats for deployment on servers and edge devices.

Automated Image Labeling - Generates bounding box annotations for vision datasets using text descriptions to automate image labeling.

Fine-Tuning Frameworks - Offers a framework supporting normal, prompt, and reparameterized fine-tuning to adapt models to custom domains.

Conversion-Time Quantizers - Implements quantization during the model conversion process to shrink weights to 8-bit integers for edge inference.

ONNX and TFLite Model Exporters - Provides a deployment pipeline to convert detection models into quantized ONNX and TFLite formats for edge hardware.

TFLite Exports - Converts models to TFLite format using INT8 quantization for efficient mobile deployment.

Real-Time Video Analysis - Processes live video streams with low-latency pipelines for immediate object detection and tracking.

Real-Time Model Inference on Frames - Processes images and video frames through a streamlined pipeline optimized for real-time, low-latency performance.

Computer Vision - Real-time open-vocabulary object detection.

Object Detection - Listed in the “Object Detection” section of The Incredible PyTorch awesome list.
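The automated labeling entries above describe turning text-prompted detections into bounding box annotations. The sketch below shows one way such output could be written, assuming the widely used YOLO text label format (class index, then normalized center x, center y, width, height). The detection dictionaries, the `min_score` cutoff, and both function names are hypothetical and are not taken from the project's labeling tool.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) into one normalized YOLO label line."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h):
        raise ValueError(f"box {box} does not fit a {img_w}x{img_h} image")
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def detections_to_labels(detections, prompts, img_w, img_h, min_score=0.3):
    """Keep confident detections and map each prompt to its index in `prompts`."""
    class_ids = {prompt: index for index, prompt in enumerate(prompts)}
    return [
        to_yolo_line(class_ids[det["prompt"]], det["box"], img_w, img_h)
        for det in detections
        if det["score"] >= min_score
    ]


# Made-up detections for a 640x480 image (illustrative values only).
prompts = ["dog", "red backpack"]
detections = [
    {"prompt": "dog", "box": (100, 50, 300, 250), "score": 0.91},
    {"prompt": "red backpack", "box": (320, 240, 640, 480), "score": 0.12},
]

lines = detections_to_labels(detections, prompts, img_w=640, img_h=480)
print("\n".join(lines))
```

Filtering by score before writing labels is a design choice worth reviewing per dataset: a low cutoff produces more annotations to correct by hand, while a high one silently drops objects.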

AILab-CVC/YOLO-World
