This project provides a transformer-based object detection model that treats detection as a direct set prediction problem. It predicts bounding boxes and class labels for the objects in an image, and it also supports instance and panoptic segmentation.
The architecture uses a transformer encoder and decoder for end-to-end set prediction, with a Hungarian matcher that assigns predicted boxes one-to-one to ground-truth objects. A convolutional backbone extracts image features, and a set of learnable object queries lets the decoder attend to different image locations.
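The one-to-one assignment between predictions and ground truth can be illustrated with a small sketch. It is not the project's matcher: it searches all assignments by brute force instead of running the Hungarian algorithm (both return a minimum-cost matching, but brute force is only practical for tiny inputs), and it assumes a plain L1 box distance as the matching cost, since the actual cost terms are not described here. The names `l1_box_cost` and `match_predictions` are hypothetical.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """L1 distance between two boxes given as 4-tuples (assumed cost)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(pred_boxes, target_boxes):
    """Find the one-to-one assignment with the lowest total cost.

    Returns (pairs, cost), where pairs is a sorted list of
    (prediction_index, target_index) tuples. Brute force stands in
    for the Hungarian algorithm, so keep the inputs small.
    """
    n_pred, n_tgt = len(pred_boxes), len(target_boxes)
    if n_tgt > n_pred:
        raise ValueError("need at least as many predictions as targets")

    best_cost = float("inf")
    best_pairs = []
    # Each candidate picks a distinct prediction for every target.
    for pred_idx in permutations(range(n_pred), n_tgt):
        cost = sum(
            l1_box_cost(pred_boxes[p], target_boxes[t])
            for t, p in enumerate(pred_idx)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_idx)]
    return sorted(best_pairs), best_cost


# Three predictions (one per object query), two ground-truth boxes.
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
targets = [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]

pairs, cost = match_predictions(preds, targets)
matched = {p for p, _ in pairs}
unmatched = [i for i in range(len(preds)) if i not in matched]
print(pairs)      # [(0, 1), (2, 0)]
print(unmatched)  # [1]
```

Predictions left unmatched, such as index 1 above, correspond to queries that found no object. For realistic sizes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the brute-force loop.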
The project supports distributed training across multiple GPUs and compute nodes and includes tools for computing accuracy metrics such as Average Precision. It also provides utilities for converting between bounding box coordinate formats and for integrating pre-trained backbones and external datasets.
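As an illustration of what a bounding box coordinate conversion looks like, the sketch below converts between a center format (center x, center y, width, height) and a corner format (x0, y0, x1, y1). The choice of these two formats and the function names are assumptions made for the example; the text above does not specify which conversions the project's utilities cover.

```python
def box_cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def box_xyxy_to_cxcywh(box):
    """Convert (x0, y0, x1, y1) to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


center_box = (50.0, 40.0, 20.0, 10.0)
corner_box = box_cxcywh_to_xyxy(center_box)
print(corner_box)                      # (40.0, 35.0, 60.0, 45.0)
print(box_xyxy_to_cxcywh(corner_box))  # (50.0, 40.0, 20.0, 10.0)
```

The center format is convenient when a model regresses box position and size directly, while the corner format is the usual input for overlap computations such as intersection over union.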