# idea-research/groundingdino

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/idea-research-groundingdino).**

9,738 stars · 995 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/IDEA-Research/GroundingDINO
- Homepage: https://arxiv.org/abs/2303.05499
- awesome-repositories: https://awesome-repositories.com/repository/idea-research-groundingdino.md

## Topics

`object-detection` `open-world` `open-world-detection` `vision-language` `vision-language-transformer`

## Description

GroundingDINO is a deep learning vision model and open-vocabulary object detector designed to map natural language prompts to spatial coordinates. It functions as a text-to-bounding-box framework that enables zero-shot image localization, allowing the system to identify and locate arbitrary objects without requiring predefined classes or specific training for those categories.

Its distinguishing feature is open-set visual recognition: visual features are matched against the words of a natural language description, so detections are not limited to a fixed label set. It supports text-guided image localization and can isolate a specific object by selecting the detections that score highest against a given phrase.

The system can filter detections by confidence threshold and includes tools for evaluating localization accuracy against standard datasets. It also provides integration code that passes detected box coordinates to generative models for controllable image editing.
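The threshold-based filtering and phrase-specific scoring described above can be illustrated with a small, dependency-free sketch. It assumes each candidate box carries one similarity score per prompt token, and it uses two cut-offs that mirror the `box_threshold` and `text_threshold` parameters of the repository's inference helper. The function name `filter_detections`, the toy tokens, and the scores are hypothetical and exist only for illustration; this is not the library's implementation.

```python
from typing import List, Tuple


def filter_detections(
    scores: List[List[float]],
    tokens: List[str],
    box_threshold: float = 0.35,
    text_threshold: float = 0.25,
) -> List[Tuple[int, float, str]]:
    """Return (query_index, confidence, phrase) for every query that is kept.

    scores[i][j] is the (assumed) similarity in [0, 1] between candidate
    box i and prompt token j.
    """
    kept = []
    for query_index, row in enumerate(scores):
        # A box's confidence is its best score over all prompt tokens.
        confidence = max(row)
        if confidence <= box_threshold:
            continue  # no token matches this box strongly enough
        # The phrase is built from the tokens that clear the text threshold.
        phrase = " ".join(
            token for token, score in zip(tokens, row) if score > text_threshold
        )
        kept.append((query_index, confidence, phrase))
    return kept


# Toy data: three candidate boxes scored against three prompt tokens.
tokens = ["black", "dog", "chair"]
scores = [
    [0.60, 0.80, 0.05],  # strong match for "black dog"
    [0.10, 0.20, 0.15],  # below the box threshold, dropped
    [0.02, 0.10, 0.55],  # matches "chair"
]
detections = filter_detections(scores, tokens)
```

Raising `box_threshold` trades recall for precision, while `text_threshold` only controls which words end up in the label attached to a box that was kept.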

## Tags

### Artificial Intelligence & ML

- [Open-Vocabulary Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/open-vocabulary-object-detection.md) — The library identifies and locates arbitrary objects within an image by matching visual features to provided natural language descriptions. ([source](https://github.com/IDEA-Research/GroundingDINO#readme))
- [Open-Vocabulary Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference/open-vocabulary-detection.md) — Enables the detection of any arbitrary object based on natural language descriptions without predefined classes.
- [Text-to-Bounding-Box Models](https://awesome-repositories.com/f/artificial-intelligence-ml/bounding-box-regression/bounding-box-representations/bounding-box-coordinate-predictors/text-to-bounding-box-models.md) — Converts natural language prompts into precise spatial coordinates within an image to identify objects.
- [Contrastive Learning Models](https://awesome-repositories.com/f/artificial-intelligence-ml/contrastive-learning-models.md) — Aligns visual and textual representations in a shared vector space using contrastive learning loss.
- [Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/object-detection.md) — Locates arbitrary objects in images using natural language descriptions rather than fixed pretrained categories.
- [Open-Vocabulary Detectors](https://awesome-repositories.com/f/artificial-intelligence-ml/object-detection/small-object-detectors/open-vocabulary-detectors.md) — Functions as an open-vocabulary object detector that locates arbitrary items via natural language matching.
- [Text-Guided Image Localization](https://awesome-repositories.com/f/artificial-intelligence-ml/text-guided-image-localization.md) — Finds the exact coordinates of specific items within an image by matching visual features to provided text prompts.
- [Zero-Shot Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference.md) — Identifies bounding boxes for items without requiring predefined classes or task-specific training.
- [Attention Mechanisms](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms.md) — Implements cross-attention mechanisms to align specific text tokens with corresponding visual regions in images.
- [Phrase-Specific Isolation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/object-detection/phrase-specific-isolation.md) — Extracts precise object locations by targeting the highest text similarity scores for specific words within a sentence. ([source](https://github.com/IDEA-Research/GroundingDINO#readme))
- [Deep Learning Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-architectures.md) — Implements a deep learning architecture that integrates language and image embeddings for open-set recognition.
- [Feature Fusion Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-fusion-architectures.md) — Fuses image and text embeddings into a shared latent space to correlate visual features with language.
- [Object Query Mechanisms](https://awesome-repositories.com/f/artificial-intelligence-ml/object-query-mechanisms.md) — Utilizes learnable object queries and a transformer decoder to iteratively refine bounding box predictions.
- [Sequence Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-decoding-models/sequence-decoders.md) — Employs a sequence decoder to predict spatial coordinates and class labels from multimodal features.
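
Several of the tags above refer to predicted spatial coordinates. As a minimal sketch, assuming boxes arrive in normalized center format `(cx, cy, w, h)` with values in `[0, 1]` (the format the repository's demo code converts from), the conversion to pixel corner coordinates might look like the following. The function name `cxcywh_to_xyxy_pixels` is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy_pixels(box: Box, image_width: int, image_height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    # Move from center/size to corners, then scale to the image dimensions.
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return (x_min, y_min, x_max, y_max)


# A box centered in a 640x480 image, half as wide and a quarter as tall.
pixel_box = cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.25), 640, 480)
```

Keeping boxes normalized until the last step means the same predictions can be drawn on a resized copy of the image without rerunning the model.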