# idea-research/grounded-segment-anything

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/idea-research-grounded-segment-anything).**

17,633 stars · 1,593 forks · Jupyter Notebook · Apache-2.0

## Links

- GitHub: https://github.com/IDEA-Research/Grounded-Segment-Anything
- Homepage: https://arxiv.org/abs/2401.14159
- awesome-repositories: https://awesome-repositories.com/repository/idea-research-grounded-segment-anything.md

## Topics

`3d-whole-body-pose-estimation` `automatic-labeling-system` `caption` `data-generation` `image-editing` `open-vocabulary-detection` `open-vocabulary-segmentation` `speech`

## Description

Grounded-Segment-Anything is a suite of tools for multimodal visual analysis, text-based segmentation, and generative image editing. It pairs text-prompted bounding-box detection (Grounding DINO) with high-precision mask prediction (Segment Anything), so the combination works as a text-driven image segmenter and an automated visual labeling tool.

The project enables text-driven image editing by identifying objects through natural language to perform inpainting and element replacement. It further extends visual analysis into three dimensions, allowing for 3D human reconstruction and the generation of 3D bounding boxes from text prompts.

The system covers a broad range of computer vision capabilities, including zero-shot visual recognition, object detection, and the automated generation of pseudo-labels for large-scale datasets. It also provides interfaces for conversational visual analysis and audio-driven object segmentation.
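
The two-stage flow described above (a text prompt yields boxes, and each box yields a mask) can be illustrated with a minimal sketch. This is not the project's API: `grounded_segment`, `toy_detect`, and `toy_segment` are hypothetical names, and the toy stand-ins operate on a tiny list-of-lists image so the example runs without any model weights. In the real project, a pre-trained detector and a pre-trained segmenter would fill those two roles.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[int]]           # toy grayscale image: rows of pixel values
Mask = List[List[bool]]           # same shape as the image


@dataclass
class Detection:
    phrase: str   # the text phrase that was grounded
    box: Box      # where the detector located it
    mask: Mask    # pixel-level mask predicted inside that box


def grounded_segment(
    image: Image,
    prompt: str,
    detect: Callable[[Image, str], List[Tuple[str, Box]]],
    segment: Callable[[Image, Box], Mask],
) -> List[Detection]:
    """Chain a text-prompted detector and a box-prompted segmenter."""
    detections = []
    for phrase, box in detect(image, prompt):
        detections.append(Detection(phrase, box, segment(image, box)))
    return detections


def toy_detect(image: Image, prompt: str) -> List[Tuple[str, Box]]:
    """Hypothetical stand-in: one box around every non-zero pixel."""
    ys = [y for y, row in enumerate(image) if any(row)]
    xs = [x for row in image for x, v in enumerate(row) if v]
    if not ys:
        return []
    return [(prompt, (min(xs), min(ys), max(xs) + 1, max(ys) + 1))]


def toy_segment(image: Image, box: Box) -> Mask:
    """Hypothetical stand-in: non-zero pixels inside the box are foreground."""
    x0, y0, x1, y1 = box
    return [
        [y0 <= y < y1 and x0 <= x < x1 and v > 0 for x, v in enumerate(row)]
        for y, row in enumerate(image)
    ]


image = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 0, 9, 0],
]
results = grounded_segment(image, "cat", toy_detect, toy_segment)
for det in results:
    print(det.phrase, det.box)
```

Passing the detector and segmenter as callables keeps the chaining logic independent of any particular model, which mirrors how the description frames the project: a composition of separately trained components.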

## Tags

### Artificial Intelligence & ML

- [Language-Based Segmentation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/language-based-segmentation.md) — Provides high-precision image segmentation by interpreting natural language prompts to isolate specific objects. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Multimodal Analysis Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-analysis-tools.md) — Provides a framework that processes text, audio, and images to perform object detection and 3D mesh estimation.
- [Text-Based Object Localization](https://awesome-repositories.com/f/artificial-intelligence-ml/bounding-box-regression/bounding-box-representations/bounding-box-coordinate-predictors/text-based-object-localization.md) — Maps natural language descriptions to specific 2D spatial coordinates to locate objects in an image.
- [3D](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/object-detection/3d.md) — Extends two-dimensional segmentation masks into three-dimensional bounding boxes by projecting image coordinates.
- [Text-Prompted Masking](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/object-mask-generators/text-prompted-masking.md) — Produces precise masks for objects described in text by combining object detection with segmentation. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Prompt-Based Masking](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation/image-editing/generative-masking/prompt-based-masking.md) — Generates precise pixel-level segmentation masks by feeding grounding coordinates into a pre-trained foundation model.
- [Multimodal AI Pipeline Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-ai-pipeline-orchestration.md) — Chains together speech-to-text, object detection, and segmentation models into a unified multimodal processing chain.
- [Vision-Language Grounding Models](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-grounding-models.md) — Implements a pipeline that maps natural language prompts to spatial bounding boxes for object grounding.
- [3D Bounding Box Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/bounding-box-detection/3d-bounding-box-generation.md) — Extends 2D segmentation prompts into a 3D environment to produce three-dimensional object bounding boxes. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Conversational Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/conversational-interfaces.md) — Implements a conversational interface allowing users to describe images, detect objects, and replace elements via a chatbot. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Text-Guided Inpainting](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation/image-editing/generative-masking/text-guided-inpainting.md) — Substitutes target objects identified by text with new generated objects using diffusion-based inpainting. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Visual Conversational Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-conversational-analysis.md) — Uses a chat interface to identify, describe, and label objects within images based on natural language prompts.
- [Zero-Shot Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference.md) — Identifies and isolates arbitrary objects without requiring class-specific training by leveraging pre-trained weights.

### Part of an Awesome List

- [Image Editing](https://awesome-repositories.com/f/awesome-lists/ai/image-editing.md) — Replaces or modifies specific objects in a visual asset by identifying them through text and applying inpainting.
- [3D Human Mesh Recovery](https://awesome-repositories.com/f/awesome-lists/ai/3d-human-mesh-recovery.md) — Tracks people in images using text prompts to recover their full 3D body pose and shape. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Human Reconstruction](https://awesome-repositories.com/f/awesome-lists/ai/human-reconstruction.md) — Recovers a person's full 3D body pose and shape by tracking them in an image via a text prompt.

### Graphics & Multimedia

- [Generative Image Editing Tools](https://awesome-repositories.com/f/graphics-multimedia/generative-image-editing-tools.md) — Enables text-driven object identification and replacement using inpainting and latent diffusion models.
- [Latent Inpainting Masks](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/face-portrait-manipulation/image-masking/face-mask-generation/latent-inpainting-masks.md) — Replaces identified image regions with new content by masking segments and sampling from a latent diffusion model.

### Data & Databases

- [Automated Labelers](https://awesome-repositories.com/f/data-databases/asset-management/automated-labelers.md) — Automatically generates bounding boxes and masks for large image datasets to create pseudo labels. ([source](https://github.com/idea-research/grounded-segment-anything#readme))
- [Model-Assisted Labelers](https://awesome-repositories.com/f/data-databases/label-based-data-selection/metadata-labelers/model-assisted-labelers.md) — Automatically creates image pseudo-labels, bounding boxes, and masks using recognition and captioning models.
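
As a rough illustration of what a pseudo-label might look like, the sketch below turns a binary mask into a record with a bounding box and an area. The `[x, y, width, height]` box layout follows the COCO annotation convention. The function name `mask_to_pseudo_label` and the record's field names are assumptions made for this example, not the project's output format.

```python
from typing import List, Optional

Mask = List[List[bool]]


def mask_to_pseudo_label(mask: Mask, category: str, image_id: int) -> Optional[dict]:
    """Derive a simple annotation record from a binary mask.

    Returns None when the mask is empty. Field names are illustrative only.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    x0, y0 = min(xs), min(ys)
    width = max(xs) - x0 + 1
    height = max(ys) - y0 + 1
    return {
        "image_id": image_id,
        "category": category,
        "bbox": [x0, y0, width, height],  # COCO-style [x, y, w, h]
        "area": len(xs),                  # one entry in xs per foreground pixel
    }


mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, False, True, False],
]
label = mask_to_pseudo_label(mask, category="cat", image_id=1)
empty = mask_to_pseudo_label([[False, False]], category="cat", image_id=2)
print(label)
```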
