# google-research/big_vision

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/google-research-big-vision).**

3,363 stars · 211 forks · Jupyter Notebook · apache-2.0

## Links

- GitHub: https://github.com/google-research/big_vision
- awesome-repositories: https://awesome-repositories.com/repository/google-research-big-vision.md

## Description

This project is a research framework for training large-scale vision transformers and multimodal language models. It covers vision-language pretraining, producing models that map images and text into a shared latent space.

The framework also targets high-fidelity image generation and multimodal research, using normalizing flows and variational autoencoders to produce images from text prompts or class labels. It supports both generative and contrastive models, covering a wide range of vision-language tasks.

The toolkit covers a broad range of perception and generative workflows, including panoptic segmentation, depth estimation, and zero-shot classification. It includes infrastructure for distributed training with parameter sharding across multi-host TPU and GPU clusters, as well as data pipelines for scalable dataset loading and preprocessing.
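To make the zero-shot classification idea concrete, here is a minimal sketch in plain Python. It assumes an image encoder and a text encoder have already mapped their inputs into the same latent space. The toy 3-dimensional vectors, the class names, and the function names `l2_normalize` and `zero_shot_classify` are all hypothetical illustrations, not part of big_vision.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding is closest to the image embedding.

    image_emb: list of floats (stand-in for an image-encoder output).
    class_text_embs: dict of class name -> list of floats
        (stand-in for a text-encoder output, e.g. for "a photo of a cat").
    """
    img = l2_normalize(image_emb)
    scores = {}
    for name, text_emb in class_text_embs.items():
        txt = l2_normalize(text_emb)
        scores[name] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for real encoder outputs (assumption).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.0, 1.0, 0.1],
    "car": [0.1, 0.1, 1.0],
}
label, similarities = zero_shot_classify(image_embedding, text_embeddings)
print(label, similarities)
```

No classifier head is trained here: new classes are added by embedding new text prompts, which is why a shared image-text space enables classification without label-specific training.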

The system manages model lifecycles through pre-trained weight initialization, fine-tuning scripts, and automated evaluation management.

## Tags

### Artificial Intelligence & ML

- [Distributed Training Sharding](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-sharding.md) — Distributes model parameters and optimizer states across multi-host TPU and GPU clusters to enable training of massive architectures. ([source](https://github.com/google-research/big_vision/blob/main/README.md))
- [Large Scale Training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-training.md) — Trains massive vision transformer models across distributed TPU and GPU hardware using sharded parameters. ([source](https://github.com/google-research/big_vision#readme))
- [Dataset Integration](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-integration.md) — Integrates large-scale image and text datasets into the model training pipeline. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/clippo/README.md))
- [Dataset Preparation Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preparation-utilities.md) — Includes scripts for downloading and formatting large-scale external image datasets for model consumption. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/uvim/README.md))
- [Decoder Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/decoder-architectures.md) — Utilizes a decoder-only transformer architecture for autoregressive multimodal sequence generation.
- [Transformer Embedding Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-extraction-models/transformer-embedding-extraction.md) — Generates high-dimensional embeddings from images for use in multimodal language models or classification tasks. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md))
- [Feature Fusion Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-fusion-architectures.md) — Implements architectural patterns for merging visual features and textual tokens into a shared sequence.
- [Text-to-Image Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-pipelines/text-to-image-generators.md) — Produces high-fidelity images from text prompts using a normalizing flow model as a decoder. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/jetformer))
- [Multi-Node Training Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-model-deployments/multi-node-training-scaling.md) — Distributes training across single or multi-host setups using GPUs and TPUs. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/uvim/README.md))
- [Image Data Preprocessing](https://awesome-repositories.com/f/artificial-intelligence-ml/image-data-preprocessing.md) — Defines preprocessing sequences including decoding, cropping, and resizing for image datasets. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/gsam/vit_i1k_gsam_no_aug.py))
- [Image Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation.md) — Produces high-fidelity images from text prompts or class labels using normalizing flows and variational autoencoders.
- [Conditional Image Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation-models/conditional-image-generation.md) — Generates synthetic images guided by class labels using latent sequences from a VAE. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/givt/README.md))
- [Contrastive Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training/contrastive-pre-training.md) — Trains and evaluates models that learn shared representations between images and text through contrastive learning. ([source](https://github.com/google-research/big_vision/blob/main/README.md))
- [Pre-trained Model Checkpoints](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training/pre-trained-model-checkpoints.md) — Loads pre-trained weights for various architectures and scales to serve as backbones for experiments. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/flexivit/README.md))
- [Data Engineering Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-training/data-engineering-pipelines.md) — Implements scalable data loading and engineering pipelines for processing massive datasets. ([source](https://github.com/google-research/big_vision#readme))
- [TPU Training Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/tpu-training-accelerators.md) — Coordinates large-scale training workloads across clusters of tensor processing units.
- [Vision Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks/vision-model-training.md) — Executes machine learning experiments using configurable architectures and training schedules on distributed hardware. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/givt/givt_imagenet2012.py))
- [Vision-Language Pretraining](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks/vision-model-training/vision-language-training/vision-language-pretraining.md) — Develops contrastive and generative models that map images and text into shared latent spaces during pretraining.
- [Model Training Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-toolkits.md) — Provides a comprehensive framework for sharding parameters and managing pipelines to train massive neural networks.
- [Vector-Quantized VAEs](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training/variational-autoencoders/vector-quantized-vaes.md) — Employs vector-quantized variational autoencoders to represent images as discrete codewords or latent vectors.
- [Multimodal Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-models.md) — Develops architectures that process and generate both text and images using shared embeddings. ([source](https://github.com/google-research/big_vision#readme))
- [Normalizing Flow Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/normalizing-flow-encoders.md) — Uses normalizing flow encoders to transform image data into high-fidelity latent representations.
- [Multilingual Image-Text Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/text-model-training/caption-based-training/multilingual-image-text-alignment.md) — Maps images and text into a shared space using captioning-based pretraining and self-supervised losses. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md))
- [Vision Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-transformers.md) — Implements scaling and deployment of vision transformer architectures across distributed GPU and TPU clusters.
- [Encoder-Decoder Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-transformers/encoder-decoder-architectures.md) — Builds large-scale vision architectures using encoder-decoder blocks and multi-head attention for image patches. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/models/proj/cappa/cappa.py))
- [Depth Estimation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-pose-estimations/monocular-depth-estimators/multi-view-depth-estimators/depth-estimation.md) — Predicts pixel-dense depth information from images by processing real-valued latent representations. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/givt/README.md))
- [Panoptic Segmentation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/panoptic-segmentation.md) — Identifies and segments all image instances by combining a VAE and a transformer. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/givt/README.md))
- [Scalable Generative AI Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-training/scalable-generative-ai-model-training.md) — Trains large-scale decoder-only transformers to generate and understand both text and images by maximizing data likelihood. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/jetformer))
- [Sharpness-Aware Minimization](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-computation/sharpness-aware-minimization.md) — Calculates gradients using sharpness-aware minimization to improve model generalization. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/trainers/proj/gsam/gsam.py))
- [Variable Resolution Handling](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-optimizations/aspect-ratio-optimizers/variable-resolution-handling.md) — Processes images with varying resolutions and aspect ratios to maintain visual fidelity. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md))
- [Multimodal Perception Models](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/architectures/multimodal-perception-models.md) — Performs complex visual perception tasks by integrating normalizing flow encoders within transformer architectures. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/jetformer))
- [Computer Vision](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/computer-vision.md) — Enables solving visual perception tasks such as panoptic segmentation and depth estimation using pretrained vision backbones.
- [Vision Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/vision-model-fine-tuning.md) — Adapts large pretrained vision and language models to specific downstream datasets through transfer learning and distillation.
- [Model Distillation Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/model-distillation-tools.md) — Transfers knowledge from large teacher models to smaller student models to improve efficiency. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/distill/README.md))
- [Multi-Task Vision Training](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks/vision-model-training/multi-task-vision-training.md) — Trains unified vision models for segmentation, colorization, and depth prediction using a guiding code approach. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/uvim))
- [Vision-Language Fine-Tunings](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks/vision-model-training/vision-language-training/vision-language-fine-tunings.md) — Transfers a pretrained base model to tasks like captioning or detection through targeted training. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md))
- [Multilingual Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-inference.md) — Enables zero-shot predictions and interactive queries across various languages and visual tasks. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md))
- [Pre-trained Model Transfer](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-trained-model-transfer.md) — Adapts a pre-trained vision model to a new dataset by resetting the output head. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/transfer.py))
- [Real-Valued Generative Models](https://awesome-repositories.com/f/artificial-intelligence-ml/real-valued-generative-models.md) — Produces continuous real-valued entries using multivariate Gaussian mixture models instead of discrete tokens. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/givt/README.md))
- [Sharpness-Aware Minimization](https://awesome-repositories.com/f/artificial-intelligence-ml/sharpness-aware-minimization.md) — Implements sharpness-aware minimization to improve the generalization of large-scale models.
- [Dynamic Image Patching](https://awesome-repositories.com/f/artificial-intelligence-ml/spatiotemporal-patching/dynamic-image-patching.md) — Dynamically adjusts vision model patch sizes to maintain weight integrity during model reuse. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/flexivit/README.md))
- [Dynamic Embedding Resizing](https://awesome-repositories.com/f/artificial-intelligence-ml/spatiotemporal-patching/dynamic-image-patching/dynamic-embedding-resizing.md) — Adjusts image patch dimensionality dynamically to maintain weight compatibility across different resolutions.
- [Zero-Shot Classification Models](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-classification-models.md) — Categorizes images into classes without specific label training by computing embeddings from pretrained models. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/clippo/README.md))
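
The sharpness-aware minimization entries above can be illustrated with a small sketch. This shows the basic SAM update on a toy quadratic objective in plain Python. It is not the GSAM variant from the linked trainer, and the objective, learning rate, and `rho` value are assumptions chosen only for illustration.

```python
import math

def loss(w):
    """Toy quadratic objective with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def grad(w):
    """Analytic gradient of the toy objective."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]

def sam_step(w, lr=0.1, rho=0.05):
    """One basic sharpness-aware minimization step.

    1. Compute the gradient at the current weights.
    2. Move a distance rho along the normalized gradient.
    3. Compute the gradient at that perturbed point.
    4. Apply the perturbed-point gradient to the original weights.
    """
    g = grad(w)
    g_norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # avoid division by zero
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

weights = [3.0, 0.0]
initial_loss = loss(weights)
for _ in range(200):
    weights = sam_step(weights)
final_loss = loss(weights)
print(initial_loss, final_loss)
```

Each step needs two gradient evaluations, which is the main cost of the method. With a fixed `rho`, the iterates settle in a small neighborhood of the minimum rather than exactly on it.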

### Part of an Awesome List

- [Multimodal Large Language Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-large-language-models.md) — Serves as a research framework for training large-scale multimodal models that process images and text.
- [Multimodal Pretraining](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-pretraining.md) — Trains multimodal models across stages to increase resolution and sequence length. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md))
- [Vision Model Fine-Tuning](https://awesome-repositories.com/f/awesome-lists/ai/model-training-and-fine-tuning/vision-model-fine-tuning.md) — Adapts pre-trained vision models to new datasets through a dedicated transfer script and configuration system. ([source](https://github.com/google-research/big_vision#readme))

### Data & Databases

- [Training Data Pipelines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/ml-data-pipelines/training-data-pipelines.md) — Provides scalable pipelines for loading and preprocessing images and text into model-ready formats for training. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/configs/vit_s16_i1k.py))
- [Vision Dataset Loading](https://awesome-repositories.com/f/data-databases/vision-dataset-loading.md) — Integrates standardized image and question-answering datasets for training and evaluating large-scale vision models. ([source](https://github.com/google-research/big_vision/blob/main/big_vision/datasets))
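
As a rough illustration of the crop, resize, and rescale steps that such preprocessing pipelines chain together, here is a minimal sketch in plain Python operating on nested lists. The function names (`central_crop`, `resize_nearest`, `value_range`, `compose`) are hypothetical stand-ins and should not be read as big_vision's actual preprocessing API.

```python
def central_crop(image, size):
    """Crop a size x size window from the center of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def resize_nearest(image, size):
    """Resize a 2D image to size x size using nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]

def value_range(image, low=-1.0, high=1.0):
    """Map 8-bit pixel values from [0, 255] to [low, high]."""
    scale = (high - low) / 255.0
    return [[low + p * scale for p in row] for row in image]

def compose(*steps):
    """Chain preprocessing steps into one callable, applied left to right."""
    def run(image):
        for step in steps:
            image = step(image)
        return image
    return run

# A fake 6x4 "decoded" image whose pixel value encodes its position.
raw = [[(r * 4 + c) * 10 for c in range(4)] for r in range(6)]

preprocess = compose(
    lambda img: central_crop(img, 4),
    lambda img: resize_nearest(img, 2),
    value_range,
)
model_ready = preprocess(raw)
print(model_ready)
```

Expressing the pipeline as an ordered list of small steps keeps each operation testable on its own and lets a config choose the sequence per dataset.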
