LAVIS is a multimodal large language model framework and vision-language model library. It provides tools for training and evaluating models that integrate visual, textual, and audio data, serving as a cross-modal feature extractor and a zero-shot visual reasoning engine. The framework distinguishes itself by using frozen-backbone integration, where pretrained encoders remain non-trainable while lightweight adapter layers are updated. It employs cross-modal feature alignment to map different representations into a shared embedding space and utilizes a modular model wrapper to swap vision and
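The frozen-backbone idea described for LAVIS can be illustrated with a toy example in plain Python. This is a minimal sketch under stated assumptions, not LAVIS code: `FrozenEncoder`, `Adapter`, and `train_step` are hypothetical names, the "backbone" is a fixed 2x2 matrix, and the adapter is a single linear layer trained by hand-written gradient descent on a squared-error loss.

```python
# Hypothetical toy sketch of frozen-backbone training; none of these names are LAVIS APIs.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

class FrozenEncoder:
    """Stands in for a pretrained backbone; its weights are never updated."""
    def __init__(self, weights):
        self.weights = [row[:] for row in weights]

    def __call__(self, x):
        return matvec(self.weights, x)

class Adapter:
    """Small trainable linear layer placed on top of the frozen encoder."""
    def __init__(self, weights):
        self.weights = [row[:] for row in weights]

    def __call__(self, h):
        return matvec(self.weights, h)

def train_step(encoder, adapter, x, target, lr):
    """One gradient step on squared error; only the adapter weights change."""
    h = encoder(x)                      # features from the frozen backbone
    y = adapter(h)                      # adapter output
    err = [yi - ti for yi, ti in zip(y, target)]
    loss = sum(e * e for e in err)
    # Gradient of the loss w.r.t. adapter weight [i][j] is 2 * err[i] * h[j].
    for i, row in enumerate(adapter.weights):
        for j in range(len(row)):
            row[j] -= lr * 2 * err[i] * h[j]
    return loss

encoder = FrozenEncoder([[1.0, 0.0], [0.0, 1.0]])
adapter = Adapter([[0.0, 0.0], [0.0, 0.0]])
frozen_before = [row[:] for row in encoder.weights]

x, target = [1.0, 2.0], [1.0, -1.0]
losses = [train_step(encoder, adapter, x, target, lr=0.05) for _ in range(20)]
```

In this sketch the loss shrinks at every step while the encoder weights stay identical to their initial values, which is the property the frozen-backbone description relies on.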
AutoGluon is an automated machine learning framework and multimodal library that covers the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It trains models for tabular, image, text, and time series data, and also supports time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, which allows the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
This project is a research library and toolkit for deep learning computer vision, focused on implementing transformer and mixer-based architectures for image classification. It processes visual data by converting images into sequences of patches, allowing standard attention mechanisms to capture global dependencies without relying on traditional convolutional operations. The framework distinguishes itself through its support for multimodal embedding analysis, which maps images and text into a shared latent vector space. This capability enables zero-shot classification and cross-modal retrieva
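The patch conversion step mentioned above can be sketched in plain Python. This is an illustrative example under assumptions, not the library's implementation: `image_to_patches` is a hypothetical helper, the image is a single-channel grid stored as a list of rows, and patches are non-overlapping and flattened in row-major order.

```python
def image_to_patches(image, patch_size):
    """Split an H x W grid into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom),
    so the result can be treated as a sequence of tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 "image" whose pixel values are 0..15, split into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
```

Each flattened patch would typically be projected to an embedding before attention is applied; that projection is omitted here.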
ImageBind is a multi-modal embedding model and joint representation learner that maps images, text, audio, and other modalities into a single shared vector space. It functions as a cross-modal retrieval framework designed to bind multiple sensory inputs into one joint embedding. The system uses a contrastive learning architecture to align disparate data types by maximizing the similarity between related samples relative to unrelated ones. This allows the model to perform zero-shot multimodal classification and execute cross-modal data retrieval, such as locating visual content via natural language descr
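Contrastive alignment and cross-modal retrieval in a shared space can be sketched as follows. This is a minimal illustration, not ImageBind's code: the description does not say which loss is used, so an InfoNCE-style objective is assumed here, and the embeddings, the temperature value, and the helper names `cosine`, `info_nce`, and `retrieve` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style loss (assumed): anchors[i] should match positives[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# Made-up 2-D embeddings: image i and text i are meant to be a related pair.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
text_embs = [[0.9, 0.2], [0.2, 0.9]]

aligned_loss = info_nce(image_embs, text_embs)
shuffled_loss = info_nce(image_embs, text_embs[::-1])  # deliberately mismatched pairs
best_image = retrieve(text_embs[0], image_embs)        # text query, image candidates
```

Correctly paired embeddings give a lower loss than mismatched ones, and the same similarity function serves zero-shot classification when the candidates are embeddings of class descriptions.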