30 open-source projects similar to uclanlp/visualbert, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best VisualBERT alternative.
LAVIS is a multimodal large language model framework and vision-language model library. It provides tools for training and evaluating models that integrate visual, textual, and audio data, serving as a cross-modal feature extractor and a zero-shot visual reasoning engine. The framework distinguishes itself by using frozen-backbone integration, where pretrained encoders remain non-trainable while lightweight adapter layers are updated. It employs cross-modal feature alignment to map different representations into a shared embedding space and utilizes a modular model wrapper to swap vision and
Qwen2.5-VL is an autoregressive multimodal transformer designed to process interleaved sequences of text and visual tokens. It integrates visual feature embeddings into a shared language model space to perform cross-modal reasoning and generate coherent responses or structured layout code. The project distinguishes itself through vision-language-action mapping, allowing it to perceive visual interfaces and translate that perception into actionable commands for operating digital screens and robotic hardware. It employs dynamic-resolution image encoding and temporal-frame video indexing to hand
Our servers broke again :(. I have updated the links, so they should work fine now. Sorry for the inconvenience. Please let me know of any further issues. Thanks! --Hao, Dec 03
This is the official PyTorch implementation of the ALBEF paper. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints…
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
ImageBind is a multi-modal embedding model and joint representation learner that maps images, text, audio, and other modalities into a single shared vector space. It functions as a cross-modal retrieval framework designed to bind multiple sensory inputs into one cohesive mathematical embedding. The system uses a contrastive learning architecture to align disparate data types by maximizing the similarity between related samples. This allows the model to perform zero-shot multimodal classification and execute cross-modal data retrieval, such as locating visual content via natural language descr
LLaMA-Adapter is a parameter-efficient fine-tuning framework designed to adapt large language models using a minimal set of trainable parameters. It functions as an instruction tuning tool and a multimodal adapter, allowing pre-trained models to follow human instructions and process non-textual data. The project specializes in the integration of image, video, audio, and sensor data into language models for cross-modal understanding. It enables the customization of LLaMA models through the use of lightweight adapters, which allows for the extraction and storage of learned weights independently
Otter is a framework and toolkit for the pretraining, fine-tuning, and evaluation of vision-language models. It provides a pipeline for training large language models to process high-resolution images and video frames, integrating visual encoders with textual token spaces. The system is designed for multi-visual input processing, allowing models to interpret multiple images or video sequences within a single prompt. It supports multi-round conversation management to maintain context across interactions for detailed scene comprehension and visual reasoning. The framework covers a full develop
This project is a research framework and toolkit designed for training large-scale vision transformers and multimodal language models. It provides a comprehensive suite for vision-language pretraining, enabling the development of models that map images and text into shared latent spaces. The framework is distinguished by its capabilities in high-fidelity image generation and multimodal research, utilizing normalizing flows and variational autoencoders to produce images from text prompts or class labels. It supports the development of both generative and contrastive models, allowing for a wide
This is the official repository of UNITER (ECCV 2020). This repository currently supports finetuning UNITER on NLVR2, VQA, VCR, SNLI-VE, Image-Text Retrieval for COCO and Flickr30k, and Referring Expression Comprehensions (RefCOCO, RefCOCO+, and RefCOCO-g). Both UNITER-base and UNITER-large…
DeepSeek-VL is a multimodal large language model and image-to-text reasoning engine. It functions as a vision-language model and visual question answering system that integrates visual perception with linguistic reasoning to understand and describe images. The project enables multimodal image understanding and document image analysis, specifically processing screenshots of web pages and technical diagrams. It provides capabilities for visual conversational AI, allowing users to interact with visual data to extract insights and perform complex reasoning across different types of visual informa
DeepSeek-VL2 is a multimodal large language model and vision-language system designed to analyze visual scenes and generate descriptive text. It functions as a visual question answering and visual grounding model, capable of extracting information from documents and locating specific objects or regions within images based on textual descriptions. The project utilizes a mixture-of-experts architecture to process combined image and text inputs. It is optimized for inference through incremental prefilling, which reduces the GPU memory requirements on hardware. The model covers multimodal data a
Janus is a multimodal large language model and unified framework that integrates visual understanding and image generation within a single neural network. It functions as both a visual understanding model for analyzing images and a text-to-image generator. The system uses a unified transformer backbone and a multimodal latent space to bridge the gap between text and visual data. This architecture employs decoupled visual encoding and cross-modal tokenization to separate the paths for discriminative understanding and generative tasks, representing images as grids of discrete codes. The projec
Moondream is a small-scale vision language model designed to reason across images to generate captions and answer natural language questions. It functions as an edge-optimized system capable of performing visual question answering, image captioning, and object detection. The project distinguishes itself through a lightweight architecture designed for local inference on embedded devices, workstations, and air-gapped hardware. It supports the execution of models on local GPUs and Apple Silicon to ensure data privacy and low latency. The system's capabilities include identifying precise object
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.
This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports adversarial finetuning of UNITER on VQA, VCR, NLVR2, and SNLI-VE. Adversarial pre-training with in-domain data will be available soon. Both VILLA-base and VILLA-large pre-trained checkpoints are…
This repository serves as a comprehensive research platform and toolkit for advancing machine learning, quantum computing, and large-scale scientific data analysis. It provides foundational frameworks for developing complex algorithmic systems, offering the necessary infrastructure for distributed training, computational graph execution, and high-performance model development. The project distinguishes itself by integrating specialized research domains with robust, privacy-preserving methodologies. It supports diverse scientific discovery through tools for quantum simulation, physics-informed
LLaVA is a multimodal large language model architecture designed to process and interpret both image and text inputs to generate natural language responses. It functions as a research-oriented platform for visual instruction tuning, providing a framework to align language models with human intent through training on diverse datasets of paired images and text queries. The system distinguishes itself through a specialized vision-language training pipeline that connects visual data to language models using projection layers and instruction-based fine-tuning. It supports distributed inference by
SmolLM is a project dedicated to the development of small language models. It focuses on training and fine-tuning compact models that maintain high performance while utilizing fewer parameters. The project emphasizes efficient AI inference and on-device text generation, aiming to enable the deployment of lightweight models on edge devices with limited memory and processing power. It utilizes synthetic data generation to produce artificial datasets that improve the reasoning and training of these AI systems. The system supports a variety of optimization and training capabilities, including we
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
InternLM-XComposer-2.5
This repository contains the implementation of the models described in the paper arXiv:2106.13043. This work is based on our previous works: ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021) and ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
This repo hosts the source code for our AAAI 2020 work Vision-Language Pre-training (VLP). We have released the pre-trained model on the Conceptual Captions dataset and fine-tuned models on COCO Captions and Flickr30k for image captioning and VQA 2.0 for VQA.
This project provides a foundational framework and reference implementation for executing causal language modeling and multimodal reasoning on local systems. It includes a set of core components for managing model assets, a fine-tuning framework, and structural definitions required to instantiate transformer-based architectures. The system is distinguished by its ability to process combined text and image inputs through multimodal transformer models for visual reasoning and document analysis. It also supports the deployment of quantized models, reducing memory footprints through low-precision
Welcome to the official repository for LLM2CLIP! This project leverages large language models (LLMs) as powerful textual teachers for CLIP's visual encoder, enabling more nuanced and comprehensive multimodal learning.
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mec