What are the best open-source alternatives to VisualBERT?

30 open-source projects similar to uclanlp/visualbert, ranked by shared features. Top picks: salesforce/lavis, qwenlm/qwen2.5-vl, airsplay/lxmert, salesforce/albef, jackroos/vl-bert, facebookresearch/imagebind, opengvlab/llama-adapter, evolvinglmms-lab/otter, google-research/big_vision, chenrocks/uniter.

Is salesforce/lavis a good alternative to VisualBERT?

LAVIS is a multimodal large language model framework and vision-language model library. It provides tools for training and evaluating models that integrate visual, textual, and audio data, serving as a cross-modal feature extractor and a zero-shot visual reasoning engine. The framework distinguish…

Is qwenlm/qwen2.5-vl a good alternative to VisualBERT?

Qwen2.5-VL is an autoregressive multimodal transformer designed to process interleaved sequences of text and visual tokens. It integrates visual feature embeddings into a shared language model space to perform cross-modal reasoning and generate coherent responses or structured layout code. The pro…

Is airsplay/lxmert a good alternative to VisualBERT?

Our servers broke again :(. I have updated the links so that they should work fine now. Sorry for the inconvenience. Please let me know of any further issues. Thanks! --Hao, Dec 03

Is salesforce/albef a good alternative to VisualBERT?

This is the official PyTorch implementation of the ALBEF paper [Blog]. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints…

Is jackroos/vl-bert a good alternative to VisualBERT?

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".

Is facebookresearch/imagebind a good alternative to VisualBERT?

ImageBind is a multi-modal embedding model and joint representation learner that maps images, text, audio, and other modalities into a single shared vector space. It functions as a cross-modal retrieval framework designed to bind multiple sensory inputs into one cohesive mathematical embedding. Th…

Is opengvlab/llama-adapter a good alternative to VisualBERT?

LLaMA-Adapter is a parameter-efficient fine-tuning framework designed to adapt large language models using a minimal set of trainable parameters. It functions as an instruction tuning tool and a multimodal adapter, allowing pre-trained models to follow human instructions and process non-textual dat…

Is evolvinglmms-lab/otter a good alternative to VisualBERT?

Otter is a framework and toolkit for the pretraining, fine-tuning, and evaluation of vision-language models. It provides a pipeline for training large language models to process high-resolution images and video frames, integrating visual encoders with textual token spaces. The system is designed f…

Is google-research/big_vision a good alternative to VisualBERT?

This project is a research framework and toolkit designed for training large-scale vision transformers and multimodal language models. It provides a comprehensive suite for vision-language pretraining, enabling the development of models that map images and text into shared latent spaces. The frame…

Is chenrocks/uniter a good alternative to VisualBERT?

This is the official repository of UNITER (ECCV 2020). This repository currently supports finetuning UNITER on NLVR2, VQA, VCR, SNLI-VE, Image-Text Retrieval for COCO and Flickr30k, and Referring Expression Comprehensions (RefCOCO, RefCOCO+, and RefCOCO-g). Both UNITER-base and UNITER-large…


Open-source alternatives to VisualBERT

30 open-source projects similar to uclanlp/visualbert, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best VisualBERT alternative.

salesforce/LAVIS
Stars: 11,236. Language: Jupyter Notebook.
LAVIS is a multimodal large language model framework and vision-language model library. It provides tools for training and evaluating models that integrate visual, textual, and audio data, serving as a cross-modal feature extractor and a zero-shot visual reasoning engine. The framework distinguishes itself by using frozen-backbone integration, where pretrained encoders remain non-trainable while lightweight adapter layers are updated. It employs cross-modal feature alignment to map different representations into a shared embedding space and utilizes a modular model wrapper to swap vision and…
QwenLM/Qwen2.5-VL
Stars: 19,480. Language: Jupyter Notebook.
Qwen2.5-VL is an autoregressive multimodal transformer designed to process interleaved sequences of text and visual tokens. It integrates visual feature embeddings into a shared language model space to perform cross-modal reasoning and generate coherent responses or structured layout code. The project distinguishes itself through vision-language-action mapping, allowing it to perceive visual interfaces and translate that perception into actionable commands for operating digital screens and robotic hardware. It employs dynamic-resolution image encoding and temporal-frame video indexing to hand…
airsplay/lxmert
Stars: 967. Language: Python.
Our servers broke again :(. I have updated the links so that they should work fine now. Sorry for the inconvenience. Please let me know of any further issues. Thanks! --Hao, Dec 03
salesforce/ALBEF
Stars: 1,758. Language: Python.
This is the official PyTorch implementation of the ALBEF paper [Blog]. This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k, and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints…

Open-source alternatives to VisualBERT

salesforce/LAVIS

QwenLM/Qwen2.5-VL

airsplay/lxmert

salesforce/ALBEF

jackroos/VL-BERT

facebookresearch/ImageBind

OpenGVLab/LLaMA-Adapter

EvolvingLMMs-Lab/Otter

google-research/big_vision

ChenRocks/UNITER

DAMO-NLP-SG/VideoLLaMA3

deepseek-ai/DeepSeek-VL

deepseek-ai/DeepSeek-VL2

deepseek-ai/Janus

descriptinc/lyrebird-wav2clip

vikhyat/moondream

facebookresearch/fairseq

apple/ml-aim

zhegan27/VILLA

google-research/google-research

haotian-liu/LLaVA

huggingface/smollm

huggingface/transformers

InternLM/InternLM-XComposer

AndreyGuzhov/AudioCLIP

jayleicn/ClipBERT

LuoweiZhou/VLP

meta-llama/llama-models

microsoft/LLM2CLIP

microsoft/unilm