LLaVA

Features

Multimodal Large Language Models - Processes both image and text inputs to generate coherent natural language responses based on visual context.
Vision-Language Pipelines - Provides scripts and configurations for fine-tuning language models to understand visual features.
Visual Instruction Tuning - Aligns language models with human intent by training on diverse datasets containing paired images and text queries.
Model Fine-Tuning - Refines base language models for visual tasks by adjusting training parameters and managing checkpoints.
Multimodal Training - Refines large language models to process and interpret visual data through fine-tuning.
Inference Servers - Coordinates model workers and controllers to serve visual reasoning predictions across hardware.
Instruction Fine-tuning - Improves responsiveness to multi-modal commands by training on datasets pairing images with text instructions.
Instruction Tuning Pipelines - Refines language models using curated datasets of image-text pairs to improve multi-modal command following.
Feature Alignment - Connects visual data to language models by training a projection layer on image-caption pairs.
Multimodal Foundation Models - Foundational visual instruction tuning framework.
Multimodal Instruction Models - Comprehensive framework for visual instruction tuning.
Multimodal Learning - Provides a comprehensive study of visual instruction tuning.
Vision Language Models - Integrates language models with visual encoders for instruction tuning.
Voice & Multimodal Assistants - Large language-and-vision assistant for multimodal capabilities.
Vision Language Model - Listed in the “Vision Language Model” section of the Ailia Models awesome list.
Inference Deployment - Hosts model workers on private infrastructure to run complex image recognition and reasoning tasks.
Distributed Orchestration - Coordinates a pool of independent model workers to distribute inference tasks across multiple hardware nodes.
Model Evaluation - Measures accuracy and performance of visual reasoning systems by comparing outputs against ground truth datasets.
Inference Runtimes - Loads model weights into memory to process inputs through a synchronous command-line interface.
Vision Encoders - Converts raw image pixels into high-dimensional latent representations for downstream processing.
Benchmarks - Compares generated responses against ground truth data using automated metrics.
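
The Model Evaluation and Benchmarks entries above describe comparing generated responses against ground truth with automated metrics. As a rough illustration, one such metric is exact-match accuracy after light text normalization. This is a minimal sketch: the function names and sample data are hypothetical and are not taken from LLaVA's evaluation scripts.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and ground-truth answers for three image questions.
preds = ["A red bus.", "two", "On the table"]
refs = ["a red bus", "three", "on the table"]
print(exact_match_accuracy(preds, refs))
```

Exact match is strict, so real benchmarks often pair it with softer scoring; the sketch only shows the basic compare-and-count loop.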

Open-source alternatives to LLaVA

Similar open-source projects, ranked by how many features they share with LLaVA.

Vision-CAIR/MiniGPT-4
GitHub stars: 25,679
MiniGPT-4 is a multimodal AI framework and large language model that integrates vision encoders with language models to process and reason about combined image and text inputs. It functions as a vision-language model capable of image-based conversational AI, visual question answering, and multimodal logical reasoning. The project utilizes a pretrained vision-language integration strategy that connects a vision encoder to a language model via a linear projection layer. This approach employs frozen-backbone training to align visual representations with linguistic tokens while keeping the primar
Python
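
The MiniGPT-4 description above mentions connecting a vision encoder to a language model through a linear projection layer while the backbone stays frozen. The toy sketch below illustrates that idea in plain Python, under loud assumptions: `frozen_vision_encoder`, `train_step`, the dimensions, and the target embedding are all hypothetical stand-ins, not MiniGPT-4's code. Only the projection matrix is updated; the encoder is a fixed function.

```python
def matvec(weights: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply an (out_dim x in_dim) matrix by an in_dim vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def frozen_vision_encoder(image: list[float]) -> list[float]:
    """Hypothetical stand-in for a pretrained encoder with no trainable parameters."""
    return [2.0 * pixel for pixel in image]


def train_step(proj: list[list[float]], feature: list[float],
               target: list[float], lr: float = 0.1) -> float:
    """One SGD step on squared error; only the projection is modified.

    Returns the loss measured before the update.
    """
    pred = matvec(proj, feature)
    err = [p - t for p, t in zip(pred, target)]
    for i, e in enumerate(err):
        for j, x in enumerate(feature):
            proj[i][j] -= lr * 2.0 * e * x  # d(sum e^2)/dW_ij = 2 * e_i * x_j
    return sum(e * e for e in err)


image = [0.5, 0.25, 0.1]                 # toy "image" with three values
target_embedding = [1.0, -1.0]           # hypothetical language-side embedding
projection = [[0.0] * 3 for _ in range(2)]  # trainable 2x3 linear map

feature = frozen_vision_encoder(image)   # computed once; the encoder never changes
losses = [train_step(projection, feature, target_embedding) for _ in range(50)]
print(losses[0], losses[-1])
```

Keeping the encoder fixed means only the small projection matrix needs gradients, which is the main appeal of this kind of alignment stage.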
QwenLM/Qwen2-VL
GitHub stars: 19,404
Qwen2-VL is a multimodal large language model and vision language model designed to process and reason across text, images, and video content. It functions as a visual reasoning engine and a visual agent framework, capable of interpreting visual data to perform object detection, document parsing, and spatial reasoning. The model is distinguished by its ability to act as a video understanding model, processing hour-long videos with second-level indexing and event recall. It further differentiates itself through a visual agent capability that interacts with software interfaces and robotic hardw
Jupyter Notebook
microsoft/unilm
GitHub stars: 22,030
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mec
Python
Tags: beit, beit-3, bitnet
EvolvingLMMs-Lab/Otter
GitHub stars: 3,331
Otter is a framework and toolkit for the pretraining, fine-tuning, and evaluation of vision-language models. It provides a pipeline for training large language models to process high-resolution images and video frames, integrating visual encoders with textual token spaces. The system is designed for multi-visual input processing, allowing models to interpret multiple images or video sequences within a single prompt. It supports multi-round conversation management to maintain context across interactions for detailed scene comprehension and visual reasoning. The framework covers a full develop
Python
Tags: artificial-inteligence, chatgpt, deep-learning

See all 30 alternatives to LLaVA

haotian-liu/LLaVA
