# haotian-liu/LLaVA

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/haotian-liu-llava).**

24,465 stars · 2,731 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/haotian-liu/LLaVA
- Homepage: https://llava.hliu.cc
- awesome-repositories: https://awesome-repositories.com/repository/haotian-liu-llava.md

## Topics

`chatbot` `chatgpt` `foundation-models` `gpt-4` `instruction-tuning` `llama` `llama-2` `llama2` `llava` `multi-modality` `multimodal` `vision-language-model` `visual-language-learning`

## Description

LLaVA is a multimodal large language model architecture designed to process and interpret both image and text inputs to generate natural language responses. It functions as a research-oriented platform for visual instruction tuning, providing a framework to align language models with human intent through training on diverse datasets of paired images and text queries.

The system distinguishes itself through a specialized vision-language training pipeline that connects visual data to language models using projection layers and instruction-based fine-tuning. It supports distributed inference by coordinating a central controller with independent model workers, allowing for the deployment of visual reasoning services across local or cloud-based hardware.

The project includes comprehensive tools for visual model fine-tuning, featuring automated checkpoint-based persistence and multi-stage data pipelines. It also provides automated evaluation procedures to quantify model accuracy against ground truth datasets, alongside both command-line and web-based interfaces for interactive visual reasoning tasks.

## Tags

### Artificial Intelligence & ML

- [Multimodal Large Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-large-language-models.md) — Processes both image and text inputs to generate coherent natural language responses based on visual context.
- [Vision-Language Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-pipelines.md) — Provides scripts and configurations for fine-tuning language models to understand visual features.
- [Visual Instruction Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-instruction-tuning.md) — Aligns language models with human intent by training on diverse datasets containing paired images and text queries.
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/model-fine-tuning.md) — Refines base language models for visual tasks by adjusting training parameters and managing save points. ([source](https://github.com/haotian-liu/LLaVA))
- [Multimodal Training](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-training.md) — Refines large language models to process and interpret visual data through fine-tuning.
- [Inference Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-servers.md) — Coordinates model workers and controllers to serve visual reasoning predictions across hardware.
- [Instruction Fine-tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/instruction-fine-tuning.md) — Improves responsiveness to multi-modal commands by training on datasets pairing images with text instructions. ([source](https://github.com/haotian-liu/LLaVA))
- [Instruction Tuning Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/instruction-tuning-pipelines.md) — Refines language models using curated datasets of image-text pairs to improve multi-modal command following.
- [Feature Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-alignment.md) — Connects visual data to language models by training a projection layer on image-caption pairs. ([source](https://github.com/haotian-liu/LLaVA))
- [Inference Runtimes](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-runtimes.md) — Loads model weights into memory to process inputs through a synchronous command-line interface.
- [Vision Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-encoders.md) — Converts raw image pixels into high-dimensional latent representations for downstream processing.

### DevOps & Infrastructure

- [Inference Deployment](https://awesome-repositories.com/f/devops-infrastructure/inference-deployment.md) — Hosts model workers on private infrastructure to run complex image recognition and reasoning tasks.
- [Distributed Orchestration](https://awesome-repositories.com/f/devops-infrastructure/distributed-orchestration.md) — Coordinates a pool of independent model workers to distribute inference tasks across multiple hardware nodes.

### Testing & Quality Assurance

- [Model Evaluation Frameworks](https://awesome-repositories.com/f/testing-quality-assurance/model-evaluation-frameworks.md) — Measures accuracy and performance of visual reasoning systems by comparing outputs against ground truth datasets.
- [Performance Benchmarking](https://awesome-repositories.com/f/testing-quality-assurance/performance-benchmarking.md) — Compares generated responses against ground truth data using automated metrics. ([source](https://github.com/haotian-liu/LLaVA))
