# opengvlab/internvl

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/opengvlab-internvl).**

10,061 stars · 783 forks · Python · MIT

## Links

- GitHub: https://github.com/OpenGVLab/InternVL
- Homepage: https://internvl.readthedocs.io/en/latest/
- awesome-repositories: https://awesome-repositories.com/repository/opengvlab-internvl.md

## Topics

`gpt` `gpt-4o` `gpt-4v` `image-classification` `image-text-retrieval` `llm` `multi-modal` `semantic-segmentation` `video-classification` `vision-language-model` `vit-22b` `vit-6b`

## Description

InternVL is a vision-language model framework that fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning. It provides a system for multimodal inference and dialogue, enabling the processing of images and text to answer questions or generate descriptions.

The project is distinguished by its high-resolution image processing, which uses dynamic tiling to maintain detail for images up to 4K resolution, and its chain-of-thought visual reasoning for solving complex mathematical and spatial problems. It also supports temporal frame sampling for video understanding and provides zero-shot capabilities for image classification and multilingual cross-modal retrieval.

The framework covers a broad range of capabilities including optical character recognition, object localization, and semantic image segmentation. It supports distributed multimodal training and fine-tuning via low-rank adaptation, as well as performance optimizations such as weight quantization and model distillation.

Deployment is supported through an OpenAI-compatible REST interface, a web-based chat interface, and a command-line interface with multi-GPU layer distribution.

## Tags

### Artificial Intelligence & ML

- [Vision-Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/vision-language-models.md) — Fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning.
- [Chain-of-Thought Modules](https://awesome-repositories.com/f/artificial-intelligence-ml/chain-of-thought-modules.md) — Provides step-by-step logical derivations to solve complex mathematical and spatial problems in visual contexts.
- [Visual Mathematical Reasoning](https://awesome-repositories.com/f/artificial-intelligence-ml/complex-problem-solving/visual-mathematical-reasoning.md) — Applies chain-of-thought reasoning to solve complex quantitative problems based on visual and textual inputs. ([source](https://internvl.readthedocs.io/en/latest/internvl3.0/introduction.html))
- [Representative Frame Sampling](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-modeling/temporal-sequence-processors/representative-frame-sampling.md) — Supports temporal frame sampling to understand events within video content.
- [Image Tiling](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/image-tiling.md) — Uses dynamic tiling to divide high-resolution images into adaptive segments, supporting detailed analysis up to 4K resolution. ([source](https://internvl.readthedocs.io/en/latest/internvl1.1/introduction.html))
- [Dynamic Tiling](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/image-tiling/dynamic-tiling.md) — Uses dynamic tiling to maintain detail for images up to 4K resolution.
- [Adapter Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/adapter-fine-tuning.md) — Implements Low-Rank Adaptation to update a small subset of parameters for memory-efficient model adaptation. ([source](https://internvl.readthedocs.io/en/latest/tutorials/coco_caption_finetune.html))
- [OpenAI-Compatible APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/ai-integration-apis/openai-compatible-apis.md) — Exposes model inference via standard OpenAI-compatible HTTP endpoints for external client integration.
- [Text-Based Object Localization](https://awesome-repositories.com/f/artificial-intelligence-ml/bounding-box-regression/bounding-box-representations/bounding-box-coordinate-predictors/text-based-object-localization.md) — Identifies the bounding box coordinates of specific regions in an image based on text descriptions. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/introduction.html))
- [Conversation State Managers](https://awesome-repositories.com/f/artificial-intelligence-ml/conversation-state-managers.md) — Maintains interaction history and context across a sequence of multimodal exchanges to support follow-up questions. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/deployment.html))
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-frameworks/distributed-training.md) — Supports distributed multimodal fine-tuning across multiple compute nodes and GPUs to optimize performance. ([source](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_chat_llava.html))
- [Domain-Specific Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/domain-specific-reasoning-evaluation.md) — Evaluates high-level reasoning performance in specialized fields including academic papers and scientific diagrams. ([source](https://internvl.readthedocs.io/en/latest/internvl1.1/evaluation.html))
- [Hallucination Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/hallucination-detection.md) — Implements mechanisms to detect and score the tendency of models to produce descriptions of non-existent objects. ([source](https://internvl.readthedocs.io/en/latest/internvl1.1/evaluation.html))
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Provides procedures for adapting pre-trained models to specific datasets or tasks using parameter fine-tuning. ([source](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html))
- [Model Distillation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-distillation-frameworks.md) — Runs distilled versions of multimodal models to reduce memory requirements for consumer-grade hardware. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/introduction.html))
- [Multi-GPU Distribution](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/distributed-deployment-utilities/multi-gpu-distribution.md) — Splits model layers across multiple GPUs to execute parameters exceeding single-device memory capacity.
- [Model Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/quantization/model-quantization.md) — Applies eight-bit quantization to lower the memory footprint during the inference process. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/quick_start.html))
- [Long Multimodal Contexts](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-processing/long-multimodal-contexts.md) — Handles extended sequences of images and text using flexible position encoding to maintain long-term context. ([source](https://internvl.readthedocs.io/en/latest/internvl3.0/introduction.html))
- [Optical Character Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition.md) — Recognizes and extracts text content from images using end-to-end optical character recognition. ([source](https://cdn.jsdelivr.net/gh/opengvlab/internvl@main/README.md))
- [Parameter Efficient Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/parameter-efficient-fine-tuning.md) — Supports low-rank adaptation (LoRA) to efficiently fine-tune pre-trained weights on multimodal datasets.
- [Precision Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/precision-quantization.md) — Compresses model weights to four-bit or eight-bit precision to reduce memory footprint and accelerate inference.
- [Preference Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/preference-optimization.md) — Implements preference optimization to align model reasoning by learning relative quality between response pairs. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/preference_optimization.html))
- [Weight Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization.md) — The project uses four-bit quantization to accelerate inference speed and reduce the total amount of graphics memory required. ([source](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html))
- [Synthetic Reasoning Data Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/synthetic-data-generators/synthetic-reasoning-data-generators.md) — Generates synthetic reasoning data pairs by sampling negative responses to improve model logical capabilities. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/preference_optimization.html))
- [Training Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/training-configurations.md) — Manages training data distribution through sampling frequency, image tiling, and conditional augmentation. ([source](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html))
- [Visual Data Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-data-reasoning-evaluation.md) — Evaluates logical and arithmetic reasoning based on visual data representations like charts and diagrams. ([source](https://internvl.readthedocs.io/en/latest/internvl1.2/evaluation.html))
- [Visual Mathematical Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-mathematical-reasoning-evaluation.md) — Tests the model's ability to solve mathematical problems and logical puzzles presented within visual contexts. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/evaluation.html))
- [Visual Question Answering Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering-evaluation.md) — Tests the ability to answer questions based on images, incorporating external knowledge and specialized content. ([source](https://internvl.readthedocs.io/en/latest/internvl1.1/evaluation.html))
- [Visual Spatial Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-spatial-reasoning-evaluation.md) — Tests the understanding of physical environments, object locations, and spatial relations in real-world scenes. ([source](https://internvl.readthedocs.io/en/latest/internvl1.2/evaluation.html))

### Part of an Awesome List

- [Multimodal Inference](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-inference.md) — Provides a framework for multimodal inference, processing images and text to generate descriptions or answer questions. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/quick_start.html))
- [Multimodal Dialogue and Interaction](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-dialogue-and-interaction.md) — Supports interactive conversations that use one or more images as visual context for the dialogue. ([source](https://cdn.jsdelivr.net/gh/opengvlab/internvl@main/README.md))
- [Visual](https://awesome-repositories.com/f/awesome-lists/ai/question-answering/visual.md) — Responds to natural language questions about images using chart analysis, document extraction, and external knowledge. ([source](https://cdn.jsdelivr.net/gh/opengvlab/internvl@main/README.md))
- [Image Captioning](https://awesome-repositories.com/f/awesome-lists/ai/image-captioning.md) — Generates descriptive text summaries for images in a zero-shot manner without requiring task-specific training. ([source](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_g.html))
- [Video Understanding](https://awesome-repositories.com/f/awesome-lists/ai/video-understanding.md) — Provides benchmarks and capabilities for assessing temporal comprehension and sequential visual data processing in long-form videos. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html))
- [OCR Accuracy Evaluators](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr/ocr-accuracy-evaluators.md) — Tests the accuracy of recognizing and extracting text from documents, infographics, and handwritten math expressions. ([source](https://internvl.readthedocs.io/en/latest/internvl1.2/evaluation.html))
- [Frontier Reasoning Models](https://awesome-repositories.com/f/awesome-lists/ai/frontier-reasoning-models.md) — Multimodal reasoning model with advanced training recipes.
- [Multimodal Foundation Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-foundation-models.md) — Scalable vision-language foundation model for generic tasks.
- [Multimodal LLM Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-llm-models.md) — Multimodal model achieving high performance on multidisciplinary benchmarks.
- [Multimodal Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-models.md) — Large-scale multimodal model for visual and textual reasoning.
- [Vision Language Models](https://awesome-repositories.com/f/awesome-lists/ai/vision-language-models.md) — Advanced multimodal series using a scalable ViT-MLP-LLM architecture.

### Data & Databases

- [Multimodal Training Data Formatters](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/ml-data-pipelines/training-data-pipelines/multimodal-training-data-formatters.md) — Implements JSONL-based data formatting to support text, single-image, multi-image, and video inputs for training. ([source](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html))
- [Multi-GPU Layer Distribution](https://awesome-repositories.com/f/data-databases/memory-optimization-strategies/compositor-memory-limits/model-layer-offloading/multi-gpu-layer-distribution.md) — Splits model layers across multiple graphics cards to execute large models that exceed single-device memory. ([source](https://internvl.readthedocs.io/en/latest/internvl1.5/quick_start.html))

### DevOps & Infrastructure

- [Inference Load Balancers](https://awesome-repositories.com/f/devops-infrastructure/traffic-load-balancers/inference-load-balancers.md) — Distributes model workers across multiple GPUs to balance inference load and optimize performance. ([source](https://internvl.readthedocs.io/en/latest/get_started/local_chat_demo.html))

### Graphics & Multimedia

- [Comparative Visual Analysis](https://awesome-repositories.com/f/graphics-multimedia/image-editing-processing/image-analysis-tools/comparative-visual-analysis.md) — Enables the processing of multiple images in one prompt for comparative and collective visual analysis. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html))
- [Video Analysis and Processing](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing/video-analysis-processing.md) — Analyzes sequences of video frames to understand temporal content and answer questions about the video. ([source](https://internvl.readthedocs.io/en/latest/internvl2.0/introduction.html))

### Hardware & IoT

- [3D Spatial Reasoning](https://awesome-repositories.com/f/hardware-iot/embedded-robotics/robotics-autonomous-systems/visual-scene-interpreters/3d-spatial-reasoning.md) — Perceives and reasons about three-dimensional visual information to understand complex spatial relationships. ([source](https://internvl.readthedocs.io/en/latest/internvl3.0/introduction.html))

### Software Engineering & Architecture

- [Vision-Language Model Benchmarking](https://awesome-repositories.com/f/software-engineering-architecture/performance-reliability/performance-engineering/performance-benchmarking/vision-language-model-benchmarking.md) — Measures accuracy across vision-language benchmarks using standardized evaluation kits and chain-of-thought prompting. ([source](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_chat_llava.html))

### User Interface & Experience

- [Web Chat Interfaces](https://awesome-repositories.com/f/user-interface-experience/web-chat-interfaces.md) — Provides a browser-based chat interface by launching a coordinated controller, worker, and server. ([source](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_chat_llava.html))