InternVL

InternVL is a vision-language model framework that fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning. It provides a system for multimodal inference and dialogue, enabling the processing of images and text to answer questions or generate descriptions.

The project is distinguished by its high-resolution image processing, which uses dynamic tiling to maintain detail for images up to 4K resolution, and its chain-of-thought visual reasoning for solving complex mathematical and spatial problems. It also supports temporal frame sampling for video understanding and provides zero-shot capabilities for image classification and multilingual cross-modal retrieval.

The framework covers a broad range of capabilities including optical character recognition, object localization, and semantic image segmentation. It supports distributed multimodal training and fine-tuning via low-rank adaptation, as well as performance optimizations such as weight quantization and model distillation.

Deployment is supported through an OpenAI-compatible REST interface, a web-based chat interface, and a command-line interface with multi-GPU layer distribution.

Features

Vision-Language Models - Fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning.

Multimodal Inference - Provides a framework for multimodal inference, processing images and text to generate descriptions or answer questions.

Chain-of-Thought Modules - Provides step-by-step logical derivations to solve complex mathematical and spatial problems in visual contexts.

Visual Mathematical Reasoning - Applies chain-of-thought reasoning to solve complex quantitative problems based on visual and textual inputs.

Representative Frame Sampling - Supports temporal frame sampling to understand events within video content.

Image Tiling - Uses dynamic tiling to divide high-resolution images into adaptive segments, supporting detailed analysis up to 4K resolution.

Dynamic Tiling - Uses dynamic tiling to maintain detail for images up to 4K resolution.

Multimodal Dialogue and Interaction - Supports interactive conversations that use one or more images as visual context for the dialogue.

Visual - Responds to natural language questions about images using chart analysis, document extraction, and external knowledge.

Adapter Fine-Tuning - Implements Low-Rank Adaptation to update a small subset of parameters for memory-efficient model adaptation.

OpenAI-Compatible APIs - Exposes model inference via standard OpenAI-compatible HTTP endpoints for external client integration.

Text-Based Object Localization - Identifies the bounding box coordinates of specific regions in an image based on text descriptions.

Conversation State Managers - Maintains interaction history and context across a sequence of multimodal exchanges to support follow-up questions.

Distributed Training - Supports distributed multimodal fine-tuning across multiple compute nodes and GPUs to optimize performance.

Domain-Specific Reasoning Evaluation - Evaluates high-level reasoning performance in specialized fields including academic papers and scientific diagrams.

Hallucination Detection - Implements mechanisms to detect and score the tendency of models to produce descriptions of non-existent objects.

Model Fine-Tuning - Provides procedures for adapting pre-trained models to specific datasets or tasks using parameter fine-tuning.

Model Distillation Frameworks - Runs distilled versions of multimodal models to reduce memory requirements for consumer-grade hardware.

Multi-GPU Distribution - Splits model layers across multiple GPUs to execute parameters exceeding single-device memory capacity.

Model Quantization - Applies eight-bit quantization to lower the memory footprint during the inference process.

Long Multimodal Contexts - Handles extended sequences of images and text using flexible position encoding to maintain long-term context.

Optical Character Recognition - Recognizes and extracts text content from images using end-to-end optical character recognition.

Parameter Efficient Fine-Tuning - Supports low-rank adaptation (LoRA) to efficiently fine-tune pre-trained weights on multimodal datasets.

Precision Quantization - Compresses model weights to four-bit or eight-bit precision to reduce memory footprint and accelerate inference.

Preference Optimization - Implements preference optimization to align model reasoning by learning relative quality between response pairs.

Weight Quantization - The project uses four-bit quantization to accelerate inference speed and reduce the total amount of graphics memory required.

Synthetic Reasoning Data Generators - Generates synthetic reasoning data pairs by sampling negative responses to improve model logical capabilities.

Training Configurations - Manages training data distribution through sampling frequency, image tiling, and conditional augmentation.

Visual Data Reasoning Evaluation - Evaluates logical and arithmetic reasoning based on visual data representations like charts and diagrams.

Visual Mathematical Reasoning Evaluation - Tests the model's ability to solve mathematical problems and logical puzzles presented within visual contexts.

Visual Question Answering Evaluation - Tests the ability to answer questions based on images, incorporating external knowledge and specialized content.

Visual Spatial Reasoning Evaluation - Tests the understanding of physical environments, object locations, and spatial relations in real-world scenes.

Image Captioning - Generates descriptive text summaries for images in a zero-shot manner without requiring task-specific training.

Video Understanding - Provides benchmarks and capabilities for assessing temporal comprehension and sequential visual data processing in long-form videos.

OCR Accuracy Evaluators - Tests the accuracy of recognizing and extracting text from documents, infographics, and handwritten math expressions.

Multimodal Training Data Formatters - Implements JSONL-based data formatting to support text, single-image, multi-image, and video inputs for training.

Multi-GPU Layer Distribution - Splits model layers across multiple graphics cards to execute large models that exceed single-device memory.

Inference Load Balancers - Distributes model workers across multiple GPUs to balance inference load and optimize performance.

Comparative Visual Analysis - Enables the processing of multiple images in one prompt for comparative and collective visual analysis.

Video Analysis and Processing - Analyzes sequences of video frames to understand temporal content and answer questions about the video.

3D Spatial Reasoning - Perceives and reasons about three-dimensional visual information to understand complex spatial relationships.

Vision-Language Model Benchmarking - Measures accuracy across vision-language benchmarks using standardized evaluation kits and chain-of-thought prompting.

Web Chat Interfaces - Provides a browser-based chat interface by launching a coordinated controller, worker, and server.

Frontier Reasoning Models - Multimodal reasoning model with advanced training recipes.

Multimodal Foundation Models - Scalable vision-language foundation model for generic tasks.

Multimodal LLM Models - Multimodal model achieving high performance on multidisciplinary benchmarks.

Multimodal Models - Large-scale multimodal model for visual and textual reasoning.

Vision Language Models - Advanced multimodal series using a scalable ViT-MLP-LLM architecture.

OpenGVLabInternVL

Features

Star history