Training Frameworks - Provides an open-source framework for building and fine-tuning small vision-language models.
Image-Text Prompt Inference - Replaces image placeholder tokens in a text prompt with projected visual features, then generates descriptive or conversational responses from the combined sequence.
Two-Stage Fine-Tuning Pipelines - Trains the projection layer alone on image-caption pairs, then jointly fine-tunes the projection layer and selected LLM layers.
Partial-Layer Fine-Tuning - Updates only selected LLM transformer layers while keeping the visual encoder and the remaining LLM layers frozen.
Vision-Language Training - Trains a multimodal model that processes images and text together by adding a visual encoder and projection layer to a language model.
From-Scratch Training - Builds a multimodal model from scratch by adding a visual encoder and projection layer to a small language model.
Vision-Language Fine-Tuning - Fine-tunes a pretrained vision-language model by training only the projection layer and selected LLM layers.
Visual Tokenizers - Converts input images into patch tokens via a frozen encoder and projects them into the language model's embedding space.
Projection Layers - Maps visual patch tokens into the language model's embedding space using a trainable projection layer.
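
The entries above describe three mechanics that can be sketched in a few lines: projecting patch features into the LLM embedding space, splicing them in place of an image placeholder token, and choosing which parameters train in each stage. The sketch below uses plain Python lists rather than a tensor library. All names in it (IMAGE_TOKEN_ID, project, splice_image_tokens, trainable_params, and the "projector." / "llm.layers.N." parameter prefixes) are hypothetical and chosen for illustration; they are not taken from the framework itself.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical placeholder id for the image token; a real tokenizer would define its own.
IMAGE_TOKEN_ID = -1


def project(patch: Sequence[float],
            weight: Sequence[Sequence[float]],
            bias: Sequence[float]) -> Vector:
    """Linear projection of one visual patch feature into the LLM embedding size."""
    return [sum(w * x for w, x in zip(row, patch)) + b
            for row, b in zip(weight, bias)]


def splice_image_tokens(token_ids: Sequence[int],
                        embed_table: Dict[int, Vector],
                        visual_embeds: Sequence[Vector],
                        image_token_id: int = IMAGE_TOKEN_ID) -> List[Vector]:
    """Build the input embedding sequence, replacing each image placeholder
    with the full run of projected patch embeddings."""
    out: List[Vector] = []
    for tid in token_ids:
        if tid == image_token_id:
            out.extend(visual_embeds)      # one placeholder expands to many patch tokens
        else:
            out.append(embed_table[tid])   # ordinary text token lookup
    return out


def trainable_params(stage: int,
                     param_names: Sequence[str],
                     tuned_layers: Sequence[int]) -> List[str]:
    """Stage 1: projection layer only. Stage 2: projection plus selected LLM layers.
    The visual encoder and all other LLM layers stay frozen in both stages."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    selected = []
    for name in param_names:
        if name.startswith("projector."):
            selected.append(name)
        elif stage == 2 and any(name.startswith(f"llm.layers.{i}.") for i in tuned_layers):
            selected.append(name)
    return selected


# Toy example: 3-dimensional patch features projected to a 2-dimensional embedding space.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
visual = [project(p, weight, bias) for p in patches]

embed_table = {0: [0.1, 0.1], 1: [0.2, 0.2]}
sequence = splice_image_tokens([0, IMAGE_TOKEN_ID, 1], embed_table, visual)

names = ["vision.blocks.0.weight", "projector.weight",
         "llm.layers.0.weight", "llm.layers.1.weight", "llm.layers.11.weight"]
stage1 = trainable_params(1, names, tuned_layers=[1])
stage2 = trainable_params(2, names, tuned_layers=[1])
```

In the toy example, a three-token prompt with one placeholder becomes a four-vector sequence because the placeholder expands into two patch embeddings. In a tensor framework the same selection would typically be applied by toggling each parameter's gradient flag rather than returning names.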