Otter is a framework and toolkit for pretraining, fine-tuning, and evaluating vision-language models. It provides a pipeline for training large language models to process high-resolution images and video frames by connecting visual encoders to the model's text token space.
The system is designed to handle multiple visual inputs, so a model can interpret several images or video sequences within a single prompt. It also manages multi-round conversations, carrying context across turns to support detailed scene comprehension and visual reasoning.
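To make the idea of multiple visual inputs inside a multi-round conversation more concrete, here is a minimal sketch in plain Python. It is not Otter's actual API: the `Turn` and `Conversation` classes, the `<image>` placeholder string, and the file names are all assumptions chosen for illustration. The sketch only shows how earlier turns and their images might be kept together so that a later question can refer back to them.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Turn:
    """One conversation round: a speaker, some text, and any attached visuals."""
    role: str                                         # "user" or "assistant"
    text: str
    images: List[str] = field(default_factory=list)   # image paths or frame IDs


@dataclass
class Conversation:
    """Hypothetical container that keeps every prior turn as context."""
    turns: List[Turn] = field(default_factory=list)

    def add(self, role: str, text: str, images: Sequence[str] = ()) -> None:
        self.turns.append(Turn(role, text, list(images)))

    def render(self, image_token: str = "<image>") -> str:
        """Flatten all turns into one prompt, one placeholder per visual input."""
        lines = []
        for turn in self.turns:
            placeholders = image_token * len(turn.images)
            lines.append(f"{turn.role}: {placeholders}{turn.text}")
        return "\n".join(lines)

    def all_images(self) -> List[str]:
        """Visual inputs in the same order as their placeholders in the prompt."""
        return [img for turn in self.turns for img in turn.images]


# Illustrative data only: two frames in the first round, one more in the third.
conv = Conversation()
conv.add("user", "What changed between these two frames?",
         ["frame_0.png", "frame_1.png"])
conv.add("assistant", "The car moved to the left lane.")
conv.add("user", "Is it still visible in this one?", ["frame_2.png"])

prompt = conv.render()
images = conv.all_images()
```

Keeping the placeholder order aligned with the image list is the design point here: whatever encoder consumes `images` can then map each visual input back to its position in the text.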
The framework covers the full development lifecycle: foundational pretraining, supervised fine-tuning, and visual instruction tuning. It also includes a dedicated evaluation suite that measures reasoning accuracy and overall performance on combined visual and textual inputs.
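The description does not say which metrics the evaluation suite reports. As one illustration of what "reasoning accuracy" could mean in practice, the sketch below computes exact-match accuracy over (prediction, reference) answer pairs. The function names, the normalization rule, and the sample answers are assumptions made for this example, not part of Otter.

```python
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(answer.lower().split())


def exact_match_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs that match after normalization."""
    pairs = list(pairs)
    if not pairs:
        return 0.0  # assumed convention: an empty evaluation set scores zero
    correct = sum(normalize(pred) == normalize(ref) for pred, ref in pairs)
    return correct / len(pairs)


# Made-up model answers paired with reference answers, for illustration only.
results = [
    ("Two dogs", "two dogs"),
    ("a red car", "a blue car"),
    ("  left  lane ", "left lane"),
    ("yes", "no"),
]
score = exact_match_accuracy(results)
```

Exact match is strict and tends to undercount correct free-form answers, so a real evaluation suite would likely combine it with other measures.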