This project is a research framework and toolkit designed for training large-scale vision transformers and multimodal language models. It provides a comprehensive suite for vision-language pretraining, enabling the development of models that map images and text into shared latent spaces. The framework is distinguished by its capabilities in high-fidelity image generation and multimodal research, utilizing normalizing flows and variational autoencoders to produce images from text prompts or class labels. It supports the development of both generative and contrastive models, allowing for a wide
This project is a self-supervised vision foundation model based on a vision transformer architecture. It is designed to learn dense visual representations from unlabeled images, serving as a general-purpose backbone for a wide variety of downstream vision tasks. The system is distinguished by its use of self-distillation and masked image modeling to extract semantic and geometric features. It also incorporates an image-text alignment model that maps visual embeddings to textual descriptions, enabling zero-shot image recognition, zero-shot segmentation, and cross-modal retrieval. The project
Transformers.js is a JavaScript library and web machine learning framework designed to run pretrained transformer models directly in the browser. It serves as a client-side inference engine and a wrapper for the ONNX Runtime, enabling the execution of multimodal AI tasks on user devices without the need for a backend server. The library distinguishes itself by providing a unified toolkit for processing text, image, and audio data locally. This architecture supports privacy-preserving model inference and reduces latency by performing all computations on the client's hardware. Its capabilities
This project is a multi-modal image segmentation framework and a text-to-mask vision model. It serves as a SAM-based visual segmenter designed to isolate distinct objects within images and video by converting natural language prompts and other inputs into pixel-level semantic masks. The system functions as a multi-modal image segmentation framework that integrates text, image, and audio signals to generate masks. It includes an interactive video object tracker that isolates and tracks visual entities across video frames using referring images or textual queries. The framework provides capabi