Techniques for mapping multi-modal data into shared vector spaces by maximizing similarity between paired samples.
Distinguishing note: Focuses on aligning visual and textual embeddings via a contrastive loss, as distinct from general feature extraction.
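A contrastive alignment objective like the one described above is commonly written as a symmetric, InfoNCE-style cross-entropy over an N×N similarity matrix. The following is a minimal NumPy sketch under stated assumptions, not code from any listed repository. The function names, the convention that row i of each matrix is a matched image-text pair, and the fixed temperature of 0.07 are all illustrative choices.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    # Subtract the max along the axis for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(
    image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07
) -> float:
    """Illustrative InfoNCE-style loss over a batch of N paired embeddings.

    Assumption: row i of image_emb and row i of text_emb form a matching pair;
    every other row in the batch serves as a negative.
    """
    # L2-normalize so that dot products equal cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) matrix of image-to-text similarities, sharpened by the temperature.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])

    # Cross-entropy with the diagonal (true pairs) as targets, in both directions:
    # rows normalize over texts (image -> text), columns over images (text -> image).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_i2t + loss_t2i) / 2)
```

Minimizing this loss pulls each true pair together while pushing apart every mismatched pairing in the batch, so well-aligned embeddings score lower than shuffled ones.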
Explore 1 awesome GitHub repository matching artificial intelligence & ml · Contrastive Learning Models.
CLIP is a neural network architecture designed to map visual and textual data into a shared latent vector space. Using Transformer-based feature extraction and multi-modal tokenization, the system aligns images with natural-language strings, enabling cross-modal similarity analysis and semantic classification. The project functions as a zero-shot classification engine: it identifies image content by computing the cosine similarity between an image's features and the embeddings of arbitrary text labels, without requiring task-specific retraining. Beyond inference, it serves as a research toolkit for evaluating…
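The zero-shot classification step described above reduces to a nearest-label search by cosine similarity. This is a minimal NumPy sketch, assuming the image feature and label embeddings have already been produced by CLIP-style encoders in the same shared space. `zero_shot_classify`, the `scale` value of 100.0, and the toy vectors in the test are hypothetical and do not reflect the repository's actual API.

```python
import numpy as np


def zero_shot_classify(
    image_feature: np.ndarray,
    label_features: np.ndarray,
    labels: list[str],
    scale: float = 100.0,
) -> tuple[str, np.ndarray]:
    """Hypothetical helper: return the best-matching label and per-label scores.

    Assumes image_feature has shape (d,) and label_features has shape (k, d),
    with both already embedded in the same shared vector space.
    """
    # Normalize so that dot products are cosine similarities.
    img = image_feature / np.linalg.norm(image_feature)
    txt = label_features / np.linalg.norm(label_features, axis=1, keepdims=True)

    # One cosine similarity per candidate label.
    sims = txt @ img

    # Scaled softmax turns similarities into scores that sum to 1.
    scaled = scale * sims
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    return labels[int(np.argmax(probs))], probs
```

Because the label set is just a list of strings embedded at inference time, swapping in new categories requires re-embedding the text, not retraining the model.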
Maps visual and textual data into a shared vector space by maximizing the similarity of paired samples during training while minimizing it for mismatched pairings within the same batch.
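Mapping two modalities into one shared vector space is often done with a separate linear projection per modality followed by L2 normalization, so that image and text vectors of different native widths become directly comparable. The class below is a hypothetical NumPy sketch of that idea: `SharedSpaceProjector`, its dimensions, and its random initialization are illustrative assumptions, and in a real system the projection weights would be learned jointly with the encoders by the contrastive objective.

```python
import numpy as np


class SharedSpaceProjector:
    """Hypothetical per-modality projection heads into one shared space."""

    def __init__(self, image_dim: int, text_dim: int, shared_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Randomly initialized here; training would learn these weights.
        self.w_image = rng.normal(scale=image_dim ** -0.5, size=(image_dim, shared_dim))
        self.w_text = rng.normal(scale=text_dim ** -0.5, size=(text_dim, shared_dim))

    @staticmethod
    def _normalize(x: np.ndarray) -> np.ndarray:
        # Unit-length rows make dot products equal to cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def embed_images(self, feats: np.ndarray) -> np.ndarray:
        # (N, image_dim) -> (N, shared_dim)
        return self._normalize(feats @ self.w_image)

    def embed_texts(self, feats: np.ndarray) -> np.ndarray:
        # (M, text_dim) -> (M, shared_dim)
        return self._normalize(feats @ self.w_text)

    def similarity(self, image_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
        # Cosine similarity between every image and every text: shape (N, M).
        return self.embed_images(image_feats) @ self.embed_texts(text_feats).T
```

Normalizing after projection keeps every similarity in [-1, 1], which is what allows a single temperature or scale factor to be applied uniformly to the resulting score matrix.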