30 open-source projects similar to facebookresearch/moco, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best MoCo alternative.
This project is a self-supervised contrastive learning framework designed to train deep learning models to learn visual representations from images without using human-provided labels. It provides a system for developing pretrained visual representation models that can be adapted for downstream computer vision tasks. The framework includes tools for semi-supervised image classification, which combines large unlabeled datasets with small labeled sets to improve accuracy. It also features a linear probe evaluation tool to assess the quality of learned image features by training a simple linear
Lightly is a self-supervised learning framework and computer vision data curation tool designed to manage large image datasets and train models on unlabeled data. It functions as a PyTorch vision library and dataset management SDK, providing tools to convert raw images into high-dimensional vectors for similarity search, visualization, and feature extraction. The project implements a variety of self-supervised architectures, including MoCo, SimCLR, VICReg, Barlow Twins, and masked image modeling. It distinguishes itself by combining these learning frameworks with active learning capabilities,
DINOv2 is a self-supervised vision transformer foundation model designed to generate high-quality visual representations from raw image data. By leveraging large-scale unlabelled datasets, the framework learns to extract robust numerical embeddings that serve as inputs for various machine learning and analysis workflows. The model distinguishes itself through a teacher-student training framework that utilizes centered and sharpened soft probability distributions to align feature maps across multiple image crops. It incorporates a masking strategy that forces the model to reconstruct missing i
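The centering and sharpening mentioned in the DINOv2 entry above can be illustrated with a small sketch. It is a generic teacher-student example in plain Python, not DINOv2's actual code: the function names, the temperature values (0.04 and 0.1), and the momentum value are assumptions chosen for illustration.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def teacher_targets(teacher_logits, center, teacher_temp=0.04):
    """Center the teacher logits, then sharpen them with a low temperature."""
    centered = [x - c for x, c in zip(teacher_logits, center)]
    return [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]

def distillation_loss(targets, student_logits, student_temp=0.1):
    """Cross-entropy between the teacher targets and the student distribution."""
    log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

def update_center(center, batch_teacher_logits, momentum=0.9):
    """Move the running center toward the mean teacher logits of a batch."""
    n = len(batch_teacher_logits)
    batch_mean = [sum(col) / n for col in zip(*batch_teacher_logits)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

# Toy values (hypothetical): two crops of one image, three output dimensions.
center = [0.0, 0.0, 0.0]
teacher_logits = [0.2, 0.5, 0.1]
student_logits = [0.1, 0.6, 0.0]

targets = teacher_targets(teacher_logits, center)
loss = distillation_loss(targets, student_logits)
center = update_center(center, [teacher_logits])
```

Centering keeps any single output dimension from dominating, while the low teacher temperature sharpens the targets; the two are usually described as working together to avoid collapsed solutions.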
This project is a PyTorch vision transformer framework designed for self-supervised learning. It implements a model that trains visual representations using a momentum teacher and self-distillation without the need for labeled data. The library functions as an image feature extractor and visual attention visualizer, allowing for the generation of high-dimensional vectors and the rendering of self-attention maps as heatmaps or videos to analyze model focus. It provides comprehensive tools for downstream vision evaluation, including linear probe classification, k-nearest neighbor categorizatio
This is a PyTorch self-supervised learning framework designed to train models that learn visual representations from video. It implements a joint-embedding predictive architecture that extracts spatio-temporal features by predicting missing regions of a signal within a latent representation space rather than reconstructing raw pixels. The project includes a latent space visualization tool that uses a conditional diffusion model to decode feature-space predictions back into pixels. This allows for the verification of learned representations by transforming abstract predictions into interpretab
Instructor-embedding is a natural language processing framework designed to transform unstructured text into high-dimensional numerical vectors. By utilizing a transformer-based encoder architecture, the system facilitates semantic retrieval, data classification, and similarity analysis across large datasets. The framework distinguishes itself through instruction-conditioned vector projection, which incorporates natural language instructions directly into the embedding process to improve performance for specific tasks without requiring additional training. It functions as a contrastive learni
This is a PyTorch implementation of a text-to-image model designed for synthesizing high-fidelity images from natural language descriptions. It utilizes a diffusion image generator to transform latent embeddings into visual data through an iterative denoising process. The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
This project is a deep learning research toolkit and generative model library providing implementations of Variational Autoencoders using the PyTorch framework. It serves as a framework for training and evaluating autoencoder architectures to learn latent representations for data reconstruction and the generation of synthetic data samples. The toolkit focuses on unsupervised feature learning and generative model training, featuring a system for mapping external configuration files to model hyperparameters to ensure reproducible experimental runs. It includes mechanisms for tracking training p
CV-Backbones is a computer vision backbone library and model zoo providing a collection of pre-defined neural network architectures for extracting visual features and processing image data. It serves as a PyTorch vision framework of reusable deep learning components designed for image analysis and visual representation learning. The library focuses on efficient neural network architectures to reduce computational overhead while maintaining feature extraction performance. This is achieved through the implementation of lightweight model designs such as GhostNet and MLP. The project covers a br
CLIP is a neural network architecture designed to map visual and textual data into a shared latent vector space. By utilizing transformer-based feature extraction and multi-modal tokenization, the system aligns images and natural language strings, enabling cross-modal similarity analysis and semantic classification. The project functions as a zero-shot classification engine, identifying image content by calculating the cosine similarity between visual features and arbitrary text labels without requiring task-specific retraining. Beyond inference, it serves as a research toolkit for evaluating
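The zero-shot step described in the CLIP entry above, scoring an image against arbitrary text labels by cosine similarity, can be sketched as follows. The embeddings here are made-up toy vectors and the function names are hypothetical; in practice the vectors would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best-matching label and the score for every label."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy embeddings (assumed values, not real encoder outputs).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
```

Because the labels are ordinary text, new classes can be added by embedding new strings, with no retraining of the classifier.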
ImageBind is a multi-modal embedding model and joint representation learner that maps images, text, audio, and other modalities into a single shared vector space. It functions as a cross-modal retrieval framework designed to bind multiple sensory inputs into one cohesive mathematical embedding. The system uses a contrastive learning architecture to align disparate data types by maximizing the similarity between related samples. This allows the model to perform zero-shot multimodal classification and execute cross-modal data retrieval, such as locating visual content via natural language descr
This project is a self-supervised vision foundation model based on a vision transformer architecture. It is designed to learn dense visual representations from unlabeled images, serving as a general-purpose backbone for a wide variety of downstream vision tasks. The system is distinguished by its use of self-distillation and masked image modeling to extract semantic and geometric features. It also incorporates an image-text alignment model that maps visual embeddings to textual descriptions, enabling zero-shot image recognition, zero-shot segmentation, and cross-modal retrieval. The project
This project is a framework for training and deploying transformer-based models that map text, images, audio, and video into dense or sparse vector representations. It functions as a multimodal embedding library and semantic search engine used to retrieve relevant documents by calculating vector similarity between meanings. The framework provides specialized tools for both cross-encoder reranking, which calculates precise similarity scores to refine search results, and vector quantization to compress embedding vectors for reduced memory usage and increased retrieval speed. The project covers
This project is a comprehensive machine learning interview guide and technical study resource designed for individuals preparing for machine learning and AI engineering roles. It provides a collection of materials and practice problems covering core algorithms, theoretical fundamentals, and the implementation of neural network architectures. The resource serves as a technical reference for generative AI development, focusing on the design and optimization of large language models and diffusion systems. It includes frameworks for system design, covering the architecture of production machine l
GroundingDINO is a deep learning vision model and open-vocabulary object detector designed to map natural language prompts to spatial coordinates. It functions as a text-to-bounding-box framework that enables zero-shot image localization, allowing the system to identify and locate arbitrary objects without requiring predefined classes or specific training for those categories. The project distinguishes itself by matching visual features to natural language descriptions to achieve open-set visual recognition. It supports text-guided image localization and the isolation of specific objects base
Chinese-CLIP is a multimodal framework and vision-language model designed for cross-modal retrieval and representation generation using Chinese text and images. It employs a contrastive learning architecture to map visual and textual data into a shared vector space for similarity calculations. The system enables bidirectional search, allowing for text-to-image and image-to-text retrieval. It also provides zero-shot image classification, which identifies objects within images without requiring task-specific training. The project includes tools for fine-tuning pre-trained models on specialized
This is a PyTorch library and framework for self-supervised vision learning. It provides an implementation of masked autoencoders and vision transformers designed to learn image representations by reconstructing masked image patches from unlabeled data. The project features a distributed training pipeline that scales workloads across multiple GPU nodes. This infrastructure includes multi-node orchestration and gradient accumulation to manage large batch sizes and coordinate resource requests across clusters. The toolkit covers a complete workflow from self-supervised masked pre-training to d
vjepa2 is a joint-embedding predictive architecture and video self-supervised learning framework. It functions as a visual representation learner and a robotic manipulation model designed to learn representations by predicting future latent states without reconstructing pixels. The system enables the pretraining of video encoders that learn temporally consistent features through masked-token prediction and multi-modal tokenization. It further maps these latent embeddings to specific physical movements via action-conditioned post-training to plan and execute robot arm grasping and picking task
This project is a collection of deep learning research implementations and a reproduction kit designed to translate theoretical AI papers into working code. It provides a library of neural network architectures and reference implementations for reproducing seminal research concepts through interactive notebooks. The repository distinguishes itself through the implementation of AI theory and scaling laws, covering complexity dynamics, information theory, and the simulation of universal AI agents. It also includes a benchmarking suite for synthetic reasoning, allowing for the evaluation of mode
Torch7 is a scientific computing environment and tensor computation library used for deep learning research and numerical analysis. It functions as a Lua-based framework for training neural networks and learning agents, providing a toolkit for implementing architectures and training through reinforcement learning algorithms. The project is distinguished by its tight integration with C, utilizing a binding layer to map high-level scripting to low-level C structures for direct memory access. It supports hardware-accelerated computation by offloading linear algebra and convolution operations to
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
GraphSAGE is a graph neural network framework designed for inductive representation learning on large-scale graphs. It functions as an inductive graph embedding tool and neighborhood aggregation engine, enabling the generation of numerical node representations that generalize to previously unseen data. The system distinguishes itself by computing node embeddings through the aggregation of features from local neighborhoods rather than relying on a global lookup table. This approach allows the framework to operate as both a supervised graph classifier for predicting categorical node classes and
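The neighborhood aggregation described in the GraphSAGE entry above can be sketched with a mean aggregator. This is a simplified illustration in plain Python, not the project's code: the function name and the toy graph are assumptions, and the learned weight matrix and nonlinearity that would normally follow the concatenation are left out.

```python
def mean_aggregate(node, features, neighbors):
    """Concatenate a node's own features with the mean of its neighbors' features."""
    own = features[node]
    neighbor_ids = neighbors.get(node, [])
    if neighbor_ids:
        neighbor_mean = [
            sum(features[n][i] for n in neighbor_ids) / len(neighbor_ids)
            for i in range(len(own))
        ]
    else:
        # Isolated node: fall back to a zero vector for the neighborhood part.
        neighbor_mean = [0.0] * len(own)
    return own + neighbor_mean

# Toy graph (hypothetical): node features and adjacency lists.
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [4.0, 2.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

embedding_a = mean_aggregate("a", features, neighbors)

# A node added after the graph was built needs only its features and edges,
# not an entry in a per-node lookup table.
features["d"] = [3.0, 3.0]
neighbors["d"] = ["a"]
embedding_d = mean_aggregate("d", features, neighbors)
```

Since the embedding is a function of features rather than a stored row per node, the same function applies to nodes that were not present when the function was defined.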
Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with self-attention computed within local windows. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales. The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
This project is a collection of deep learning tutorials and practical implementations using TensorFlow. It provides a neural network implementation guide through code examples designed for research-oriented deep learning. The repository covers supervised and unsupervised learning workflows, including the development of sequence models for language processing and chatbots. It includes specific examples for image style transfer and the use of autoencoders for feature extraction. The project also provides demonstrations for managing large-scale datasets using binary record formats and streaming
This project provides a collection of practical machine learning code examples, including implementations for supervised, unsupervised, and reinforcement learning algorithms. It features deep learning model implementations for convolutional, recurrent, and generative architectures, alongside specific examples of reinforcement learning agents that maximize rewards in simulated environments. The repository includes dedicated data preprocessing pipelines for sanitization, feature scaling, and dimensionality reduction. It also provides implementations for a wide range of specific models, such as
This repository is a collection of implementation references and solved notebooks covering supervised, unsupervised, and reinforcement learning techniques. It provides practical guides for building predictive models, clustering algorithms, and autonomous agents. The project includes specific implementations for neural network architectures, such as multi-layer perceptrons for digit recognition, and recommender systems using collaborative and content-based filtering. It also features reinforcement learning systems that utilize deep Q-learning to optimize decision-making policies. The codebase
Detectron2 is a PyTorch computer vision framework and visual recognition platform designed for training and deploying models for object detection, image segmentation, and related recognition tasks. It provides a research-oriented environment for training complex vision models with multi-GPU acceleration. The project includes a specialized object detection library for identifying and locating multiple objects via bounding boxes, as well as an image segmentation toolkit for creating pixel-level masks through instance, semantic, and panoptic segmentation. Additionally, it features a human pose estimati
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
This project is a comprehensive collection of machine learning educational resources, featuring a Python-based curriculum, study guides for deep learning, and a specialized knowledge base for machine learning operations. It provides structured learning paths that guide users from foundational programming through to advanced neural network implementations. The repository focuses on interactive learning by providing a directory of executable notebooks and cloud-hosted experiments. It maps theoretical research papers and textbooks to practical code implementations and maintains a curated directo
This is a comprehensive educational curriculum designed to teach machine learning fundamentals using the Python programming language. It provides a structured course covering the implementation and theory of supervised learning, unsupervised learning, and deep learning. The curriculum is delivered through interactive notebooks that combine executable code with technical tutorials. It includes dedicated guides for building neural network architectures, implementing classification and regression models, and utilizing clustering techniques for pattern discovery in unlabeled data. The materials