26 open-source projects similar to msraig/self-augmented-net, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Self Augmented Net alternative.
DeepLake is AI data infrastructure consisting of a multimodal data lake, a hybrid search engine, and a serverless vector database. It provides a PostgreSQL-based AI data runtime that combines multimodal storage with streaming pipelines to load and shuffle datasets from cloud storage directly into deep learning training pipelines. The system utilizes lazy indexing to store and slice images, audio, and video without loading entire files into memory. It enables retrieval-augmented generation by persisting high-dimensional embeddings in a serverless vector store and implementing hybrid search tha
Image composition toolbox: everything you want to know about image composition/compositing or object/subject insertion/addition/compositing.
Super-Gradients is a PyTorch computer vision framework and training library designed for the full lifecycle of vision models. It functions as a deep learning model optimizer and a deployment toolkit for training and fine-tuning models across image classification, object detection, semantic segmentation, and pose estimation tasks. The project provides specific tools for model optimization, including teacher-student knowledge distillation and numerical precision compression to reduce memory and computational requirements. It also includes the implementation of the Yolo-NAS architecture for high
Detectron2 is a PyTorch computer vision framework and visual recognition platform designed for training and deploying models for object detection, image segmentation, and visual recognition. It provides a research-oriented environment for training complex vision models with multi-GPU acceleration. The project includes a specialized object detection library for identifying and locating multiple objects via bounding boxes, as well as an image segmentation toolkit for creating pixel-level masks through instance, semantic, and panoptic segmentation. Additionally, it features a human pose estimati
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches
Industry-strength Computer Vision workflows with Keras
Kornia is a differentiable computer vision library and cross-framework tensor vision toolset. It implements vision operations as differentiable tensor operations so they can be integrated into deep learning pipelines, and it supports the transpilation of operations across PyTorch, TensorFlow, JAX, and NumPy. The project provides specialized toolsets for geometric vision and stereo depth, including algorithms for 3D scene reconstruction, camera calibration, and pose estimation. It further distinguishes itself as a differentiable image augmentation framework, applying random geometric and color transformations w
All-in-one training for vision models (YOLO, ViTs, RT-DETR, DINOv3): pretraining, fine-tuning, distillation.
This project is a deep learning style transfer framework designed to apply artistic styles to photographs. It functions as a photorealistic image stylizer that merges the content of one image with the visual characteristics of another while maintaining the original geometry and structural details. The system distinguishes itself through the use of matting Laplacian matrices and semantic segmentation masks to prevent distortion and preserve edge fidelity. These capabilities allow for region-specific styling, where different aesthetics can be applied to distinct objects or areas within a single
Microsoft AI for Good Lab — Biodiversity research hub. Open-source AI models, edge devices, and tools for biodiversity monitoring and conservation. Your source for MegaDetector, SPARROW, PytorchWildlife, Bioacoustics, and more.
VideoSys: An easy and efficient system for video generation
SegFormer is a semantic segmentation framework and transformer-based model designed for pixel-level image classification. It assigns a class label to each pixel using a hierarchical transformer encoder and a multi-layer perceptron decoder. The encoder processes multi-scale features through a pyramid of blocks, and the all-MLP decoder aggregates these features without complex attention mechanisms. It incorporates overlap patch embedding to preserve local continuity and sequence reduction in self-attention to ma
mmcv is a foundation library for computer vision based on PyTorch. It provides a comprehensive system for constructing convolutional neural networks, a toolkit for image and video preprocessing, and a collection of high-performance deep learning vision operators. The project is distinguished by its hardware-accelerated kernels for complex operations such as deformable convolutions and region pooling. It features a configuration-driven framework that allows for the dynamic instantiation of network layers and the registration of custom modules without modifying code. The library covers a broad
CLIP is a neural network architecture designed to map visual and textual data into a shared latent vector space. By utilizing transformer-based feature extraction and multi-modal tokenization, the system aligns images and natural language strings, enabling cross-modal similarity analysis and semantic classification. The project functions as a zero-shot classification engine, identifying image content by calculating the cosine similarity between visual features and arbitrary text labels without requiring task-specific retraining. Beyond inference, it serves as a research toolkit for evaluating
Automatic 2D-to-3D Video Conversion with CNNs
Supervision is a computer vision toolset for normalizing model outputs, managing datasets, and visualizing annotations. It provides a framework to convert predictions from various classification and detection models into a standardized data format to ensure interoperability across different computer vision pipelines. The library features a post-processor for filtering, counting, and tracking detected objects across image frames and video streams. It includes capabilities for large image tiling to improve the detection of small objects and tools for assigning persistent identities to objects t
LAVIS is a multimodal large language model framework and vision-language model library. It provides tools for training and evaluating models that integrate visual, textual, and audio data, serving as a cross-modal feature extractor and a zero-shot visual reasoning engine. The framework distinguishes itself by using frozen-backbone integration, where pretrained encoders remain non-trainable while lightweight adapter layers are updated. It employs cross-modal feature alignment to map different representations into a shared embedding space and utilizes a modular model wrapper to swap vision and
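The CLIP entry above describes zero-shot classification as picking the text label whose embedding has the highest cosine similarity to the image embedding. A minimal sketch of that scoring step follows, in plain Python. The three-dimensional vectors are made-up toy values standing in for real encoder outputs, and `cosine_similarity` and `zero_shot_classify` are hypothetical helper names, not part of CLIP's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best-matching label and the score for every label.

    label_embeddings maps a text prompt to its embedding vector.
    No training happens here: adding a class only means adding a prompt.
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumed values; a real system would get these from
# a text encoder and an image encoder that share one vector space).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Because the labels are ordinary text prompts, the candidate set can be changed at inference time without retraining, which is the property the entry refers to as zero-shot.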