Corenet

Corenet is a deep learning training framework and computer vision model library designed for developing neural networks across vision, text, and audio modalities. It functions as a distributed training orchestrator for scaling workloads across multiple compute nodes and provides a multimodal data pipeline for processing image, text, and video data.

The project includes a model conversion toolkit for transforming weights and architectures between different machine learning frameworks. It also provides tools for optimizing model performance on Apple Silicon and reducing response latency in generative models.

The framework covers a broad range of capabilities, including visual recognition tasks such as object detection, semantic segmentation, and image classification. It supports advanced training techniques such as parameter-efficient fine-tuning, contrastive language-image pre-training, and structural reparameterization.

Training and evaluation pipelines are managed through YAML-based configuration files and recipes to ensure reproducibility across environments.
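YAML recipes are nested mappings of sections and options. The sketch below assumes the YAML has already been parsed into a Python dict, for example with a YAML loader. It flattens nested keys into dotted option names. The section and key names (`common`, `model`, `optim`, `example_net`) are illustrative and not taken from a specific CoreNet recipe.

```python
from typing import Any, Dict


def flatten_config(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config mapping into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections and merge their flattened keys.
            flat.update(flatten_config(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Hypothetical recipe, as it might look after parsing the YAML file.
recipe = {
    "common": {"seed": 0},
    "model": {"classification": {"name": "example_net", "n_classes": 1000}},
    "optim": {"name": "adamw", "lr": 0.001},
}

print(flatten_config(recipe))
```

Flat dotted keys make it straightforward to override a single setting or to diff the configurations of two runs.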

Features

Computer Vision Models - Provides a comprehensive collection of neural network architectures for image classification, object detection, and semantic segmentation.

Object Detection - Identifies and locates multiple objects within images by predicting bounding boxes and class labels.
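Detectors usually emit many overlapping candidate boxes. These are typically filtered with non-maximum suppression (NMS) based on intersection-over-union (IoU). The sketch below shows this generic post-processing step in pure Python. It is not CoreNet's detection code, and the `(x_min, y_min, x_max, y_max)` box layout is an assumption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any already-kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed. The third box does not overlap either of the others, so it is kept.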

Image Segmentation - Delineates object boundaries within images using a combination of backbone networks and segmentation heads.

Computer Vision Training - Provides standardized training routines and scripts for image-based neural network architectures.

Data-Parallel Training - Scales training across multiple compute nodes and GPUs using distributed data parallelism.
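In data-parallel training, each process works on its own slice of every epoch. Gradients are then averaged across processes after each backward pass. The single-process sketch below simulates both steps. It resembles the behavior of PyTorch's `DistributedSampler` (without shuffling) and DDP's gradient all-reduce, but it is not their implementation. The function names are hypothetical.

```python
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the dataset indices assigned to one data-parallel worker.

    Indices wrap around so every rank gets the same number of samples,
    then they are dealt out round-robin: rank, rank + world_size, ...
    """
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    padded = [i % num_samples for i in range(total)]
    return padded[rank::world_size]


def average_gradients(per_rank_grads: List[List[float]]) -> List[float]:
    """Simulate an all-reduce mean over the gradient vectors computed by each rank."""
    n = len(per_rank_grads)
    return [sum(vals) / n for vals in zip(*per_rank_grads)]


for rank in range(4):
    print(rank, shard_indices(num_samples=10, world_size=4, rank=rank))
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))
```

Padding keeps every rank on the same number of steps. Without it, ranks would wait on each other at the final gradient synchronization.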

Distributed Training - Implements tools for configuring data and model parallelism to train large neural networks across multiple devices.

Distributed Training Orchestrators - Orchestrates the scaling of model training across multiple compute nodes using DDP, FSDP, and Slurm.

Large-Scale Model Training - Facilitates the development and training of both small and large-scale models for various computer vision and language tasks.

Large-Scale Training Frameworks - Provides infrastructure and orchestration tools for scaling neural network training across massive compute clusters.

Vision Model Training - Provides a framework for training deep neural networks for image classification and detection using configurable recipes.

Modular Backbone Architectures - Implements a modular design that decouples feature extraction backbones from task-specific prediction heads.
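Decoupling the backbone from the head means any feature extractor can be paired with any task head, as long as the feature shapes agree. The toy sketch below illustrates the composition pattern. `Backbone`, `ClassificationHead`, and `Model` are hypothetical names, not CoreNet classes, and the "features" are deliberately trivial.

```python
from typing import Callable, List

Features = List[float]


class Backbone:
    """Toy feature extractor: maps an input vector to a feature vector."""

    def __call__(self, x: List[float]) -> Features:
        # Illustrative features: the mean and the max of the input.
        return [sum(x) / len(x), max(x)]


class ClassificationHead:
    """Toy task head: one linear score per class over the backbone features."""

    def __init__(self, weights: List[List[float]]) -> None:
        self.weights = weights  # one weight row per class

    def __call__(self, feats: Features) -> int:
        scores = [sum(w * f for w, f in zip(row, feats)) for row in self.weights]
        return max(range(len(scores)), key=scores.__getitem__)


class Model:
    """Composes any backbone with any head; either part can be swapped independently."""

    def __init__(self, backbone: Callable[[List[float]], Features],
                 head: Callable[[Features], int]) -> None:
        self.backbone = backbone
        self.head = head

    def __call__(self, x: List[float]) -> int:
        return self.head(self.backbone(x))


model = Model(Backbone(), ClassificationHead([[1.0, 0.0], [0.0, 1.0]]))
print(model([1.0, 2.0, 6.0]))
```

To move to another task, for example segmentation, only the head is replaced. The backbone and its pretrained weights stay the same.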

Multi-Modal Tokenizers - Converts raw audio and image bytes into specialized tensors for neural network processing.
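One simple way to turn raw file bytes into model input is to use each byte value (0 to 255) directly as a token id. The sketch below shows that scheme. It is an assumption for illustration: the actual tokenizers may differ, for example in how they pad, group bytes, or reserve special ids. The function name and the padding id are hypothetical.

```python
from typing import List


def bytes_to_tokens(data: bytes, max_len: int, pad_id: int = 256) -> List[int]:
    """Map raw file bytes to integer token ids.

    Each byte value (0-255) is used directly as a token id, the sequence is
    truncated to max_len, and shorter sequences are right-padded with pad_id,
    which lies outside the byte range so it cannot collide with real data.
    """
    tokens = list(data[:max_len])
    return tokens + [pad_id] * (max_len - len(tokens))


# The first bytes of a PNG file header.
print(bytes_to_tokens(b"\x89PNG", max_len=6))
```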

Neural Network Training - Supports the core process of optimizing neural network weights for classification, detection, and segmentation across multiple GPUs.

Model Format Converters - Provides utilities that translate model weights and architectures between different framework formats.

Model Conversion Toolkits - Provides tools for transforming model weights and architectures between different machine learning frameworks.

Semantic Segmentation - Implements pixel-level image classification by attaching specialized segmentation heads to neural networks.
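A segmentation head produces one score per class at every pixel. The predicted label map is the per-pixel argmax, and mean intersection-over-union (mIoU) is a standard way to score it. The sketch below is generic pure Python and does not reproduce CoreNet's evaluation code. It skips classes that are absent from both maps.

```python
from typing import List

Grid = List[List[int]]


def per_pixel_argmax(logits: List[List[List[float]]]) -> Grid:
    """Turn per-pixel class scores (H x W x C) into a label map (H x W)."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in logits]


def mean_iou(pred: Grid, target: Grid, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for prow, trow in zip(pred, target):
            for p, t in zip(prow, trow):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


logits = [[[2.0, 1.0], [0.1, 0.9]], [[0.3, 0.7], [0.8, 0.2]]]
pred = per_pixel_argmax(logits)
target = [[0, 1], [0, 0]]
print(pred, mean_iou(pred, target, num_classes=2))
```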

Multimodal Data Handlers - Loads, samples, and transforms image, text, and video data using specialized multimodal tokenizers and readers.

Training Orchestrators - Provides systems for coordinating distributed training jobs across worker nodes.

Deep Learning Frameworks - Functions as a deep learning training framework for vision, text, and audio modalities that scales to distributed GPU clusters.

Structural Reparameterizations - Merges the parallel branches used during training into single streamlined layers for inference, reducing inference latency without changing the network's outputs.
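Convolution is linear. That means parallel branches whose outputs are summed during training can be folded into one kernel for inference. The 1-D sketch below merges a 3-tap branch, a 1-tap branch, and an identity branch into a single 3-tap kernel, then checks that both forms give the same output. It omits the batch-normalization folding that practical reparameterized blocks also perform, and the function names are hypothetical.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """1-D cross-correlation with zero padding so the output length equals len(x).

    Assumes an odd kernel length so the kernel has a well-defined center.
    """
    r = len(kernel) // 2
    padded = [0.0] * r + x + [0.0] * r
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]


def reparameterize(k3: List[float], k1: float) -> List[float]:
    """Fold a 3-tap branch, a 1-tap branch, and an identity branch into one 3-tap kernel.

    Summing branch outputs equals convolving with the summed kernels; the 1-tap
    and identity branches only contribute to the center tap.
    """
    return [k3[0], k3[1] + k1 + 1.0, k3[2]]


x = [1.0, -2.0, 3.0, 0.5]
k3, k1 = [0.2, -0.5, 0.3], 0.7

# Training-time form: three parallel branches whose outputs are summed.
multi_branch = [a + b + c for a, b, c in zip(conv1d_same(x, k3), conv1d_same(x, [k1]), x)]
# Inference-time form: a single convolution with the merged kernel.
merged = conv1d_same(x, reparameterize(k3, k1))
print(all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, merged)))
```

The merged network runs one convolution instead of three and skips the extra additions. Its outputs match the multi-branch form up to floating-point rounding.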

Classification Training - Provides workflows for training image classification models using efficient backbones.

Inference Latency Optimizations - Reduces response latency (time to first token) in generative models by using a smaller auxiliary transformer to predict the key-value (KV) cache of a larger base model.

Instruction Tuning - Fine-tunes pretrained models on instruction and response examples so that they follow user prompts.
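A common practice in instruction tuning, though not a universal one, is to compute the training loss only on response tokens. Prompt positions are masked out with an ignore label. The sketch below builds such labels. The `-100` value matches the default `ignore_index` of PyTorch's `torch.nn.CrossEntropyLoss`, while `build_labels` is a hypothetical helper. The one-position shift used for next-token prediction is assumed to happen later, inside the loss computation.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # matches the default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids: List[int], response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; mask prompt positions so only the response is learned."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


print(build_labels([11, 12, 13], [21, 22]))
```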

Vision Transformer Pre-training - Implements methods for training image-based transformer models using self-supervised masked modeling techniques.
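Masked image modeling hides a random subset of image patches and trains the model to reconstruct or predict them. The sketch below covers only the masking step. The 75% ratio is the default popularized by masked autoencoders (MAE) and is used here only as an example. The function name and seed handling are hypothetical.

```python
import random
from typing import List, Tuple


def random_patch_mask(num_patches: int, mask_ratio: float,
                      seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) sets for masked image modeling."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75)
print(len(visible), len(masked))
```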

Contrastive Pre-training - Implements contrastive language-image pre-training using Vision Transformers and multi-scale variable batch sampling.

Custom Neural Architectures - Implements tailored neural network structures built from low-level components for specific tasks.

Training Configuration Management - Defines model types, datasets, and hyperparameters in YAML files to ensure reproducible training.

Model Evaluation Metrics - Implements tools for measuring the performance and quality of trained machine learning models.
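Top-k accuracy is one of the most common classification metrics. The sketch below shows it in pure Python. It is a generic definition, not CoreNet's metric implementation.

```python
from typing import List


def top_k_accuracy(scores: List[List[float]], labels: List[int], k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        hits += int(label in top_k)
    return hits / len(labels)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```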

Hardware-Specific Model Optimizations - Optimizes model performance for Apple Silicon to increase processing speed and reduce resource consumption.

Efficient Backbone Training - Provides tools for training efficient vision backbones using architecture search and configurable recipes.

Mobile Backbone Training - Implements training workflows for efficient mobile backbones to optimize the tradeoff between accuracy and inference latency.

Model Architectures - Provides a framework for defining and registering custom neural network components.
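A common way to let configuration files refer to components by name is a registry filled in by a decorator. The sketch below shows this generic pattern. `register_model`, `build_model`, and `TinyClassifier` are hypothetical names, not CoreNet's actual registry API.

```python
from typing import Callable, Dict, Type

MODEL_REGISTRY: Dict[str, Type] = {}


def register_model(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a model class under a string name."""
    def decorator(cls: Type) -> Type:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


def build_model(name: str, **kwargs) -> object:
    """Instantiate a registered model by name, e.g. a name read from a config file."""
    try:
        cls = MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown model '{name}'; registered: {sorted(MODEL_REGISTRY)}") from None
    return cls(**kwargs)


@register_model("tiny_classifier")
class TinyClassifier:
    def __init__(self, n_classes: int = 10) -> None:
        self.n_classes = n_classes


model = build_model("tiny_classifier", n_classes=5)
print(type(model).__name__, model.n_classes)
```

With this pattern, adding a new architecture only requires decorating its class. The code that builds models from configuration stays unchanged.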

Hybrid Transformer Training - Provides specialized training for hybrid vision transformers that balance accuracy and latency via structural reparameterization.

Vision Transformer Training - Supports training lightweight computer vision models by combining convolutional networks and transformer architectures.

Multimodal Input Processors - Includes systems capable of ingesting and processing diverse data types including text, vision, and audio.

Language Model Pretraining - Implements methods for training language models on large text corpora before downstream task fine-tuning.

Neural Network Implementations - Provides core implementations of neural network architectures and training pipelines built from scratch.

Parameter Efficient Fine-Tuning - Adapts large pretrained models to new tasks by updating only a small subset of network weights.
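The simplest form of this idea freezes every parameter except a small selected subset. The sketch below does exactly that, using a hypothetical task head as the trainable subset and plain SGD. Methods such as LoRA work differently: they add small trainable low-rank matrices alongside the frozen weights. This sketch does not show LoRA. All parameter names are hypothetical.

```python
from typing import Callable, Dict, List

Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, lr: float,
             trainable: Callable[[str], bool]) -> Params:
    """One SGD step that updates only parameters selected by `trainable`; the rest stay frozen."""
    return {
        name: [w - lr * g for w, g in zip(values, grads[name])] if trainable(name) else list(values)
        for name, values in params.items()
    }


def trainable_fraction(params: Params, trainable: Callable[[str], bool]) -> float:
    """Share of all scalar parameters that will actually be updated."""
    total = sum(len(v) for v in params.values())
    tuned = sum(len(v) for name, v in params.items() if trainable(name))
    return tuned / total


def only_head(name: str) -> bool:
    return name.startswith("head.")


# Hypothetical parameters: a large frozen backbone and a small task head.
params = {
    "backbone.weight": [0.5] * 96,
    "backbone.bias": [0.0] * 8,
    "head.weight": [0.1] * 4,
}
grads = {name: [1.0] * len(v) for name, v in params.items()}

updated = sgd_step(params, grads, lr=0.1, trainable=only_head)
print(trainable_fraction(params, only_head))
```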

Parameter-Efficient Training Toolkits - Provides frameworks that enable model fine-tuning by updating only a small subset of parameters.

Recognition Accuracy Evaluation - Provides frameworks for measuring the performance and reliability of object recognition models.

Visual Recognition Classifiers - Implements deep learning architectures for identifying and classifying objects in visual data.

Batch Size Scaling - Varies the batch size with input resolution during contrastive learning (multi-scale variable batch sampling) to balance memory usage and gradient stability.
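One way to pair batch sizes with resolutions is to scale the batch size inversely with the number of pixels per image. Low-resolution batches then hold more images and high-resolution batches fewer, which keeps per-step memory roughly constant. The sketch below assumes that rule. The exact rule used by CoreNet's samplers is not given here, and the function name is hypothetical.

```python
from typing import List, Tuple


def variable_batch_sizes(base_res: int, base_batch: int,
                         resolutions: List[int]) -> List[Tuple[int, int]]:
    """Pair each square input resolution with a batch size of roughly constant total pixels.

    batch = base_batch * base_res**2 / res**2, floored and clamped to at least 1.
    """
    return [(res, max(1, base_batch * base_res * base_res // (res * res))) for res in resolutions]


print(variable_batch_sizes(base_res=256, base_batch=128, resolutions=[160, 192, 256, 320]))
```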

Training Recipes - Uses standardized configuration recipes to load pretrained weights and reproduce specific training setups.

Image Classifiers - Implements models that assign images to predefined class labels.

YAML Configuration Files - Uses YAML configuration files to define model hyperparameters and training recipes for reproducible experiments.

Cross-Framework Translation - Provides tools to translate neural network definitions and weights between different machine learning frameworks.
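Weight conversion usually starts by renaming parameters from one framework's naming scheme to another's. Real converters may also transpose or reshape tensors, for example between channel-first and channel-last layouts. The sketch below covers only the renaming step. The regex rules and parameter names are hypothetical and do not describe a specific pair of frameworks.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical renaming rules from a source naming scheme to a target one.
RULES: List[Tuple[str, str]] = [
    (r"^encoder\.layer\.(\d+)\.", r"blocks.\1."),
    (r"\.gamma$", ".weight"),
    (r"\.beta$", ".bias"),
]


def convert_keys(state: Dict[str, object],
                 rules: List[Tuple[str, str]] = RULES) -> Dict[str, object]:
    """Rename every parameter key by applying the regex rules in order; values are untouched."""
    converted: Dict[str, object] = {}
    for key, value in state.items():
        new_key = key
        for pattern, repl in rules:
            new_key = re.sub(pattern, repl, new_key)
        if new_key in converted:
            raise KeyError(f"two source keys map to '{new_key}'")
        converted[new_key] = value
    return converted


src = {"encoder.layer.0.norm.gamma": [1.0], "encoder.layer.0.norm.beta": [0.0]}
print(convert_keys(src))
```

The collision check catches rule sets that would silently overwrite one parameter with another.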

Training and Orchestration - Provides a toolkit for training both small and large-scale neural networks.

apple/corenet
