49 repositories
Specialized workflows for preparing, augmenting, and streaming datasets for model training and feature engineering.
Explore 49 GitHub repositories matching Data & Databases · Machine Learning Data Pipelines.
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Manages training data pipelines that integrate cloud/local storage with synthetic data generation.
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
Structures and generates synthetic training data via visual workflows to improve model learning efficacy.
Scikit-learn is a machine learning library for predictive data analysis that provides a collection of algorithms for supervised and unsupervised learning. It functions as a comprehensive toolkit for data preprocessing, dimensionality reduction, and model selection, allowing users to classify data objects, predict continuous values, and cluster similar items based on historical patterns. The project is defined by a unified interface design where objects either learn from data, transform data, or chain these operations into sequential workflows. To ensure performance on large or high-dimensiona
Extracts and scales features to ensure raw data meets the strict input requirements of machine learning models.
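The unified interface described above, where objects learn from data, transform data, or chain both into a sequential workflow, can be illustrated with a small pure-Python sketch. The class names `StandardScalerSketch` and `PipelineSketch` are hypothetical stand-ins written for this example; they are not scikit-learn's implementation and only mimic the general fit/transform pattern.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Hypothetical transformer: learns per-column mean and std, then rescales."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [mean(col) for col in columns]
        # Fall back to 1.0 so constant columns do not cause division by zero.
        self.stds_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means_, self.stds_)]
            for row in rows
        ]


class PipelineSketch:
    """Hypothetical chain: fits each step in order and feeds its output forward."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return rows


# Two features on very different scales end up on a comparable scale.
raw = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = PipelineSketch([StandardScalerSketch()]).fit_transform(raw)
```

After scaling, each column has zero mean and unit variance, which is the kind of input normalization many models expect.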
Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management. The project distinguishes itself as a multi-backend machine learning
Integrates utilities to load, preprocess, and format diverse data types for efficient training pipelines.
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Integrates image normalization and augmentation into automated data loading workflows for training.
Label Studio is a multi-type data labeling tool and annotation workspace designed to prepare datasets for machine learning training. It functions as a cloud-integrated data pipeline that imports raw data from storage, manages the annotation process, and exports labels to standardized formats. The platform features a machine learning model integration framework that connects to external model servers. This enables model-assisted annotation and active learning, allowing the system to pre-label data and refine predictions based on human feedback. The software provides project management tools for organizing datasets and assigning tasks to users through role-based access. It supports a variety of data types and uses backend-agnostic storage adapters to connect to local file systems or cloud storage providers. The application can be installed through manual setup or one-click deployments on cloud infrastructure.
Organizes and cleans raw data through labeling and formatting to make it compatible with model training pipelines.
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Provides tools for transforming and normalizing raw information into structured formats optimized for machine learning models.
This project is a comprehensive educational resource and technical documentation suite for learning and developing deep learning models. It serves as an open-source textbook, implementation manual, and framework tutorial designed to guide users through the mathematical foundations and practical application of neural networks. The resource provides detailed instructional content on building various model architectures, including convolutional and recurrent neural networks. It includes a dedicated distributed training guide and a learning path that covers the fundamentals of tensors, automatic
Provides structured guides for building training data pipelines to preprocess diverse data types.
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
Streams, resizes, and prefetches training data to ensure high-throughput delivery to models.
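Prefetching, as mentioned above, generally means loading upcoming samples in the background while the current ones are being consumed. A minimal sketch using only the Python standard library follows; `prefetch` and `load_samples` are hypothetical names for this illustration and do not correspond to the framework's own API.

```python
import queue
import threading


def prefetch(source, buffer_size=4):
    """Yield items from `source` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        # Error handling is omitted in this sketch.
        for item in source:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_samples(count):
    """Hypothetical stand-in for reading and resizing samples from storage."""
    for index in range(count):
        yield {"id": index, "size": (224, 224)}


samples = list(prefetch(load_samples(10), buffer_size=2))
```

The bounded queue keeps memory use predictable: the producer stalls once `buffer_size` items are waiting, and the consumer never waits as long as loading keeps pace with training.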
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parall
Cleans and filters large-scale multimodal datasets using accelerated workflows to ensure high-quality training inputs.
Screenpipe is a local-first platform designed to record, index, and analyze desktop activity. By capturing screen, audio, and keyboard input, it creates a comprehensive and searchable history of computer usage. The system functions as an activity recorder and automation framework, providing a persistent, context-aware memory that allows artificial intelligence agents to observe and interact with local desktop environments. The platform distinguishes itself through a privacy-focused architecture that processes all data locally. It utilizes on-device computer vision and speech recognition to tr
Processes and sanitizes desktop activity data into structured datasets suitable for training computer-use models and automating professional workflows.
This library is a collection of machine learning algorithms and neural network components implemented from scratch using only NumPy. It serves as an educational toolkit for constructing and experimenting with machine learning architectures, emphasizing a modular approach where algorithms are organized into self-contained, object-oriented classes. The project distinguishes itself by relying exclusively on array-oriented programming to perform mathematical operations, ensuring that all computations are vectorized for performance. By utilizing a standardized interface for forward and backward pa
Provides utilities for transforming raw signals and text into structured formats for machine learning.
Swift is a toolkit for the full-parameter and parameter-efficient fine-tuning of large language and multimodal models. It functions as a multimodal model trainer for text, image, video, and audio data, and includes specialized tools for model compression and reinforcement learning from human feedback. The framework provides an alignment toolkit for optimizing model behavior using preference learning algorithms and reinforcement learning. It integrates parameter-efficient fine-tuning methods to adapt models with minimal memory and compute requirements, alongside utilities for reducing hardware
Optimizes multimodal training throughput by packing diverse data types into sequences to prevent padding waste.
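Sequence packing can be sketched as a bin-packing problem: several short samples share one fixed-length sequence so fewer positions are spent on padding. The example below uses a greedy first-fit-decreasing heuristic, which is an assumption made for illustration and not necessarily the strategy Swift uses. `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(lengths, max_length):
    """Greedy first-fit-decreasing packing of sequence lengths into bins."""
    bins = []       # each bin is a list of sequence indices
    remaining = []  # free token slots left in each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for index in order:
        length = lengths[index]
        if length > max_length:
            raise ValueError(f"sequence {index} exceeds max_length")
        for bin_id, free in enumerate(remaining):
            if length <= free:
                bins[bin_id].append(index)
                remaining[bin_id] -= length
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([index])
            remaining.append(max_length - length)
    return bins


lengths = [700, 300, 512, 200, 100, 900]
packed = pack_sequences(lengths, max_length=1024)
padding_packed = sum(1024 - sum(lengths[i] for i in b) for b in packed)
padding_unpacked = sum(1024 - n for n in lengths)
```

With these illustrative lengths, six samples fit into three packed sequences. A real trainer would typically also adjust attention masks or position IDs so packed samples do not attend to one another.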
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
Implements user-defined data loading logic for retrieving samples and managing dataset sizes during training.
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Automates the identification and calculation of data features to enhance predictive model performance.
Rerun is a multimodal data visualizer and robotics data logger designed for rendering synchronized streams of 3D spatial data, images, and time-series metrics. It functions as a tool for capturing high-frequency sensor data and AI outputs into a queryable columnar format, providing a dedicated interface for viewing MCAP recording files and analyzing physical environments. The project distinguishes itself as a machine learning dataset streamer, capable of feeding logged recordings directly into GPU buffers and PyTorch training pipelines without intermediate exports. It supports a high-performa
Streams logged recordings directly into PyTorch or GPU buffers to eliminate manual data export steps.
InternVL is a vision-language model framework that fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning. It provides a system for multimodal inference and dialogue, enabling the processing of images and text to answer questions or generate descriptions. The project is distinguished by its high-resolution image processing, which uses dynamic tiling to maintain detail for images up to 4K resolution, and its chain-of-thought visual reasoning for solving complex mathematical and spatial problems. It also supports temporal frame sampling
Implements JSONL-based data formatting to support text, single-image, multi-image, and video inputs for training.
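JSONL stores one JSON object per line, which makes it straightforward to mix samples of different modalities in a single training file. The sketch below shows the general idea with the standard `json` module. The field names (`text`, `image`, `images`, `video`) and file names are hypothetical and are not InternVL's actual schema.

```python
import json

# Hypothetical records: field names and paths are illustrative only.
records = [
    {"id": 0, "text": "Explain gradient descent."},
    {"id": 1, "image": "cat.jpg", "text": "Describe the image."},
    {"id": 2, "images": ["a.jpg", "b.jpg"], "text": "Compare the images."},
    {"id": 3, "video": "clip.mp4", "text": "Summarize the clip."},
]


def to_jsonl(items):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)


def from_jsonl(text):
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def modality(record):
    """Classify a record by which (hypothetical) media field it carries."""
    if "video" in record:
        return "video"
    if "images" in record:
        return "multi-image"
    if "image" in record:
        return "single-image"
    return "text"


restored = from_jsonl(to_jsonl(records))
kinds = [modality(record) for record in restored]
```

Because each line parses independently, a loader can stream such a file sample by sample without reading it all into memory.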
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Loads and formats diverse data types like images and text for training pipelines.
DeepLake is AI data infrastructure consisting of a multimodal data lake, a hybrid search engine, and a serverless vector database. It provides a PostgreSQL-based AI data runtime that combines multimodal storage with streaming pipelines to load and shuffle datasets from cloud storage directly into deep learning training pipelines. The system utilizes lazy indexing to store and slice images, audio, and video without loading entire files into memory. It enables retrieval-augmented generation by persisting high-dimensional embeddings in a serverless vector store and implementing hybrid search tha
Provides pipelines that load, shuffle, and format diverse multimodal data types for deep learning training.
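Shuffling a dataset that is streamed rather than fully loaded is commonly approximated with a fixed-size shuffle buffer. The sketch below shows that general technique in plain Python. It is an assumption for illustration and not a description of how DeepLake shuffles internally, and `shuffle_buffer` is a hypothetical helper.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Approximately shuffle a stream while holding only `buffer_size` items."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Drain whatever is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(20), buffer_size=5))
```

A larger buffer gives a more thorough shuffle at the cost of more memory; every item is still emitted exactly once.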
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Provides a specialized SDK for transforming raw data into formats optimized for machine learning and vector storage.