Why is hiyouga/LLaMA-Factory a recommended Machine Learning Data Pipelines repository?

Manages training data pipelines that integrate cloud/local storage with synthetic data generation.

Why is unslothai/unsloth a recommended Machine Learning Data Pipelines repository?

Structures and generates synthetic training data via visual workflows to improve model learning efficacy.

Why is scikit-learn/scikit-learn a recommended Machine Learning Data Pipelines repository?

Extracts and scales features to ensure raw data meets the strict input requirements of machine learning models.

Why is keras-team/keras a recommended Machine Learning Data Pipelines repository?

Integrates utilities to load, preprocess, and format diverse data types for efficient training pipelines.

Why is d2l-ai/d2l-en a recommended Machine Learning Data Pipelines repository?

Integrates image normalization and augmentation into automated data loading workflows for training.

Why is heartexlabs/label-studio a recommended Machine Learning Data Pipelines repository?

Organizes and cleans raw data through labeling and formatting to make it compatible with model training pipelines.

Why is sinaptik-ai/pandas-ai a recommended Machine Learning Data Pipelines repository?

Provides tools for transforming and normalizing raw information into structured formats optimized for machine learning models.

Why is zergtant/pytorch-handbook a recommended Machine Learning Data Pipelines repository?

Provides structured guides for building training data pipelines to preprocess diverse data types.

Why is apache/mxnet a recommended Machine Learning Data Pipelines repository?

Streams, resizes, and prefetches training data to ensure high-throughput delivery to models.

Why is NVIDIA-NeMo/NeMo a recommended Machine Learning Data Pipelines repository?

Cleans and filters large-scale multimodal datasets using accelerated workflows to ensure high-quality training inputs.

49 repositories

Awesome GitHub Repositories: Machine Learning Data Pipelines

Specialized workflows for preparing, augmenting, and streaming datasets specifically for model training and feature engineering.

Explore 49 awesome GitHub repositories in the category Data & Databases · Machine Learning Data Pipelines.


hiyouga/LLaMA-Factory
72,241 · View on GitHub
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Manages training data pipelines that integrate cloud/local storage with synthetic data generation.
Python
unslothai/unsloth
66,628 · View on GitHub
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
Structures and generates synthetic training data via visual workflows to improve model learning efficacy.
Python · agent · deepseek · deepseek-r1
scikit-learn/scikit-learn
66,344 · View on GitHub
Scikit-learn is a machine learning library for predictive data analysis that provides a collection of algorithms for supervised and unsupervised learning. It functions as a comprehensive toolkit for data preprocessing, dimensionality reduction, and model selection, allowing users to classify data objects, predict continuous values, and cluster similar items based on historical patterns. The project is defined by a unified interface design where objects either learn from data, transform data, or chain these operations into sequential workflows. To ensure performance on large or high-dimensiona
Extracts and scales features to ensure raw data meets the strict input requirements of machine learning models.
Python · data-analysis · data-science · machine-learning
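The learn/transform/chain interface described for scikit-learn can be illustrated without the library itself. What follows is a minimal pure-Python sketch of that idea, not scikit-learn's implementation: the class names `MeanStdScaler` and `SimplePipeline` are hypothetical and exist only for this example.

```python
from statistics import mean, pstdev


class MeanStdScaler:
    """Hypothetical transformer: learns per-column mean and std, then rescales."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [mean(col) for col in columns]
        # Fall back to 1.0 so constant columns do not divide by zero.
        self.stds_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means_, self.stds_)]
            for row in rows
        ]


class SimplePipeline:
    """Hypothetical chain: fits each step in order and feeds its output onward."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return rows


raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = SimplePipeline([MeanStdScaler()]).fit_transform(raw)
```

After scaling, each column has zero mean and unit variance, so the middle row of `raw` maps to zeros in both columns.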
keras-team/keras
64,094 · View on GitHub
Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management. The project distinguishes itself as a multi-backend machine learning
Integrates utilities to load, preprocess, and format diverse data types for efficient training pipelines.
Python · data-science · deep-learning · jax
d2l-ai/d2l-en
29,001 · View on GitHub
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Integrates image normalization and augmentation into automated data loading workflows for training.
Python · book · computer-vision · data-science
heartexlabs/label-studio
27,626 · View on GitHub
Label Studio is a multi-type data labeling tool and annotation workspace designed to prepare datasets for machine learning training. It works as a cloud-integrated data pipeline that imports raw data from storage, manages the annotation process, and exports labels to standardized formats. The platform features a machine learning integration framework that connects to external model servers. This enables model-assisted annotation and active learning, allowing the system to pre-label data and refine predictions based on human feedback. The software provides project management tools for organizing datasets and assigning tasks to users through role-based access. It supports various data types and uses backend-agnostic storage adapters to connect to local file systems or cloud storage providers. The application can be installed through manual setup or one-click deployments on cloud infrastructure.
Organizes and cleans raw data through labeling and formatting to make it compatible with model training pipelines.
TypeScript
sinaptik-ai/pandas-ai
23,197 · View on GitHub
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Provides tools for transforming and normalizing raw information into structured formats optimized for machine learning models.
Python · ai · csv · data
zergtant/pytorch-handbook
21,658 · View on GitHub
This project is a comprehensive educational resource and technical documentation suite for learning and developing deep learning models. It serves as an open-source textbook, implementation manual, and framework tutorial designed to guide users through the mathematical foundations and practical application of neural networks. The resource provides detailed instructional content on building various model architectures, including convolutional and recurrent neural networks. It includes a dedicated distributed training guide and a learning path that covers the fundamentals of tensors, automatic
Provides structured guides for building training data pipelines to preprocess diverse data types.
Jupyter Notebook · deep-learning · machine-learning · neural-network
apache/mxnet
20,829 · View on GitHub
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
Streams, resizes, and prefetches training data to ensure high-throughput delivery to models.
C++ · mxnet
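The prefetching idea mentioned for this entry, loading the next batches while the model consumes the current one, can be sketched with a background thread and a bounded queue. This is a generic illustration in plain Python, not MXNet's data-loading API; `prefetch` and `load_batches` are hypothetical names, and error handling inside the producer thread is omitted for brevity.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def load_batches(n_batches, batch_size):
    """Hypothetical stand-in for reading and resizing samples from storage."""
    for start in range(0, n_batches * batch_size, batch_size):
        yield list(range(start, start + batch_size))


consumed = list(prefetch(load_batches(n_batches=3, batch_size=4)))
```

The bounded queue caps memory use: the producer stalls once `buffer_size` batches are waiting, and resumes as soon as the consumer takes one.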
NVIDIA-NeMo/NeMo
17,389 · View on GitHub
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parall
Cleans and filters large-scale multimodal datasets using accelerated workflows to ensure high-quality training inputs.
Python · asr · deeplearning · generative-ai
screenpipe/screenpipe
16,932 · View on GitHub
Screenpipe is a local-first platform designed to record, index, and analyze desktop activity. By capturing screen, audio, and keyboard input, it creates a comprehensive and searchable history of computer usage. The system functions as an activity recorder and automation framework, providing a persistent, context-aware memory that allows artificial intelligence agents to observe and interact with local desktop environments. The platform distinguishes itself through a privacy-focused architecture that processes all data locally. It utilizes on-device computer vision and speech recognition to tr
Processes and sanitizes desktop activity data into structured datasets suitable for training computer-use models and automating professional workflows.
Rust · agents · agi · ai
ddbourgin/numpy-ml
16,275 · View on GitHub
This library is a collection of machine learning algorithms and neural network components implemented from scratch using only NumPy. It serves as an educational toolkit for constructing and experimenting with machine learning architectures, emphasizing a modular approach where algorithms are organized into self-contained, object-oriented classes. The project distinguishes itself by relying exclusively on array-oriented programming to perform mathematical operations, ensuring that all computations are vectorized for performance. By utilizing a standardized interface for forward and backward pa
Provides utilities for transforming raw signals and text into structured formats for machine learning.
Python · attention · bayesian-inference · gaussian-mixture-models
modelscope/swift
14,633 · View on GitHub
Swift is a toolkit for the full-parameter and parameter-efficient fine-tuning of large language and multimodal models. It functions as a multimodal model trainer for text, image, video, and audio data, and includes specialized tools for model compression and reinforcement learning from human feedback. The framework provides an alignment toolkit for optimizing model behavior using preference learning algorithms and reinforcement learning. It integrates parameter-efficient fine-tuning methods to adapt models with minimal memory and compute requirements, alongside utilities for reducing hardware
Optimizes multimodal training throughput by packing diverse data types into sequences to prevent padding waste.
Python
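Sequence packing, as mentioned for this entry, places several short samples into one fixed-length training sequence so that fewer positions are spent on padding. The sketch below uses a greedy first-fit strategy as an assumption; it is not Swift's actual packing algorithm, and the function names are hypothetical.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit packing: put each sequence in the first bin with room."""
    bins = []  # each bin is a list of sequence indices
    remaining = []  # free token slots per bin
    for index, length in enumerate(lengths):
        if length > max_len:
            raise ValueError(f"sequence {index} exceeds max_len={max_len}")
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(index)
                remaining[b] -= length
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([index])
            remaining.append(max_len - length)
    return bins


def padding_tokens(lengths, bins, max_len):
    """Count the pad tokens needed to fill every bin up to max_len."""
    return sum(max_len - sum(lengths[i] for i in group) for group in bins)


lengths = [6, 3, 5, 2, 4]
packed = pack_sequences(lengths, max_len=8)
unpacked = [[i] for i in range(len(lengths))]
```

For these five sequences, padding each one separately to length 8 costs 20 pad tokens, while the packed layout uses three bins and only 4 pad tokens.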
alibaba/MNN
14,242 · View on GitHub
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
Implements user-defined data loading logic for retrieving samples and managing dataset sizes during training.
C++ · arm · convolution · deep-learning
h2oai/h2ogpt
12,016 · View on GitHub
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Automates the identification and calculation of data features to enhance predictive model performance.
Python · ai · chatgpt · embeddings
rerun-io/rerun
10,214 · View on GitHub
Rerun is a multimodal data visualizer and robotics data logger designed for rendering synchronized streams of 3D spatial data, images, and time-series metrics. It functions as a tool for capturing high-frequency sensor data and AI outputs into a queryable columnar format, providing a dedicated interface for viewing MCAP recording files and analyzing physical environments. The project distinguishes itself as a machine learning dataset streamer, capable of feeding logged recordings directly into GPU buffers and PyTorch training pipelines without intermediate exports. It supports a high-performa
Streams logged recordings directly into PyTorch or GPU buffers to eliminate manual data export steps.
Rust · computer-vision · cpp · multimodal
OpenGVLab/InternVL
10,061 · View on GitHub
InternVL is a vision-language model framework that fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning. It provides a system for multimodal inference and dialogue, enabling the processing of images and text to answer questions or generate descriptions. The project is distinguished by its high-resolution image processing, which uses dynamic tiling to maintain detail for images up to 4K resolution, and its chain-of-thought visual reasoning for solving complex mathematical and spatial problems. It also supports temporal frame sampling
Implements JSONL-based data formatting to support text, single-image, multi-image, and video inputs for training.
Python · gpt · gpt-4o · gpt-4v
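JSONL stores one JSON object per line, which makes it straightforward to mix text-only, single-image, multi-image, and video samples in one training file. The sketch below shows the general pattern only: the field names (`id`, `images`, `video`, `text`) and file names are assumptions for illustration, not InternVL's actual schema.

```python
import io
import json

# Hypothetical records: the field names are illustrative assumptions,
# not the schema InternVL expects.
records = [
    {"id": 0, "text": "Summarize the paragraph."},
    {"id": 1, "images": ["cat.jpg"], "text": "What animal is shown?"},
    {"id": 2, "images": ["a.jpg", "b.jpg"], "text": "Compare the two images."},
    {"id": 3, "video": "clip.mp4", "text": "Describe the action."},
]


def write_jsonl(rows, handle):
    """Write one JSON object per line."""
    for row in rows:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Parse each non-empty line as an independent JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


def modality(row):
    """Classify a record by which media fields it carries."""
    if "video" in row:
        return "video"
    count = len(row.get("images", []))
    if count == 0:
        return "text"
    return "single-image" if count == 1 else "multi-image"


buffer = io.StringIO()  # stands in for a file on disk
write_jsonl(records, buffer)
buffer.seek(0)
kinds = [modality(row) for row in read_jsonl(buffer)]
```

Because every line is parsed independently, a loader can stream such a file record by record instead of reading it whole.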
lyhue1991/eat_tensorflow2_in_30_days
9,933 · View on GitHub
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Loads and formats diverse data types like images and text for training pipelines.
Python · tensorflow · tensorflow-examples · tensorflow-tutorial
activeloopai/deeplake
9,175 · View on GitHub
DeepLake is AI data infrastructure consisting of a multimodal data lake, a hybrid search engine, and a serverless vector database. It provides a PostgreSQL-based AI data runtime that combines multimodal storage with streaming pipelines to load and shuffle datasets from cloud storage directly into deep learning training pipelines. The system utilizes lazy indexing to store and slice images, audio, and video without loading entire files into memory. It enables retrieval-augmented generation by persisting high-dimensional embeddings in a serverless vector store and implementing hybrid search tha
Provides pipelines that load, shuffle, and format diverse multimodal data types for deep learning training.
C++ · agent · agentic-rag · ai
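Shuffling a dataset that is streamed from remote storage is commonly approximated with a bounded in-memory buffer, since the full dataset never sits in memory at once. The following is a generic sketch of that technique under stated assumptions; it does not use or reproduce DeepLake's API, and `shuffled_stream` is a hypothetical name.

```python
import random


def shuffled_stream(samples, buffer_size, seed=0):
    """Approximately shuffle a stream using a bounded in-memory buffer."""
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            pick = rng.randrange(len(buffer))
            buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
            yield buffer.pop()
    # Drain whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer


stream = (f"sample-{i}" for i in range(10))
shuffled = list(shuffled_stream(stream, buffer_size=4))
```

A larger `buffer_size` gives a result closer to a uniform shuffle at the cost of more memory; every input sample is still emitted exactly once.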
lancedb/lancedb
9,031 · View on GitHub
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Provides a specialized SDK for transforming raw data into formats optimized for machine learning and vector storage.
HTML · approximate-nearest-neighbor-search · image-search · nearest-neighbor-search
