21 repositories
Tools for cleaning and formatting raw data for machine learning ingestion.
Distinguishing note: Focuses on training-specific data preparation, distinct from general data cleaning.
Explore 21 awesome GitHub repositories matching data & databases · Data Preprocessing Pipelines.
Keras is a high-level deep learning API used to design, build, and train neural networks for tasks such as computer vision, natural language processing, and time series forecasting. It provides a framework for defining model architectures and optimizing weights through a structured interface. The project is defined by a backend-agnostic design that allows the same model code to run across different compute engines. This multi-backend execution enables users to swap underlying engines to optimize for specific hardware or performance requirements. The system supports distributed model training
Ships data preprocessing pipelines to clean and format raw datasets for efficient machine learning ingestion.
This project provides a collection of practical machine learning code examples, including implementations for supervised, unsupervised, and reinforcement learning algorithms. It features deep learning model implementations for convolutional, recurrent, and generative architectures, alongside specific examples of reinforcement learning agents that maximize rewards in simulated environments. The repository includes dedicated data preprocessing pipelines for sanitization, feature scaling, and dimensionality reduction. It also provides implementations for a wide range of specific models, such as
Provides dedicated pipelines for data sanitization, scaling, and dimensionality reduction.
Label Studio is a multi-modal data annotation platform designed to create and manage high-quality training datasets for machine learning. It functions as a self-hosted, containerized environment that supports secure, private deployments, including air-gapped configurations. The platform provides a centralized workspace for labeling diverse media types, such as images, text, audio, and time-series data, to support supervised and reinforcement learning workflows. The platform distinguishes itself through deep integration with machine learning backends, enabling active learning loops, automated
Applies automated preprocessing routines to raw data inputs to prepare them for manual annotation or model training.
This project is a deep learning library designed for training neural networks on irregular data structures, including graphs, 3D meshes, and point clouds. It functions as an extension to the PyTorch framework, providing specialized layers and kernels that enable the processing of complex, non-Euclidean information. The library distinguishes itself through a geometric deep learning toolkit that manages the unique requirements of graph-based data. It utilizes sparse matrix-based message passing to aggregate information across nodes and employs dynamic computational graph construction to accommo
Automates the transformation and feature engineering of raw graph or point cloud data to prepare it for neural network input.
This project is an educational resource providing practical code examples and implementations of machine learning algorithms using the Python language. It serves as a guide for constructing predictive pipelines, clustering models, and dimensionality reduction within the Scikit-Learn ecosystem. The repository includes comprehensive demonstrations for supervised and unsupervised learning, as well as detailed examples for implementing neural networks and deep architectures. It also provides practical guidance on exporting model parameters to JSON and wrapping trained models in web APIs for produ
Provides implementations of pipelines that sequence data preprocessing and estimator steps into a single workflow.
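Nerfstudio is a modular development framework for training, visualizing, and exporting 3D scene representations derived from 2D image datasets. It provides a neural scene reconstruction pipeline that uses a differentiable volumetric renderer to convert raw images and camera data into high-fidelity 3D assets and cinematic video. The system features an interactive web-based viewer that lets users monitor training progress in real time and inspect neural scene geometry. It decouples neural network architectures from the training loop through standardized modular interfaces, enabling the development of and experimentation with custom neural radiance field architectures. The framework covers a broad range of capabilities, including dataset preprocessing for camera pose computation, model fidelity evaluation, and cinematic video sequence generation through camera trajectory interpolation. It also includes utilities for exporting trained scenes as 3D assets and point clouds for use in external modeling software. Consistent hardware execution is supported through a containerized environment that bundles graphics drivers and system dependencies.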
Provides pipelines for calculating camera poses and spatial orientations from raw visual inputs.
This is a cross-platform framework for building, training, and deploying custom machine learning models within the .NET ecosystem. It provides a predictive modeling engine for classification, regression, and forecasting tasks, alongside an inference runtime to generate predictions across different hardware architectures. The framework includes a gradient boosting library and supports interoperability with external models via a standardized open format. It features tools for prediction explainability, allowing the analysis of feature importance to debug model behavior and identify bias. The p
Provides tools for cleaning and transforming raw datasets from files or databases to prepare them for ML pipelines.
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Builds input pipelines to clean and transform data before feeding it into machine learning models.
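River is a Python framework for online machine learning, designed for training and evaluating models on streaming data. It updates model parameters by processing one observation at a time, enabling incremental learning without storing the full training dataset in memory. The library stands out through a dedicated concept drift detection system that monitors changes in the data distribution to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing on each sample before training on it. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.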
Chains preprocessing and estimation steps into sequential workflows for transforming raw streaming features.
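The one-observation-at-a-time preprocessing described above can be sketched in plain Python. This is a minimal illustration, not River's implementation: the class `RunningStandardScaler` is hypothetical, its method names merely echo the learn-then-transform style of streaming libraries, and it uses Welford's algorithm for the running mean and variance.

```python
import math


class RunningStandardScaler:
    """Hypothetical online scaler that tracks per-feature running mean and variance."""

    def __init__(self):
        self.count = {}  # observations seen per feature
        self.mean = {}   # running mean per feature
        self.m2 = {}     # running sum of squared deviations (Welford)

    def learn_one(self, x):
        """Update the running statistics from a single observation (a dict)."""
        for name, value in x.items():
            n = self.count.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            delta = value - mean
            mean += delta / n
            self.count[name] = n
            self.mean[name] = mean
            self.m2[name] = self.m2.get(name, 0.0) + delta * (value - mean)
        return self

    def transform_one(self, x):
        """Scale one observation with the statistics seen so far."""
        scaled = {}
        for name, value in x.items():
            n = self.count.get(name, 0)
            variance = self.m2.get(name, 0.0) / n if n > 0 else 0.0
            std = math.sqrt(variance)
            # Features with zero variance (or never seen) map to 0.0.
            scaled[name] = (value - self.mean.get(name, 0.0)) / std if std > 0 else 0.0
        return scaled


scaler = RunningStandardScaler()
for temperature in (1.0, 2.0, 3.0):
    scaler.learn_one({"temperature": temperature})
centered = scaler.transform_one({"temperature": 2.0})
```

Because only three numbers per feature are kept, memory use stays constant no matter how long the stream runs, which is the property the description above attributes to incremental learning.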
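This project is a machine learning educational course and learning platform delivered through interactive Jupyter notebooks. It serves as a comprehensive guide to mastering the Python data science toolkit, offering structured tutorials on numerical computing, tabular data manipulation, and statistical visualization. The course includes concrete implementation guides for Scikit-Learn, along with hands-on TensorFlow lessons on building, training, and deploying neural networks and computer vision models. It covers the end-to-end process of building predictive models, from initial problem definition and task classification to deploying models through an interactive web interface. The project spans a broad range of functional areas, including numerical computing on multidimensional arrays, exploratory data analysis, and data preprocessing routines. It provides detailed workflows for supervised and unsupervised learning, automated machine learning pipelines, hyperparameter optimization, and model evaluation using classification metrics and cross-validation. The educational content is organized as a series of notebooks that interleave Python code with narrative explanation to document data science workflows.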
Provides tools for cleaning and formatting raw data through reusable preprocessing pipelines for ML ingestion.
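LatentSync is an audio-driven video generator and latent diffusion lip-sync model designed to synchronize a speaker's lip movements in a video with a target audio track. It provides a lip-sync training framework for developing synchronization networks on custom video and audio datasets. The system uses a video preprocessing pipeline to clean, segment, and align face data. It includes a visual synchronization evaluation tool that computes confidence scores to measure the accuracy of audio-visual alignment in generated videos. The project covers custom synchronization network development, training configuration management for hardware memory and resolution, and synthetic video evaluation.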
Ships a suite of tools for cleaning, segmenting, and aligning face data to prepare video datasets.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Builds GPU-accelerated data loading and preprocessing pipelines that eliminate CPU bottlenecks.
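Leaf is a machine learning framework and neural network architecture toolkit for building, training, and deploying models. It acts as a hardware abstraction layer that maps high-level computation graphs to low-level instructions across various CPU and GPU backends and operating systems. The system enables flexible model structure design through a modular architecture in which reusable container layers encapsulate weights and mathematical operations. This allows complex neural networks to be composed from nested components. The framework includes a data engineering pipeline for converting raw datasets into clean tensors, and a computational performance profiler that uses diagnostic tools to identify runtime bottlenecks. These features support high-performance computing optimization and cross-hardware model deployment.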
Transforms raw datasets into clean, structured formats through a processing pipeline for model inference.
This project provides a translated version of the scikit-learn machine learning library guides and API references for Chinese speakers. It serves as a localized knowledge base and technical reference for implementing predictive data analysis and statistical modeling using a Python-based toolkit. The resource covers the implementation of supervised learning, including classification and regression tasks, and unsupervised learning workflows for pattern discovery and anomaly detection. It also provides guidance on data science education, specifically focusing on the use of scikit-learn for machi
Describes how to chain scaling and imputation steps into a unified pipeline for model ingestion.
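The idea of chaining imputation and scaling into one unit can be sketched without any dependencies. The classes below (`MeanImputer`, `ZScoreScaler`, `SimplePipeline`) are hypothetical stand-ins written for illustration; in scikit-learn itself the corresponding pieces are `SimpleImputer`, `StandardScaler`, and `Pipeline`. The sketch assumes rows are lists of floats with `None` marking a missing value.

```python
from statistics import mean, pstdev


class MeanImputer:
    """Hypothetical step: replace None with the column mean learned in fit()."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.fill_ = [mean(v for v in col if v is not None) for col in columns]
        return self

    def transform(self, rows):
        return [
            [self.fill_[j] if v is None else v for j, v in enumerate(row)]
            for row in rows
        ]


class ZScoreScaler:
    """Hypothetical step: subtract the column mean and divide by its std."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mean_ = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.std_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.std_)]
            for row in rows
        ]


class SimplePipeline:
    """Hypothetical container that applies its steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, rows):
        # Each step is fitted on the output of the previous one.
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return rows

    def transform(self, rows):
        # Reuse the statistics learned during fit_transform.
        for step in self.steps:
            rows = step.transform(rows)
        return rows


train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
pipeline = SimplePipeline([MeanImputer(), ZScoreScaler()])
train_scaled = pipeline.fit_transform(train)
new_scaled = pipeline.transform([[None, 6.0]])
```

Keeping the fitted statistics inside the pipeline object means new data is imputed and scaled with the training-set values, which avoids leaking information from evaluation data into preprocessing.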
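Kaolin is a PyTorch 3D deep learning library that provides a comprehensive set of tools for 3D geometry processing, physics simulation, data visualization, and gradient-based rendering for computer vision. The library includes a differentiable 3D renderer and a geometry processing toolkit for converting and transforming 3D representations such as meshes and point clouds. It also features a 3D physics simulation engine for computing physical interactions and collisions between three-dimensional objects and scenes. The toolkit provides utilities for 3D data visualization, including the creation of interactive views and turntable animations. Other capabilities cover 3D dataset management, data preprocessing, and rendering of 3D representations.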
Implements 3D spatial preprocessing pipelines to transform data formats for improved deep learning training speed.
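This is a comprehensive teaching resource and course on building neural networks with PyTorch. It covers the fundamental building blocks of deep learning, including tensor operations, automatic differentiation, and the construction of modular neural network components. The repository is a reference guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving Transformers, recurrent networks, and generative models. It also includes reference material on generative AI, focusing specifically on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and improving inference speed through quantization and exporting models to formats such as ONNX and TensorRT. Other capability areas include data engineering for parallel loading, model evaluation with custom metrics, and the deployment of open-source large language models. The project is delivered primarily as a series of Jupyter notebooks.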
Implements multi-process data loading to ensure the GPU remains saturated during training.
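This project is a collection of machine learning algorithms and tools implemented from scratch in Python. It serves as a core algorithm library covering regression, classification, and clustering models, and aims to expose the mathematical structure underlying these algorithms without relying on high-level machine learning frameworks. The project focuses on manual implementation of algorithm logic, including neural networks with forward propagation and weight updates, and a variety of supervised and unsupervised learning models. It uses NumPy for vectorized processing to perform matrix computations and mathematical operations on large datasets. The toolkit covers a broad range of functionality, including dimensionality reduction via principal component analysis (PCA) and data preprocessing for numerical and image datasets. The algorithm implementations cover linear regression, Bayesian regression, K-Means clustering, and several classification methods such as support vector machines (SVM), decision trees, and K-nearest neighbors (KNN). The project is delivered as a series of Jupyter notebooks.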
Implements a preprocessing pipeline that transforms raw numerical and image data into standardized formats.
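This is a comprehensive educational resource and tutorial handbook for building, training, and deploying machine learning models with TensorFlow 2. It serves as a structured learning guide covering the core concepts of deep learning, including neural network architectures, automatic differentiation, and tensor operations. The handbook offers technical guidance on optimizing execution efficiency through GPU memory management, distributed training, and model quantization. It also includes detailed guides for building high-performance data pipelines and for exporting models to production servers, mobile devices, and web browsers. The material covers a broad range of functionality, including model development with convolutional and recurrent networks, the implementation of custom loss functions and layers, and transfer learning with pretrained models. It also explores deployment strategies for edge devices and hardware acceleration using cloud-based runtimes. The resource is implemented as a collection of Jupyter notebooks.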
Details the creation and transformation of datasets using parallelization strategies for model feeding.
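A parallel dataset transformation of the kind mentioned above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the handbook's code and not the `tf.data` API: `parallel_map`, `batched`, and `normalize` are hypothetical helpers, and a thread pool stands in for the framework's own parallel `map` and batching.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(transform, records, num_workers=4):
    """Apply `transform` to every record with a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        yield from pool.map(transform, records)


def batched(items, batch_size):
    """Group an iterable into lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def normalize(record):
    """Hypothetical per-record transform: rescale a pixel-like value to [0, 1]."""
    return {"x": record["x"] / 255.0, "label": record["label"]}


raw = [{"x": float(i), "label": i % 2} for i in range(10)]
batches = list(batched(parallel_map(normalize, raw), batch_size=4))
```

Threads in CPython mainly help when the transform waits on I/O such as file reads or decoding in native code; for CPU-bound pure-Python transforms a process pool is the usual alternative.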
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
Exports preprocessed tabular features for use with libraries like XGBoost or Random Forests.
Scanpy is a Python library for the preprocessing, visualization, and analysis of large-scale single-cell gene expression datasets. It serves as a toolkit for single-cell RNA sequencing analysis, providing a framework to process and analyze genomic data from individual cells to identify biological markers and cell types. The library includes a scalable data processing pipeline for cleaning and preparing genomic data, a clustering framework for grouping cells with similar expression profiles, and a system for modeling transitions between cell states to reconstruct biological development and dif
Provides vectorized preprocessing pipelines using NumPy and SciPy for high-throughput normalization and scaling of cell data.