Why is jakevdp/PythonDataScienceHandbook a recommended Data Preprocessing GitHub repository?

Converts categorical data into numerical formats for model input.
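
As a rough illustration of what converting categorical data into numerical form can look like, here is a minimal one-hot encoding sketch in plain Python. The function name `one_hot_encode` and the sample data are hypothetical and are not taken from the repository.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator row (columns sorted by category name)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        encoded.append(row)
    return categories, encoded


# Hypothetical sample column
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

In practice a library encoder would usually be preferred; the sketch only shows the idea of one indicator column per category.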

Why is deepseek-ai/DeepSeek-Coder a recommended Data Preprocessing GitHub repository?

Formats raw data through truncation, padding, and token insertion to meet model architecture requirements.
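
The three steps named above (truncation, padding, and token insertion) can be sketched generically as follows. The special token ids (`bos_id`, `eos_id`, `pad_id`) and the function name are assumptions for illustration only and do not reflect the model's actual tokenizer or vocabulary.

```python
def prepare_sequence(token_ids, max_len, bos_id=1, eos_id=2, pad_id=0):
    """Insert BOS/EOS markers, truncate to max_len, and right-pad with pad_id.

    Returns (input_ids, attention_mask), both of length max_len.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for the BOS and EOS tokens")
    body = token_ids[: max_len - 2]          # truncation
    seq = [bos_id] + body + [eos_id]         # token insertion
    mask = [1] * len(seq)
    padding = max_len - len(seq)             # padding
    return seq + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = prepare_sequence([5, 6, 7], max_len=6)
print(short_ids, short_mask)   # [1, 5, 6, 7, 2, 0] [1, 1, 1, 1, 1, 0]

long_ids, long_mask = prepare_sequence([5, 6, 7, 8, 9, 10], max_len=6)
print(long_ids, long_mask)     # [1, 5, 6, 7, 8, 2] [1, 1, 1, 1, 1, 1]
```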

Why is microsoft/onnxruntime a recommended Data Preprocessing GitHub repository?

Transforms raw inputs like text or images into tensor formats required by models using integrated operators.

Why is autogluon/autogluon a recommended Data Preprocessing GitHub repository?

Stores transformed data to skip the preprocessing stage during repeated prediction calls.
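
The general idea of storing transformed data so that repeated calls skip preprocessing can be sketched with a small cache wrapper. This is not AutoGluon's API; the class `CachingPreprocessor` and the toy transform are hypothetical.

```python
class CachingPreprocessor:
    """Wrap a transform function and reuse its output for inputs seen before."""

    def __init__(self, transform):
        self.transform = transform
        self.cache = {}
        self.transform_calls = 0  # counts how often the real transform ran

    def __call__(self, rows):
        key = tuple(tuple(row) for row in rows)  # hashable snapshot of the input
        if key not in self.cache:
            self.transform_calls += 1
            self.cache[key] = self.transform(rows)
        return self.cache[key]


def scale_by_ten(rows):
    """Toy stand-in for an expensive preprocessing step."""
    return [[x / 10 for x in row] for row in rows]


preprocess = CachingPreprocessor(scale_by_ten)
batch = [[10, 20], [30, 40]]
first = preprocess(batch)    # runs the transform
second = preprocess(batch)   # served from the cache
print(first, preprocess.transform_calls)  # [[1.0, 2.0], [3.0, 4.0]] 1
```

The trade-off is memory for speed: the cache holds one transformed copy per distinct input.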

Why is catboost/catboost a recommended Data Preprocessing GitHub repository?

Uses specialized categorical data types during input preparation to speed up the preprocessing of categorical features.

Why is lmcinnes/umap a recommended Data Preprocessing GitHub repository?

Reduces high-dimensional data to a lower-dimensional manifold to improve density-based clustering performance.

Why is rasbt/python-machine-learning-book-2nd-edition a recommended Data Preprocessing GitHub repository?

Provides workflows for cleaning, scaling, and encoding raw datasets to prepare them for machine learning.

Why is instillai/machine-learning-course a recommended Data Preprocessing GitHub repository?

Provides a comprehensive workflow for cleaning, transforming, and encoding data to prepare it for machine learning models.

Why is open-edge-platform/anomalib a recommended Data Preprocessing GitHub repository?

Anomalib applies transformations to raw images before passing them to the anomaly detection model.

Why is biolab/orange3 a recommended Data Preprocessing GitHub repository?

Applies transformations such as normalization, imputation, or feature selection to prepare data for modeling.
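
Two of the transformations mentioned above, imputation and normalization, can be sketched in plain Python. Orange3 exposes these through its own widgets and classes; the helper names below are hypothetical and only show the underlying arithmetic, with `None` assumed to mark a missing value.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to spread out
    return [(x - lo) / (hi - lo) for x in column]


ages = [20, None, 40, 30]          # hypothetical column with one missing value
filled = impute_mean(ages)
scaled = min_max_scale(filled)
print(filled)  # [20, 30.0, 40, 30]
print(scaled)  # [0.0, 0.5, 1.0, 0.5]
```

Imputation runs first so that the scaling step sees a complete column.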

12 repositories

Awesome GitHub Repositories · Data Preprocessing

Tools for cleaning, transforming, and encoding data for model consumption.

Distinguishing note: Focuses on categorical encoding.

Explore 12 awesome GitHub repositories matching Artificial Intelligence & ML · Data Preprocessing.


jakevdp/PythonDataScienceHandbook
48,561 stars
This project is an interactive data science environment that combines code execution, rich media visualization, and narrative documentation into a persistent, browser-based platform. It serves as a comprehensive educational resource for scientific computing, providing a framework for iterative data analysis and machine learning prototyping. The environment is distinguished by its focus on high-performance numerical computing, utilizing vectorized array operations and memory-mapped data structures to handle large-scale computations efficiently. It features a unified estimator interface that st
Converts categorical data into numerical formats for model input.
Jupyter Notebook · jupyter-notebook · matplotlib · numpy
deepseek-ai/DeepSeek-Coder
22,804 stars
DeepSeek-Coder is a large language model and foundational neural network architecture designed specifically for software development tasks. It functions as an artificial intelligence assistant capable of interpreting complex programming instructions to generate, transpile, and structure source code. The system distinguishes itself through its ability to perform project-level code generation, analyzing broader context and patterns across entire software projects rather than isolated files. It supports multimodal input processing, allowing for the integration of text and visual data to inform i
Formats raw data through truncation, padding, and token insertion to meet model architecture requirements.
Python
microsoft/onnxruntime
19,347 stars
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
Transforms raw inputs like text or images into tensor formats required by models using integrated operators.
C++ · ai-framework · deep-learning · hardware-acceleration
autogluon/autogluon
9,997 stars
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Stores transformed data to skip the preprocessing stage during repeated prediction calls.
Python · autogluon · automated-machine-learning · automl
catboost/catboost
8,808 stars
CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu
Uses specialized categorical data types during input preparation to speed up the preprocessing of categorical features.
C++ · big-data · catboost · categorical-features
lmcinnes/umap
8,215 stars
This project is a manifold learning and non-linear dimensionality reduction library used to project high-dimensional data into lower-dimensional spaces while preserving topological structure. It functions as a parametric embedding framework and a topological data visualization library for identifying clusters and patterns within complex datasets. The library distinguishes itself through parametric neural mapping, which uses neural networks to learn functional mappings that allow for out-of-sample projections and the reconstruction of original data. It supports supervised and semi-supervised d
Reduces high-dimensional data to a lower-dimensional manifold to improve density-based clustering performance.
Python · dimensionality-reduction · machine-learning · topological-data-analysis
rasbt/python-machine-learning-book-2nd-edition
7,194 stars
This project is a machine learning educational resource and implementation guide for Python. It provides a collection of executable code and notebooks that demonstrate predictive modeling, data analysis workflows, and the implementation of various machine learning algorithms. The repository features practical examples of classification, regression, and clustering tasks using Scikit-Learn, alongside tutorials for building and training deep learning architectures with TensorFlow. These include implementations of convolutional and recurrent networks. The content covers a broad range of capabili
Provides workflows for cleaning, scaling, and encoding raw datasets to prepare them for machine learning.
Jupyter Notebook · data-science · deep-learning · machine-learning
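instillai/machine-learning-course
7,043 stars
This is a comprehensive educational course designed to teach the fundamentals of machine learning using the Python programming language. It provides a structured curriculum covering the implementation and theory of supervised learning, unsupervised learning, and deep learning. The course is delivered through interactive notebooks that combine executable code with technical tutorials. It includes dedicated guides for building neural network architectures, implementing classification and regression models, and using clustering techniques to discover patterns in unlabeled data. The materials cover the complete machine learning workflow, including data preprocessing and categorical encoding, model training and hyperparameter tuning, and performance evaluation. It also features tools for visualizing model behavior, such as decision boundary plots and decision tree diagrams.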
Provides a comprehensive workflow for cleaning, transforming, and encoding data to prepare it for machine learning models.
Python
open-edge-platform/anomalib
5,871 stars
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Anomalib applies transformations to raw images before passing them to the anomaly detection model.
Python · anomaly-detection · anomaly-localization · anomaly-segmentation
biolab/orange3
5,635 stars
Orange3 is a visual data mining platform that provides an interactive canvas for building data analysis workflows without writing code. At its core, it offers a widget-based visual programming environment where users connect configurable components to perform data preprocessing, machine learning model training, statistical evaluation, and interactive visualization. The platform is built on NumPy-backed data tables with domain descriptors that define variable names, types, and roles, and includes a lazy SQL query proxy for working with database tables without loading all data into memory. The
Applies transformations such as normalization, imputation, or feature selection to prepare data for modeling.
Python
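TingsongYu/PyTorch-Tutorial-2nd
4,555 stars
This is a comprehensive teaching resource and course on building neural networks with PyTorch. It covers the fundamental building blocks of deep learning, including tensor operations, automatic differentiation, and the construction of modular neural network components. The repository serves as a reference guide for several specialized domains. It provides implementation details for computer vision tasks (such as image classification, object detection, and semantic segmentation), as well as natural language processing workflows involving Transformers, recurrent networks, and generative models. It also includes reference material on generative AI, focusing specifically on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and improving inference speed through quantization and by exporting models to formats such as ONNX and TensorRT. Other areas covered include data engineering for parallel loading, model evaluation with custom metrics, and deployment of open-source large language models. The project is delivered primarily as a series of Jupyter Notebooks.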
Provides tools for cleaning, transforming, and encoding raw data to prepare it for model consumption.
Jupyter Notebook · computer-vision · deepsort · diffusion-models
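google-deepmind/learning-to-learn
4,068 stars
This project is a TensorFlow meta-learning framework and research toolkit for implementing and training learned optimizers. It provides a set of tools for developing neural networks that learn how to optimize other models, replacing traditional gradient-based optimization algorithms. The framework includes a problem-ensemble manager that allows multiple different optimization tasks to be combined into a single weighted loss function for simultaneous training. It uses a factory pattern for network instantiation and supports defining custom objective functions and loss graphs as targets for the learned algorithm. The toolkit covers a broad range of functionality, including gradient-based meta-optimization, model benchmarking, and training-loop execution with configurable unroll lengths. It also provides tools for gradient preprocessing, serialized state persistence, and reporting experiment statistics such as mean final error and epoch duration.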
Transforms input gradients using logarithmic scaling and sign extraction to prepare them for model consumption.
Python · artificial-intelligence · deep-learning · machine-learning
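
The logarithmic scaling and sign extraction mentioned for learning-to-learn can be sketched for a single scalar gradient. This follows one formulation from the learned-optimizer literature: large gradients become a (scaled log-magnitude, sign) pair, and very small ones are rescaled linearly. The threshold parameter `p = 10` and the function name are assumptions here and may not match the repository's code.

```python
import math


def preprocess_gradient(g, p=10.0):
    """Return a two-component encoding of a scalar gradient.

    |g| >= e^-p : (log|g| / p, sign(g))
    otherwise   : (-1, e^p * g)
    """
    if abs(g) >= math.exp(-p):
        return (math.log(abs(g)) / p, math.copysign(1.0, g))
    return (-1.0, math.exp(p) * g)


print(preprocess_gradient(1.0))   # (0.0, 1.0)
print(preprocess_gradient(-2.0))  # magnitude term log(2)/10, sign term -1.0
print(preprocess_gradient(0.0))   # (-1.0, 0.0)
```

The point of the two-branch form is to keep inputs to the optimizer network in a comparable range even when gradient magnitudes span many orders of magnitude.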

Awesome Data Preprocessing GitHub Repositories

jakevdp/PythonDataScienceHandbook

deepseek-ai/DeepSeek-Coder

microsoft/onnxruntime

autogluon/autogluon

catboost/catboost

lmcinnes/umap

rasbt/python-machine-learning-book-2nd-edition

instillai/machine-learning-course

open-edge-platform/anomalib

biolab/orange3

TingsongYu/PyTorch-Tutorial-2nd

google-deepmind/learning-to-learn
