12 repositories
Tools for cleaning, transforming, and encoding data for model consumption.
Distinguishing note: Focuses on categorical encoding.
Explore 12 GitHub repositories matching Artificial Intelligence & ML · Data Preprocessing.
This project is an interactive data science environment that combines code execution, rich media visualization, and narrative documentation into a persistent, browser-based platform. It serves as a comprehensive educational resource for scientific computing, providing a framework for iterative data analysis and machine learning prototyping. The environment is distinguished by its focus on high-performance numerical computing, utilizing vectorized array operations and memory-mapped data structures to handle large-scale computations efficiently. It features a unified estimator interface that st
Converts categorical data into numerical formats for model input.
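As a generic illustration of converting categorical values into numbers (not this repository's own code), the sketch below one-hot encodes a list of labels using only the standard library. The function name `one_hot_encode` and the choice to sort categories alphabetically are assumptions made for the example.

```python
def one_hot_encode(values):
    """Map each distinct category to a one-hot vector.

    Categories are sorted so the column order is deterministic.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this category
        encoded.append(row)
    return categories, encoded


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # [0, 0, 1]
```

One-hot encoding avoids implying an order between categories, at the cost of one column per distinct value.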
DeepSeek-Coder is a large language model and foundational neural network architecture designed specifically for software development tasks. It functions as an artificial intelligence assistant capable of interpreting complex programming instructions to generate, transpile, and structure source code. The system distinguishes itself through its ability to perform project-level code generation, analyzing broader context and patterns across entire software projects rather than isolated files. It supports multimodal input processing, allowing for the integration of text and visual data to inform i
Formats raw data through truncation, padding, and token insertion to meet model architecture requirements.
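The three steps named above (truncation, padding, token insertion) can be sketched in a few lines. This is a minimal illustration, not the model's actual tokenizer: the function `format_sequence` and the special-token ids `bos_id`, `eos_id`, and `pad_id` are hypothetical, and real tokenizers define their own ids and padding side.

```python
def format_sequence(token_ids, max_len, bos_id=1, eos_id=2, pad_id=0):
    """Insert BOS/EOS tokens, truncate to max_len, then right-pad.

    Returns the fixed-length sequence and a mask (1 = real token, 0 = padding).
    All special-token ids here are illustrative placeholders.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for BOS and EOS")
    body = list(token_ids[: max_len - 2])  # truncation: reserve two slots
    seq = [bos_id] + body + [eos_id]       # token insertion
    mask = [1] * len(seq)
    padding = max_len - len(seq)
    seq += [pad_id] * padding              # padding up to the fixed length
    mask += [0] * padding
    return seq, mask


seq, mask = format_sequence([5, 6, 7], max_len=8)
print(seq)   # [1, 5, 6, 7, 2, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```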
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
Transforms raw inputs like text or images into tensor formats required by models using integrated operators.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Stores transformed data to skip the preprocessing stage during repeated prediction calls.
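The idea of storing transformed data so repeated prediction calls skip preprocessing can be sketched as a small cache. This is a generic illustration and does not reflect AutoGluon's internals; the class `CachedPreprocessor`, its `misses` counter, and the caller-supplied `key` are all hypothetical.

```python
class CachedPreprocessor:
    """Wrap a transform and remember its output per caller-supplied key."""

    def __init__(self, transform):
        self._transform = transform
        self._cache = {}
        self.misses = 0  # how many times the transform actually ran

    def __call__(self, key, raw):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._transform(raw)
        return self._cache[key]


def to_floats(rows):
    # Stand-in for an expensive preprocessing step.
    return [[float(x) for x in row] for row in rows]


prep = CachedPreprocessor(to_floats)
first = prep("batch-1", [["1", "2"], ["3", "4"]])
second = prep("batch-1", [["1", "2"], ["3", "4"]])  # served from the cache
print(prep.misses)  # 1
```

Keying on an explicit identifier keeps the lookup cheap; a sketch that hashed the raw data instead would be safer against stale entries but slower.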
CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu
Uses specialized categorical data types during input preparation to speed up the preprocessing of categorical features.
This project is a manifold learning and non-linear dimensionality reduction library used to project high-dimensional data into lower-dimensional spaces while preserving topological structure. It functions as a parametric embedding framework and a topological data visualization library for identifying clusters and patterns within complex datasets. The library distinguishes itself through parametric neural mapping, which uses neural networks to learn functional mappings that allow for out-of-sample projections and the reconstruction of original data. It supports supervised and semi-supervised d
Reduces high-dimensional data to a lower-dimensional manifold to improve density-based clustering performance.
This project is a machine learning educational resource and implementation guide for Python. It provides a collection of executable code and notebooks that demonstrate predictive modeling, data analysis workflows, and the implementation of various machine learning algorithms. The repository features practical examples of classification, regression, and clustering tasks using Scikit-Learn, alongside tutorials for building and training deep learning architectures with TensorFlow. These include implementations of convolutional and recurrent networks. The content covers a broad range of capabili
Provides workflows for cleaning, scaling, and encoding raw datasets to prepare them for machine learning.
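This is a comprehensive educational course designed to teach the fundamentals of machine learning using the Python programming language. It provides a structured curriculum covering the implementation and theory of supervised learning, unsupervised learning, and deep learning. The course is delivered through interactive notebooks that combine executable code with technical tutorials. It includes dedicated guides for building neural network architectures, implementing classification and regression models, and using clustering techniques to discover patterns in unlabeled data. The materials cover the complete machine learning workflow, including data preprocessing and categorical encoding, model training and hyperparameter tuning, and performance evaluation. It also includes tools for visualizing model behavior, such as decision boundary plots and decision tree diagrams.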
Provides a comprehensive workflow for cleaning, transforming, and encoding data to prepare it for machine learning models.
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Anomalib applies transformations to raw images before passing them to the anomaly detection model.
Orange3 is a visual data mining platform that provides an interactive canvas for building data analysis workflows without writing code. At its core, it offers a widget-based visual programming environment where users connect configurable components to perform data preprocessing, machine learning model training, statistical evaluation, and interactive visualization. The platform is built on NumPy-backed data tables with domain descriptors that define variable names, types, and roles, and includes a lazy SQL query proxy for working with database tables without loading all data into memory. The
Applies transformations such as normalization, imputation, or feature selection to prepare data for modeling.
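Two of the transformations mentioned above, imputation and normalization, can be sketched with the standard library alone. This is an illustrative example and not Orange3's API: the helpers `impute_mean` and `min_max_normalize`, and the use of `None` to mark missing values, are assumptions.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def min_max_normalize(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]


filled = impute_mean([1.0, None, 3.0])
print(filled)                     # [1.0, 2.0, 3.0]
print(min_max_normalize(filled))  # [0.0, 0.5, 1.0]
```

Imputing before normalizing matters here, since `min` and `max` cannot compare `None` with numbers.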
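This is a comprehensive teaching resource and course on building neural networks with PyTorch. It covers the basic building blocks of deep learning, including tensor operations, automatic differentiation, and the construction of modular neural network components. The repository serves as a reference guide for several specialized domains. It provides implementation details for computer vision tasks (such as image classification, object detection, and semantic segmentation), as well as natural language processing workflows involving Transformers, recurrent networks, and generative models. It also includes reference material on generative AI, focusing specifically on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and improving inference speed through quantization and exporting models to formats such as ONNX and TensorRT. Other capability areas include data engineering for parallel loading, model evaluation with custom metrics, and the deployment of open-source large language models. The project is delivered primarily as a series of Jupyter Notebooks.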
Provides tools for cleaning, transforming, and encoding raw data to prepare it for model consumption.
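This project is a TensorFlow meta-learning framework and research toolkit for implementing and training learned optimizers. It provides a set of tools for developing neural networks that learn how to optimize other models, replacing traditional gradient-based optimization algorithms. The framework includes a problem ensemble manager that allows multiple distinct optimization tasks to be combined into a single weighted loss function for simultaneous training. It uses a factory pattern for network instantiation and supports defining custom objective functions and loss graphs as targets for the learning algorithm. The toolkit covers a broad range of functionality, including gradient-based meta-optimization, model benchmarking, and training loop execution with configurable unroll lengths. It also provides tools for gradient preprocessing, serialized state persistence, and reporting experiment statistics such as mean final error and epoch duration.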
Transforms input gradients using logarithmic scaling and sign extraction to prepare them for model consumption.
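A common way to combine logarithmic scaling with sign extraction is a two-branch mapping: large gradients become a (scaled log-magnitude, sign) pair, while very small ones are rescaled linearly. The sketch below assumes that scheme; whether this repository uses exactly this form is not stated in the listing, and the function name `preprocess_gradient` and the default `p=10.0` are illustrative assumptions.

```python
import math


def preprocess_gradient(g, p=10.0):
    """Map a scalar gradient to two features.

    If |g| >= exp(-p): (log|g| / p, sign(g))
    Otherwise:         (-1, exp(p) * g)

    p controls which gradients count as "small"; the value is an assumption.
    """
    if abs(g) >= math.exp(-p):
        return (math.log(abs(g)) / p, math.copysign(1.0, g))
    return (-1.0, math.exp(p) * g)


print(preprocess_gradient(1.0))   # (0.0, 1.0)
print(preprocess_gradient(-1.0))  # (0.0, -1.0)
print(preprocess_gradient(0.0))   # (-1.0, 0.0)
```

Separating magnitude from direction keeps both features in a bounded range even when raw gradients span many orders of magnitude.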