21 个仓库
Techniques for replacing null entries using constant values or statistical measures.
Distinct from Null Value Handling: Candidates focus on native handling or sentinel replacement; this is the general act of filling missing data.
Explore 21 awesome GitHub repositories matching data & databases · Missing Value Imputation. Refine with filters or upvote what's useful.
This project is an educational resource providing practical code examples and implementations of machine learning algorithms using the Python language. It serves as a guide for constructing predictive pipelines, clustering models, and dimensionality reduction within the Scikit-Learn ecosystem. The repository includes comprehensive demonstrations for supervised and unsupervised learning, as well as detailed examples for implementing neural networks and deep architectures. It also provides practical guidance on exporting model parameters to JSON and wrapping trained models in web APIs for produ
Estimates placeholder values for missing data using global statistics or k-nearest neighbors.
Home Assistant is a local home automation platform and server that acts as an IoT device orchestrator. It integrates diverse smart home hardware by wrapping third-party APIs into a standardized logic layer and stores all system state and historical statistics on local hardware to eliminate cloud dependencies. The system functions as a Matter IoT controller and an MQTT home automation bridge, allowing for local interoperability between different manufacturers. It features a state-based entity model and an internal event bus that decouple physical device logic from system automation. The platf
Replaces unknown or unavailable sensor states with default values or alternative logic branches.
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Replaces null values with constants, column-specific dictionaries, or calculated statistics.
This project is a machine learning educational resource and implementation guide for Python. It provides a collection of executable code and notebooks that demonstrate predictive modeling, data analysis workflows, and the implementation of various machine learning algorithms. The repository features practical examples of classification, regression, and clustering tasks using Scikit-Learn, alongside tutorials for building and training deep learning architectures with TensorFlow. These include implementations of convolutional and recurrent networks. The content covers a broad range of capabili
Implements techniques for resolving missing tabular data through removal or statistical imputation.
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Fills missing data points using statistical or model-based imputation methods.
本项目是一个机器学习教育课程和学习平台,通过交互式 Jupyter Notebooks 提供。它作为掌握 Python 数据科学工具包的综合指南,为数值计算、表格数据操作和统计可视化提供结构化教程。 该课程包括 Scikit-Learn 的具体实现指南,以及关于构建、训练和部署神经网络及计算机视觉模型的 TensorFlow 实践课程。它涵盖了构建预测模型的端到端过程,从初始问题定义和任务分类,到通过交互式 Web 界面部署模型。 该项目涵盖了广泛的功能领域,包括多维数组的数值计算、探索性数据分析和数据预处理例程。它为监督和无监督学习、自动化机器学习流水线、超参数优化以及使用分类指标和交叉验证的模型评估提供了详细的工作流。 教育内容组织为一系列 Notebook,将 Python 代码与叙述性解释交织在一起,以记录数据科学工作流。
Employs techniques for replacing null entries using constant values or statistical measures like median imputation.
Concurrent Ruby is a comprehensive concurrency toolkit for the Ruby language that provides thread-safe data structures, synchronization primitives, and asynchronous execution patterns. It implements core concurrency abstractions including an actor model framework where isolated actors communicate through asynchronous message passing, a future and promise system for composing non-blocking operations, and thread pool executors that manage reusable worker threads for concurrent task execution. The library distinguishes itself through a broad set of coordination mechanisms that go beyond basic th
Returns a supplied default value when an optional container is empty.
Orange3 is a visual data mining platform that provides an interactive canvas for building data analysis workflows without writing code. At its core, it offers a widget-based visual programming environment where users connect configurable components to perform data preprocessing, machine learning model training, statistical evaluation, and interactive visualization. The platform is built on NumPy-backed data tables with domain descriptors that define variable names, types, and roles, and includes a lazy SQL query proxy for working with database tables without loading all data into memory. The
Provides a widget to detect and process missing entries using imputation or removal strategies.
This project provides a translated version of the scikit-learn machine learning library guides and API references for Chinese speakers. It serves as a localized knowledge base and technical reference for implementing predictive data analysis and statistical modeling using a Python-based toolkit. The resource covers the implementation of supervised learning, including classification and regression tasks, and unsupervised learning workflows for pattern discovery and anomaly detection. It also provides guidance on data science education, specifically focusing on the use of scikit-learn for machi
Explains techniques for filling missing data gaps using iterative estimators to maintain dataset integrity.
cuml is a GPU-accelerated machine learning library and framework that uses CUDA to accelerate tabular data preprocessing and model execution. It provides a suite of tools for training and deploying classification, regression, and clustering models on NVIDIA GPUs and GPU clusters. The library is designed for scalability, offering a distributed GPU machine learning environment that can spread computation and data across multiple hardware accelerators and nodes to handle datasets exceeding single-device memory. It mirrors standard estimator interfaces to allow the replacement of CPU-based models
Fills gaps in datasets using univariate imputation to complete missing data points.
r4ds 是一个数据科学课程和教育资源,专为精通 R 编程语言而设计。它为导入、整理、转换和可视化数据的端到端过程提供了结构化的学习路径。 该项目强调可重复的数据科学指南和全面的数据整理课程。它包括关于用于分层数据可视化的图形语法(grammar of graphics)的专业教程,以及使用 Quarto 创建的融合可执行代码与叙述性文本的技术出版物。 该材料涵盖了广泛的分析能力,包括来自不同来源的数据摄取、关系数据连接以及分类变量的管理。它还涉及数据清洗、数学建模以及多格式专业报告和演示文稿的生成。 该课程侧重于函数式编程和整洁数据(tidy data)原则的实际应用,以创建透明且可重复的分析。
Populates null entries by carrying the last observation forward or applying fixed default values.
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
Provides workflows for filling missing data using mean, median, or most frequent values.
该项目是一个综合性教育计划和深度学习框架,旨在通过 Notebook 和代码示例教授 PyTorch 深度学习实践。它作为一个用于构建、训练和部署神经网络的高级库,充当模型训练编排器,协调 PyTorch 模型、优化器和损失函数。 该项目为计算机视觉、自然语言处理和表格数据预处理提供了专门的工具包。它通过高级训练控制脱颖而出,例如判别式学习率、用于自定义训练逻辑的双向回调系统,以及自动化设备放置和训练循环的高级学习器抽象。 该框架涵盖了广泛的能力面,包括自动化数据流水线构建、模型架构分析以及跨分类、回归和分割任务的性能评估。它还包括用于跨多个 GPU 进行分布式训练的工具、用于内存优化的混合精度训练,以及对医学影像数据的专门支持。 该项目以一系列 Jupyter Notebook 的形式交付。
Provides imputation strategies to fill missing entries in continuous columns using medians, modes, or constants.
This is a pandas-based technical analysis library and financial feature engineering tool. It serves as a vectorized indicator calculator that transforms raw price and volume data into derived metrics for time series analysis. The library uses a NumPy-based engine to perform mathematical operations across entire arrays, avoiding iterative loops to maintain high performance. It organizes technical indicators into a modular class hierarchy with a consistent interface, allowing for bulk feature generation and the direct appending of results as new columns to a pandas DataFrame. The system covers
Provides configurable forward-fill and zero-fill strategies to handle calculation gaps in financial datasets.
这是一个面向 .NET 生态系统的科学计算框架,提供了一套全面的数值分析、统计和数学优化库。它作为开发机器学习、数字信号处理和计算机视觉应用的基础工具包。 该框架提供了用于训练和部署预测模型的专用工具包,包括神经网络、支持向量机和决策树。它还通过对实时视觉分析(如对象跟踪和面部特征检测)的深度集成,以及用于捕获和过滤音频及传感器信号的专用数字信号处理库而脱颖而出。 其功能范围扩展到高级矩阵分解和线性代数、概率状态建模和启发式搜索算法。它还涵盖了广泛的数据操作实用程序,从降维和归一化到空间数据组织和科学可视化组件。 该系统包括用于摄像机配置、GPIO 端口管理和专用深度传感硬件的硬件集成控制器。
Fills empty data entries using statistical measures or constant values to maintain dataset integrity.
This project is a collection of comprehensive guides and reference materials designed for technical interviews, machine learning system design, and professional development. It serves as a technical knowledge base and a career coaching manual, providing structured resources to help candidates navigate the machine learning hiring landscape. The resource distinguishes itself by offering detailed frameworks for comparing industry roles, analyzing company types, and planning long-term career progression. It provides specific guidance on evaluating employer organizational health, identifying resea
Fills or models absent data points while mitigating selection bias from imputation.
json_repair is a Python library that automatically fixes common JSON syntax errors, such as trailing commas, missing quotes, unclosed brackets, and stray text, producing valid JSON output. It can also complete broken structures by closing unclosed arrays and objects, and fill missing values with sensible defaults like empty strings or null. The library distinguishes itself by handling JSON from large language model outputs, stripping markdown fences, comments, and surrounding prose before parsing. It supports schema-guided repairs, using a JSON Schema to fill missing values, coerce data types
Fills missing JSON fields with sensible defaults like empty strings or null during repair.
Nixtla 是一个以基于 Transformer 的基础模型为中心的时序分析平台。它为预测和异常检测提供零样本推理,允许系统在无需模型重新训练的情况下预测新时序的未来值。 该项目专为大规模分析而设计,使用分布式推理扩展和预测并行化来处理数百万个数据序列。它支持微调适配以针对特定领域数据集调整预训练权重,并提供从本地执行和私有容器到作为 Snowflake 内存储过程集成等多种部署选项。 能力包括长周期和间歇性需求预测、假设场景分析以及预测不确定性量化。该系统还提供了一个完整的数据工程流水线,用于审计、清理和使用外生变量及基于日期的指标来丰富时序数据。 模型可靠性通过交叉验证回测、预测准确性验证以及用于超参数记录的实验跟踪来管理。
Handles target series containing NaN values by managing continuous timestamp sequences to maintain reliability.
该项目是一个针对 R 的高性能表格数据处理框架,旨在以内存效率和速度处理海量数据集。它提供了一种增强的数据结构,利用引用语义和就地修改来执行复杂的转换,而无需不必要的对象复制开销。 该库凭借其底层架构优化脱颖而出,包括多线程并行处理、基数排序和内存映射文件解析。通过将关键的数据操作和聚合例程卸载到编译后的 C 代码,它实现了对原本计算昂贵的任务的快速执行。其核心引擎支持高级关系操作,如非等值连接、滚动连接和重叠区间连接,以及用于加速重复数据访问的自动二级索引。 除了主要的处理功能外,该项目还提供了一套全面的数据生命周期管理工具。这包括具有自动类型检测的高速摄取和序列化工具,以及对时间序列分析和多维聚合的专门支持。该框架旨在实现可扩展性,允许用户在包含数十亿行的数据集上执行复杂的分组、过滤和重塑操作,同时保持系统稳定性和性能。
Fills missing data points by replacing them with the first available non-missing value from a set.
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
Replaces missing entries in continuous columns with computed values for tabular data preparation.