27 repositories
Methods for filling gaps in datasets using scalar replacement or propagation.
Distinguishing note: Focuses on filling missing values rather than identification or removal.
Explore 27 awesome GitHub repositories matching data & databases · Missing Data Imputation.
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Enables replacing missing values with scalars or propagating existing values to fill gaps.
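The two strategies named above can be sketched with pandas' `fillna` and `ffill`. The series and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps.
readings = pd.Series([1.0, np.nan, 3.0, np.nan], name="reading")

# Scalar replacement: every missing value becomes the same constant.
filled_scalar = readings.fillna(0.0)

# Propagation: carry the last observed value forward into each gap.
filled_forward = readings.ffill()

print(filled_scalar.tolist())   # [1.0, 0.0, 3.0, 0.0]
print(filled_forward.tolist())  # [1.0, 1.0, 3.0, 3.0]
```

Forward filling only makes sense when row order is meaningful, for example in a time-ordered series.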
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Replaces null values using literal values, computed expressions, or interpolation methods.
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Handles incomplete records by imputing missing values with statistical estimates or converting gaps into indicator features.
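A minimal sketch of both ideas, written with pandas rather than taken from the project's notebooks. The `age` column and the `age_missing` indicator name are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing age.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# Indicator feature: record which rows were missing before filling them.
df["age_missing"] = df["age"].isna().astype(int)

# Statistical estimate: replace gaps with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())

print(df["age"].tolist())          # [20.0, 30.0, 40.0]
print(df["age_missing"].tolist())  # [0, 1, 0]
```

Building the indicator before filling matters, because the fill erases the information about which rows were incomplete.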
Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models. The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
Fills gaps in continuous data columns using strategies like median or mode to ensure complete datasets.
Backtrader is a Python framework designed for the development, backtesting, and live execution of algorithmic trading strategies. It provides a comprehensive environment for quantitative finance, allowing users to simulate trading logic against historical market data or connect directly to brokerage platforms for automated real-time trading. The project distinguishes itself through a unified event-driven architecture that treats backtesting and live trading with the same API. This consistency is supported by a flexible data-feed abstraction layer that normalizes diverse financial sources, ena
Populates missing time intervals in financial data feeds using configurable price and volume values.
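The idea of filling missing intervals can be sketched in pandas. This is not Backtrader's API; the bar data and the fill policy (repeat the last close, set volume to zero) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical one-minute bars with the 09:32 bar missing.
bars = pd.DataFrame(
    {"close": [100.0, 101.0, 103.0], "volume": [500, 700, 300]},
    index=pd.to_datetime(
        ["2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:33"]
    ),
)

# Rebuild the full one-minute grid so missing intervals appear as NaN rows.
full_index = pd.date_range(bars.index[0], bars.index[-1], freq="min")
filled = bars.reindex(full_index)

# Assumed fill policy: repeat the last close and treat volume as zero.
filled["close"] = filled["close"].ffill()
filled["volume"] = filled["volume"].fillna(0).astype(int)

print(filled["close"].tolist())   # [100.0, 101.0, 101.0, 103.0]
print(filled["volume"].tolist())  # [500, 700, 0, 300]
```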
This project is a machine learning algorithm reference and implementation guide that provides theoretical foundations and code for supervised learning, deep learning, and natural language processing. It serves as a comprehensive toolkit for implementing predictive models and a technical reference for algorithm engineering. The project focuses on ensemble learning frameworks, including the construction of decision trees, random forests, and gradient boosting models. It also functions as a probabilistic graphical model library and an NLP algorithm reference, with specific implementations for se
Fills missing data by iteratively estimating values based on classification path similarity within a forest.
Statsmodels is a comprehensive Python library designed for statistical modeling, econometric research, and data analysis. It provides a robust framework for estimating and diagnosing a wide range of statistical models, enabling users to perform rigorous hypothesis testing, regression analysis, and complex data exploration within structured environments. The library distinguishes itself through its support for advanced statistical methodologies, including state space representation for dynamic systems and generalized linear frameworks that accommodate non-normal response variables. It offers s
Fills gaps in datasets using multiple imputation methods to ensure data integrity.
This project is a framework for the efficient serialization and deserialization of data structures. It provides a unified, macro-based interface that automates the conversion of complex internal objects into standardized formats and reconstructs them from raw input streams or buffers. By leveraging compile-time code generation, the library minimizes manual implementation overhead while ensuring consistent logic across diverse data types. The framework distinguishes itself through a format-agnostic data model and a visitor-based parsing architecture that decouples data structures from specific
Automatically populates missing fields with default values during the deserialization process.
PyMC is a Bayesian probabilistic programming framework used for building probabilistic models and performing Bayesian inference. It provides a probabilistic graphical model library for specifying random variables, priors, and likelihood functions, supported by an MCMC sampling engine and variational inference tools to estimate posterior distributions. The framework features a GPU-accelerated inference backend that compiles models into machine code to increase execution speed. It utilizes a backend-agnostic tensor execution model and just-in-time graph compilation to optimize the computation o
Estimates missing values within datasets using probabilistic frameworks to maintain uncertainty.
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Fills gaps in extracted feature sets using specialized transformers to maintain compatibility with ML models.
This project is a Python financial analytics framework and quantitative trading library. It provides a suite of mathematical tools for asset pricing, statistical market analysis, and the development of algorithmic trading strategies. The library is distinguished by its focus on currency and commodity correlation modeling, using regression and normalization to identify exchange rate drivers. It features a specialized portfolio optimization engine that applies graph theory, such as clique centrality and degeneracy ordering, alongside quadratic programming to balance risk-adjusted returns. The
Fills gaps in pricing datasets by applying synthetic control methods based on similar economic entities.
Handles missing values natively in raw tabular input without requiring any preprocessing or imputation.
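tsai is a deep learning library for time series classification, regression, and forecasting. Built on PyTorch and fastai, it provides a framework for assigning labels to sequential data, forecasting future values of univariate or multivariate series, and training representations on unlabeled data through self-supervised learning. The library is distinguished by its specialized temporal feature engineering and scaling capabilities. It includes cyclical time-encoding tools for capturing seasonal patterns and online window slicing for datasets that exceed memory limits. It also supports multimodal input pipelines that combine static categorical features with dynamic continuous series. The toolkit covers a broad range of preprocessing and evaluation needs, including sliding-window splitting, missing data imputation, and conversion of tabular dataframes into structured tensors. Model performance is evaluated through walk-forward validation and feature importance analysis to ensure temporal consistency.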
Fills gaps in sequential datasets using estimation techniques to ensure continuity for downstream modeling.
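OSMnx is a Python library for downloading, modeling, and analyzing street networks and other geospatial features from OpenStreetMap. It lets users retrieve and process real-world infrastructure data from around the world, providing tools for network analysis, spatial queries, and visualization. The library works with urban features such as building footprints, transit stops, and elevation data, as well as network statistics such as intersection density and circuity. It supports multiple travel modes, including driving, walking, and cycling, and can compute shortest paths, impute travel speeds, and generate isochrone maps. Other capabilities include geocoding, map matching, coordinate projection, and saving and loading networks in various formats. OSMnx provides tools for visualizing street networks and geospatial features as static maps or interactive web maps, and can plot figure-ground diagrams. The library can be installed through standard Python packaging methods.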
Imputes missing travel speeds and calculates edge travel times for street network routing.
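One way to picture this step is a pandas sketch that fills each missing speed with the mean observed speed for the same road type, then derives travel time from length and speed. The edge table, its column names, and the per-type mean rule are assumptions for illustration; the library's own routing helpers are not called here.

```python
import numpy as np
import pandas as pd

# Hypothetical edge table: length in meters, speed in km/h (NaN = no tagged speed).
edges = pd.DataFrame({
    "highway": ["residential", "residential", "primary", "primary"],
    "length": [100.0, 200.0, 500.0, 1000.0],
    "speed_kph": [30.0, np.nan, 60.0, np.nan],
})

# Impute each missing speed with the mean observed speed of its road type.
type_mean = edges.groupby("highway")["speed_kph"].transform("mean")
edges["speed_kph"] = edges["speed_kph"].fillna(type_mean)

# Travel time in seconds: meters divided by meters per second.
edges["travel_time"] = edges["length"] / (edges["speed_kph"] / 3.6)

print(edges["speed_kph"].tolist())  # [30.0, 30.0, 60.0, 60.0]
```

With these numbers the 200 m residential edge takes about 24 seconds at the imputed 30 km/h.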
This project is a comprehensive machine learning educational resource and tutorial series delivered as a collection of interactive Jupyter Notebooks. It provides practical Python implementations for the end-to-end machine learning lifecycle, covering supervised and unsupervised learning, deep learning, and reinforcement learning. The resource distinguishes itself by providing detailed implementation guides for complex architectures, including transformers, generative adversarial networks, and convolutional neural networks. It also features specialized courseware for developing reinforcement l
Provides methods for filling gaps in tabular datasets using scalar replacement or statistical propagation.
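This project is a comprehensive collection of Python programming educational materials, including tutorials, exercises, and curated code examples. It serves as a learning curriculum and software engineering toolkit, using Jupyter Notebooks to combine executable code with descriptive educational text. The repository provides practical guides for building large language model applications, such as retrieval-augmented generation (RAG) systems, stateful AI agents, and machine learning workflows. It stands out by providing structured agentic coding workflows that cover context window distillation, provider-agnostic model routing, and schema-enforced structured output. The materials cover a broad range of software engineering skills, including asynchronous programming with distributed task queues, web application development with REST APIs, and data analysis workflows. It also includes resources for mastering object-oriented design, implementing CI/CD pipelines, and applying professional linting and formatting standards.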
Provides techniques for filling missing values in datasets using scalar replacement or propagation.
Vega-Lite is a high-level declarative language for specifying interactive, multi-view visualizations. It compiles a concise JSON specification into a full Vega visualization, automatically inferring scales, axes, and legends from encoding declarations. The grammar-of-graphics encoding maps data fields to visual channels such as position, color, size, and shape, while a multi-view composition grammar enables layered, faceted, concatenated, and repeated layouts. Reactive parameter binding links named parameters to input widgets, selections, and expressions for dynamic updates. The project suppo
Vega-Lite fills missing data values by generating new data points using a constant value or statistical methods within groups.
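The behavior described above, creating rows for absent points and then filling them, can be approximated in pandas. This is a sketch of the concept, not of Vega-Lite's JSON syntax; the field names `series`, `x`, and `y` are made up.

```python
import pandas as pd

# Hypothetical data: series "b" has no record for x = 2.
data = pd.DataFrame({
    "series": ["a", "a", "a", "b", "b"],
    "x": [1, 2, 3, 1, 3],
    "y": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Build every (group, key) combination, which creates rows for absent points.
grid = pd.MultiIndex.from_product(
    [data["series"].unique(), sorted(data["x"].unique())],
    names=["series", "x"],
)
completed = data.set_index(["series", "x"]).reindex(grid)

# Constant fill, then a per-group mean fill for comparison.
by_value = completed["y"].fillna(0.0)
group_mean = completed.groupby(level="series")["y"].transform("mean")
by_group_mean = completed["y"].fillna(group_mean)

print(by_value.loc[("b", 2)])       # 0.0
print(by_group_mean.loc[("b", 2)])  # 10.0
```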
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
Implements methods for detecting and filling gaps in datasets using scalar replacement and interpolation.
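Detection followed by interpolation might look like the following pandas sketch. The temperature values are hypothetical and the course's own notebooks may take a different approach.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced measurements with a two-value gap.
temps = pd.Series([10.0, np.nan, np.nan, 16.0])

# Detection: a boolean mask marks each missing position.
missing_mask = temps.isna()

# Interpolation: fill the gap along a straight line between its neighbors.
interpolated = temps.interpolate(method="linear")

print(int(missing_mask.sum()))  # 2
print(interpolated.tolist())    # [10.0, 12.0, 14.0, 16.0]
```

Linear interpolation in this form treats the rows as equally spaced, so it suits regularly sampled data.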
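Connexion is a specification-driven framework for building APIs that automatically maps OpenAPI specifications to application logic. It uses these specifications to automate routing, request validation, and response serialization, and links API operations to backend handler functions through operation IDs. The project distinguishes itself with a schema-driven mock server that uses example responses from the specification to simulate API behavior without backend logic. It also includes a dynamic documentation hosting system that turns the API specification into a live interactive console for exploring and testing endpoints. The framework covers a broad range of functionality, including security enforcement through middleware-based authentication and scope validation, pluggable request and response validation logic, and automatic injection of parameters into typed function arguments. It also provides utilities for application lifecycle management, custom middleware integration, and request mock testing. The project can be used to bootstrap standalone web applications or wrapped around existing frameworks to add specification-driven functionality.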
Populates missing fields in incoming request bodies using default values specified in the API definition.
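As a rough illustration of the idea, the sketch below fills absent top-level properties from `default` entries in a schema-like dictionary. The `apply_defaults` helper and the sample schema are hypothetical, and this is not Connexion's implementation.

```python
from typing import Any


def apply_defaults(body: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of body with absent properties filled from schema defaults."""
    filled = dict(body)
    for name, spec in schema.get("properties", {}).items():
        # Only fill fields the client omitted; never overwrite supplied values.
        if name not in filled and "default" in spec:
            filled[name] = spec["default"]
    return filled


# Hypothetical schema fragment in the style of an OpenAPI request body.
schema = {
    "properties": {
        "page": {"type": "integer", "default": 1},
        "limit": {"type": "integer", "default": 20},
        "query": {"type": "string"},
    }
}

request_body = {"query": "imputation", "limit": 50}
result = apply_defaults(request_body, schema)
print(result)  # {'query': 'imputation', 'limit': 50, 'page': 1}
```

The helper handles only top-level properties; nested objects would need a recursive version.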
This project is a collection of comprehensive guides and reference materials designed for technical interviews, machine learning system design, and professional development. It serves as a technical knowledge base and a career coaching manual, providing structured resources to help candidates navigate the machine learning hiring landscape. The resource distinguishes itself by offering detailed frameworks for comparing industry roles, analyzing company types, and planning long-term career progression. It provides specific guidance on evaluating employer organizational health, identifying resea
Detects anomalous data points and decides whether to remove, cap, or transform them.