27 repositories
Methods for filling gaps in datasets using scalar replacement or propagation.
Distinguishing note: Focuses on filling missing values rather than identification or removal.
Explore 27 GitHub repositories matching data & databases · Missing Data Imputation.
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Enables replacing missing values with scalars or propagating existing values to fill gaps.
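As a minimal sketch of the two strategies named above, the following uses pandas' `Series.fillna` for scalar replacement and `Series.ffill` for forward propagation. The sample values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative series with two gaps (the values are made up).
readings = pd.Series([1.0, np.nan, 3.0, np.nan])

# Scalar replacement: every missing entry becomes 0.0.
filled_scalar = readings.fillna(0.0)

# Propagation: each gap takes the last valid value that precedes it.
filled_forward = readings.ffill()

print(filled_scalar.tolist())
print(filled_forward.tolist())
```

Forward propagation cannot fill a gap at the very start of a series, since there is no earlier value to carry forward; a scalar fallback is one way to handle that case.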
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Replaces null values using literal values, computed expressions, or interpolation methods.
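The three strategies in this entry (literal, computed, interpolated) can be illustrated with a library-independent sketch in plain Python. This is not Polars' API: the function names `fill_literal`, `fill_computed`, and `fill_linear` are hypothetical, and a null is represented here as `None`.

```python
from typing import Callable, List, Optional

Column = List[Optional[float]]


def fill_literal(values: Column, literal: float) -> List[float]:
    """Replace every null with a fixed literal value."""
    return [literal if v is None else v for v in values]


def fill_computed(values: Column, compute: Callable[[List[float]], float]) -> List[float]:
    """Replace every null with a value computed from the non-null entries."""
    present = [v for v in values if v is not None]
    replacement = compute(present)
    return [replacement if v is None else v for v in values]


def fill_linear(values: Column) -> Column:
    """Linearly interpolate interior gaps; leading and trailing nulls stay None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


column = [None, 1.0, None, None, 4.0, None]
print(fill_literal(column, 0.0))
print(fill_computed(column, lambda xs: sum(xs) / len(xs)))
print(fill_linear(column))
```

Interpolation only has meaning between two known points, which is why the sketch leaves edge nulls untouched rather than extrapolating.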
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Handles incomplete records by imputing missing values with statistical estimates or converting gaps into indicator features.
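One way to picture both approaches, assuming a pandas workflow: the numeric gap is filled with the column mean after recording an indicator of where it was, and the categorical gap becomes its own indicator column through `pd.get_dummies(..., dummy_na=True)`. The data frame and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records; the column names are illustrative only.
houses = pd.DataFrame({
    "rooms": [3.0, np.nan, 4.0, 5.0],
    "roof": ["slate", None, "slate", "tile"],
})

# Indicator feature: record where the numeric value was missing before filling.
houses["rooms_missing"] = houses["rooms"].isna().astype(int)

# Statistical estimate: replace the numeric gap with the column mean.
houses["rooms"] = houses["rooms"].fillna(houses["rooms"].mean())

# Categorical gap: dummy_na=True gives "missing" its own indicator column.
encoded = pd.get_dummies(houses, columns=["roof"], dummy_na=True)

print(encoded.columns.tolist())
```

Creating the indicator before filling matters: once the mean has been written in, the information about which rows were originally incomplete is gone.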
Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models. The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
Fills gaps in continuous data columns using strategies like median or mode to ensure complete datasets.
Backtrader is a Python framework designed for the development, backtesting, and live execution of algorithmic trading strategies. It provides a comprehensive environment for quantitative finance, allowing users to simulate trading logic against historical market data or connect directly to brokerage platforms for automated real-time trading. The project distinguishes itself through a unified event-driven architecture that treats backtesting and live trading with the same API. This consistency is supported by a flexible data-feed abstraction layer that normalizes diverse financial sources, ena
Populates missing time intervals in financial data feeds using configurable price and volume values.
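The idea can be sketched in plain Python without relying on Backtrader's own classes. The function `fill_missing_bars`, its parameters, and the bar dictionary layout are all hypothetical; the assumption made here is that a gap is filled with the previous close unless a fixed price is configured, and with a configurable volume.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional

Bar = Dict[str, Any]


def fill_missing_bars(
    bars: List[Bar],
    step: timedelta,
    fill_price: Optional[float] = None,
    fill_volume: float = 0.0,
) -> List[Bar]:
    """Insert synthetic bars wherever consecutive bars are more than `step` apart."""
    filled: List[Bar] = []
    for bar in bars:
        if filled:
            expected = filled[-1]["time"] + step
            while expected < bar["time"]:
                # Assumption: default to the last known close when no price is configured.
                price = filled[-1]["close"] if fill_price is None else fill_price
                filled.append({"time": expected, "close": price, "volume": fill_volume})
                expected += step
        filled.append(bar)
    return filled


bars = [
    {"time": datetime(2024, 1, 2, 9, 30), "close": 100.0, "volume": 500.0},
    {"time": datetime(2024, 1, 2, 9, 33), "close": 101.5, "volume": 300.0},
]
for bar in fill_missing_bars(bars, timedelta(minutes=1)):
    print(bar["time"].time(), bar["close"], bar["volume"])
```

Zero volume is a natural default for synthetic bars because no trades occurred in the missing interval, but indicators that divide by volume may need a different value.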
This project is a machine learning algorithm reference and implementation guide that provides theoretical foundations and code for supervised learning, deep learning, and natural language processing. It serves as a comprehensive toolkit for implementing predictive models and a technical reference for algorithm engineering. The project focuses on ensemble learning frameworks, including the construction of decision trees, random forests, and gradient boosting models. It also functions as a probabilistic graphical model library and an NLP algorithm reference, with specific implementations for se
Fills missing data by iteratively estimating values based on classification path similarity within a forest.
Statsmodels is a comprehensive Python library designed for statistical modeling, econometric research, and data analysis. It provides a robust framework for estimating and diagnosing a wide range of statistical models, enabling users to perform rigorous hypothesis testing, regression analysis, and complex data exploration within structured environments. The library distinguishes itself through its support for advanced statistical methodologies, including state space representation for dynamic systems and generalized linear frameworks that accommodate non-normal response variables. It offers s
Fills gaps in datasets using multiple imputation methods to ensure data integrity.
This project is a framework for the efficient serialization and deserialization of data structures. It provides a unified, macro-based interface that automates the conversion of complex internal objects into standardized formats and reconstructs them from raw input streams or buffers. By leveraging compile-time code generation, the library minimizes manual implementation overhead while ensuring consistent logic across diverse data types. The framework distinguishes itself through a format-agnostic data model and a visitor-based parsing architecture that decouples data structures from specific
Automatically populates missing fields with default values during the deserialization process.
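As a rough analogue in Python (not the framework's own interface), a dataclass with declared defaults shows the same behavior: fields absent from the input are populated from their defaults when the object is reconstructed. `ServerConfig` and `from_json` are hypothetical names chosen for this sketch.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ServerConfig:
    host: str                                       # required: no default
    port: int = 8080                                # scalar default
    tags: List[str] = field(default_factory=list)   # fresh list per instance


def from_json(text: str) -> ServerConfig:
    """Build a ServerConfig, letting absent fields fall back to their defaults."""
    raw = json.loads(text)
    known = {f.name for f in fields(ServerConfig)}
    # Unknown keys are ignored in this sketch; a stricter parser could reject them.
    return ServerConfig(**{k: v for k, v in raw.items() if k in known})


config = from_json('{"host": "example.org"}')
print(config)
```

A field without a default stays mandatory: omitting `host` raises a `TypeError`, which keeps defaulting from silently hiding incomplete input.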
PyMC is a Bayesian probabilistic programming framework used for building probabilistic models and performing Bayesian inference. It provides a probabilistic graphical model library for specifying random variables, priors, and likelihood functions, supported by an MCMC sampling engine and variational inference tools to estimate posterior distributions. The framework features a GPU-accelerated inference backend that compiles models into machine code to increase execution speed. It utilizes a backend-agnostic tensor execution model and just-in-time graph compilation to optimize the computation o
Estimates missing values within datasets using probabilistic frameworks to maintain uncertainty.
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Fills gaps in extracted feature sets using specialized transformers to maintain compatibility with ML models.
This project is a Python financial analytics framework and quantitative trading library. It provides a suite of mathematical tools for asset pricing, statistical market analysis, and the development of algorithmic trading strategies. The library is distinguished by its focus on currency and commodity correlation modeling, using regression and normalization to identify exchange rate drivers. It features a specialized portfolio optimization engine that applies graph theory, such as clique centrality and degeneracy ordering, alongside quadratic programming to balance risk-adjusted returns. The
Fills gaps in pricing datasets by applying synthetic control methods based on similar economic entities.
Handles missing values natively in raw tabular input without requiring any preprocessing or imputation.
tsai is a deep learning library for time series classification, regression, and forecasting. Built on PyTorch and fastai, it provides a framework for labeling sequential data, predicting future values in univariate or multivariate sequences, and training representations on unlabeled data through self-supervised learning. The library is distinguished by its specialized temporal engineering and scaling capabilities. It includes cyclical time encoding tools to capture seasonal trends and online window slicing to process datasets that exceed available memory. It also supports multimodal input pipelines that combine static categorical features with dynamic continuous sequences. The toolkit covers a broad range of preprocessing and evaluation needs, including sliding-window segmentation, missing data imputation, and conversion of tabular dataframes into structured tensors. Model performance is evaluated through rolling cross-validation and feature importance analysis to ensure temporal consistency.
Fills gaps in sequential datasets using estimation techniques to ensure continuity for downstream modeling.
OSMnx is a Python library for downloading, modeling, and analyzing street networks and other geospatial features from OpenStreetMap. It lets users retrieve and work with real-world infrastructure data anywhere in the world, providing tools for network analysis, spatial queries, and visualization. The library offers capabilities for working with urban features such as building footprints, transit stops, and elevation data, as well as network statistics such as intersection density and circuity. It supports multiple travel modes, including driving, walking, and cycling, and can compute shortest paths, impute travel speeds, and generate isochrone maps. Additional features include geocoding, map matching, coordinate projection, and the ability to save and load networks in various formats. OSMnx provides tools for visualizing street networks and geospatial features as static maps or interactive web maps, and can plot figure-ground diagrams. The library is available through standard Python package installation methods.
Imputes missing travel speeds and calculates edge travel times for street network routing.
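A simplified, library-independent sketch of this kind of imputation follows. It assumes that an edge without a speed takes the mean speed of edges sharing its highway type, falling back to the overall mean, and that travel time is length divided by speed. The function `impute_speeds_and_times` and the edge dictionary keys are hypothetical and are not OSMnx's API.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

Edge = Dict[str, Any]


def impute_speeds_and_times(edges: List[Edge]) -> List[Edge]:
    """Fill missing speeds per highway type, then derive travel time in seconds."""
    by_type = defaultdict(list)
    for edge in edges:
        if edge.get("speed_kph") is not None:
            by_type[edge["highway"]].append(edge["speed_kph"])

    # Fallback for highway types with no observed speed at all.
    overall = mean(s for speeds in by_type.values() for s in speeds)

    result = []
    for edge in edges:
        speed = edge.get("speed_kph")
        if speed is None:
            same_type = by_type.get(edge["highway"])
            speed = mean(same_type) if same_type else overall
        # metres / (km/h) * 3.6 = seconds
        travel_time = edge["length_m"] * 3.6 / speed
        result.append({**edge, "speed_kph": speed, "travel_time_s": travel_time})
    return result


edges = [
    {"highway": "residential", "length_m": 100.0, "speed_kph": 30.0},
    {"highway": "residential", "length_m": 200.0, "speed_kph": None},
    {"highway": "primary", "length_m": 500.0, "speed_kph": 50.0},
    {"highway": "tertiary", "length_m": 100.0, "speed_kph": None},
]
for edge in impute_speeds_and_times(edges):
    print(edge["highway"], edge["speed_kph"], round(edge["travel_time_s"], 1))
```

Once every edge carries a travel time, it can serve as the edge weight in a shortest-path search, so routes are ranked by duration rather than distance.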
This project is a comprehensive educational resource on machine learning, presented as a series of tutorials in interactive Jupyter Notebooks. It offers practical Python implementations for the entire machine learning lifecycle, covering supervised and unsupervised learning, deep learning, and reinforcement learning. The resource is distinguished by detailed implementation guides for complex architectures, including transformers, generative adversarial networks (GANs), and convolutional neural networks. It also offers specialized courses on building reinforcement learning agents using Q-learning and Deep Q-Networks in simulated environments. The content covers a broad spectrum of data science capabilities, including data engineering pipelines, feature encoding, and dimensionality reduction. It provides extensive material on model evaluation through cross-validation and diagnostic metrics, as well as advanced topics such as natural language processing (NLP), sentiment analysis, and generative AI. The entire curriculum is designed for interactive execution in Jupyter Notebooks, combining executable code, rich text, and visualizations.
Provides methods for filling gaps in tabular datasets using scalar replacement or statistical propagation.
This project is a comprehensive collection of Python programming learning materials, including tutorials, exercises, and organized code examples. It serves as a learning curriculum and software engineering toolkit, using Jupyter Notebooks to combine executable code with descriptive educational text. The repository provides practical implementation guides for building large language model applications, such as retrieval-augmented generation systems, stateful AI agents, and machine learning workflows. It is distinguished by a structured approach to agentic coding workflows, covering context window distillation, provider-agnostic model routing, and schema-enforced structured outputs. The material covers a broad range of software engineering capabilities, including asynchronous programming with distributed task queues, web application development with REST APIs, and data analysis workflows. It also includes resources for mastering object-oriented design, implementing CI/CD pipelines, and applying professional linting and formatting standards.
Provides techniques for filling missing values in datasets using scalar replacement or propagation.
Vega-Lite is a high-level declarative language for specifying interactive, multi-view visualizations. It compiles a concise JSON specification into a full Vega visualization, automatically inferring scales, axes, and legends from encoding declarations. The grammar-of-graphics encoding maps data fields to visual channels such as position, color, size, and shape, while a multi-view composition grammar enables layered, faceted, concatenated, and repeated layouts. Reactive parameter binding links named parameters to input widgets, selections, and expressions for dynamic updates. The project suppo
Vega-Lite fills missing data values by generating new data points using a constant value or statistical methods within groups.
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
Implements methods for detecting and filling gaps in datasets using scalar replacement and interpolation.
Connexion is a specification-driven framework for building APIs that automatically maps OpenAPI specifications to application logic. It uses these specifications to automate routing, request validation, and response serialization, linking API operations to backend handler functions via operation IDs. The project differentiates itself by providing a schema-driven mock server that simulates API behavior using example responses from the specification without requiring backend logic. It also includes a dynamic documentation hosting system that translates the API specification into a live interact
Populates missing fields in incoming request bodies using default values specified in the API definition.
This project is a collection of comprehensive guides and reference materials designed for technical interviews, machine learning system design, and professional development. It serves as a technical knowledge base and a career coaching manual, providing structured resources to help candidates navigate the machine learning hiring landscape. The resource distinguishes itself by offering detailed frameworks for comparing industry roles, analyzing company types, and planning long-term career progression. It provides specific guidance on evaluating employer organizational health, identifying resea
Detects anomalous data points and decides whether to remove, cap, or transform them.