14 repositories
Tools designed to systematically analyze tabular datasets to identify integrity issues and statistical anomalies.
Distinct from Data Observability Profilings: this category covers quality profiling of tabular data, not software quality profiles or media profiles.
Explore 14 awesome GitHub repositories matching data & databases · Data Quality Profilers.
This library provides a diagnostic toolkit for automated data profiling and exploratory analysis. It generates comprehensive statistical summaries and visual reports for tabular datasets, enabling users to identify distribution patterns, missing values, and quality anomalies through a unified interface. The project distinguishes itself by offering differential analysis, which allows for the comparison of two dataset versions to track structural and statistical changes over time. It supports large-scale data processing through lazy evaluation and provides interactive widgets that embed directl
Produces comprehensive statistical summaries and visual charts to detect quality problems and understand data distributions.
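The differential analysis mentioned in this entry can be illustrated with a minimal sketch. It is not the library's API: `compare_versions` and the sample rows are hypothetical, and the sketch assumes each dataset version is a list of dictionaries.

```python
from statistics import mean


def compare_versions(old_rows, new_rows):
    """Report structural and statistical differences between two dataset versions.

    Both arguments are lists of dicts (one dict per row). Hypothetical helper.
    """
    old_cols = set().union(*(r.keys() for r in old_rows))
    new_cols = set().union(*(r.keys() for r in new_rows))
    report = {
        "added_columns": sorted(new_cols - old_cols),
        "removed_columns": sorted(old_cols - new_cols),
        "row_count_change": len(new_rows) - len(old_rows),
        "mean_shift": {},
    }
    # Statistical change: difference of means for numeric columns present in both versions.
    for col in sorted(old_cols & new_cols):
        old_vals = [r[col] for r in old_rows if isinstance(r.get(col), (int, float))]
        new_vals = [r[col] for r in new_rows if isinstance(r.get(col), (int, float))]
        if old_vals and new_vals:
            report["mean_shift"][col] = mean(new_vals) - mean(old_vals)
    return report


# Illustrative data only.
old = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}]
new = [
    {"id": 1, "price": 12.0, "currency": "EUR"},
    {"id": 2, "price": 24.0, "currency": "EUR"},
    {"id": 3, "price": 30.0, "currency": "EUR"},
]
report = compare_versions(old, new)
print(report)
```

Separating structural changes (columns, row counts) from statistical ones (mean shifts) keeps the report readable when only one kind of drift occurs.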
This project is an exploratory data analysis framework and profiling tool designed to generate comprehensive statistical reports from Pandas and Spark DataFrames. It functions as a data quality profiler that identifies missing values, duplicates, and high correlations within tabular datasets. The tool distinguishes itself through specialized capabilities for time-series analysis, extracting temporal statistics, seasonality, and auto-correlation plots. It also includes a dataset comparison utility to identify structural or content changes between different versions of a dataset. The analysis
Identifies missing values, duplicates, and high correlations within large tabular datasets.
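As a rough illustration of the three checks named here (missing values, duplicates, high correlations), the following sketch uses only the Python standard library. It does not reproduce the project's own interface: `profile`, `pearson`, the 0.9 threshold, and the sample rows are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def profile(rows, columns, corr_threshold=0.9):
    """Count missing values and duplicate rows, and flag highly correlated column pairs."""
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    high_corr = []
    for a, b in combinations(columns, 2):
        # Keep only rows where both values are numeric.
        pairs = [
            (r[a], r[b])
            for r in rows
            if isinstance(r.get(a), (int, float)) and isinstance(r.get(b), (int, float))
        ]
        if len(pairs) >= 3:
            xs, ys = zip(*pairs)
            rho = pearson(xs, ys)
            if abs(rho) >= corr_threshold:
                high_corr.append((a, b, round(rho, 3)))
    return {"missing": missing, "duplicates": duplicates, "high_corr": high_corr}


# Illustrative data only.
rows = [
    {"height_cm": 150, "height_in": 59.1, "age": 30},
    {"height_cm": 160, "height_in": 63.0, "age": None},
    {"height_cm": 170, "height_in": 66.9, "age": 25},
    {"height_cm": 180, "height_in": 70.9, "age": 41},
    {"height_cm": 180, "height_in": 70.9, "age": 41},
]
result = profile(rows, ["height_cm", "height_in", "age"])
print(result)
```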
This project is an exploratory data analysis library and profiling tool for Pandas and Spark DataFrames. It automates the initial investigation of datasets by generating comprehensive descriptive analysis reports, statistical summaries, and data quality warnings. The system functions as a data quality profiler to detect missing values, duplicate rows, and type inconsistencies. It includes a dataset comparison tool for identifying structural and content shifts between different versions of the same data, as well as specialized tools for time-series analysis to calculate auto-correlation and se
Identifies missing values, duplicate rows, and type inconsistencies to ensure tabular dataset integrity.
DataHub is a metadata management system and data catalog platform designed to provide a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. It serves as a comprehensive framework for metadata management, incorporating a data governance framework to classify sensitive information and assign ownership for organizational accountability. The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides
Generates metadata profiles including schemas, data statistics, and technical documentation for individual datasets.
Ludwig is a declarative machine learning framework designed for training neural networks and large language models using configuration files instead of manual coding. It functions as a multimodal model builder and a low-code tool for supervised fine-tuning, allowing users to build models that process mixed inputs of text, images, audio, and tabular data. The project distinguishes itself through an automated hyperparameter optimizer and a system for large language model fine-tuning using parameter-efficient adapters. It features a multimodal data pipeline and the ability to automatically gener
Analyzes data frames for missing values and class imbalance to ensure data integrity before training.
This project is a plugin framework and agentic workflow library designed to connect large language models to professional toolstacks. It provides a system for integrating language models with external data warehouses, CRMs, and other enterprise software to retrieve and manipulate real-time business data. The framework enables the automation of specialized professional tasks through a file-based plugin definition system. It allows for the customization of domain expertise and plugin behavior to align with internal company processes, supported by an enterprise data connector that links models t
Profiles datasets to identify patterns and quality issues through automated data exploration.
Feast is a machine learning feature store and MLOps data infrastructure layer. It provides a centralized system for managing and serving features across offline training and online production environments, utilizing an online feature serving layer for low-latency retrieval. The project centers on a feature registry that acts as a central catalog for defining, governing, and discovering feature services. It employs a unified data access layer to decouple feature retrieval from physical storage and includes a point-in-time data generator to create historically accurate training datasets that pr
Includes integrated quality frameworks to profile and validate feature data to maintain overall data integrity.
Evidently is an AI observability platform and evaluation framework designed to quantify the performance of machine learning models and large language models. It functions as a monitoring tool for detecting data drift and quality degradation in tabular datasets, while providing a specialized analyzer for the faithfulness and correctness of retrieval augmented generation systems. The project distinguishes itself through an evaluation framework that utilizes judge models and custom rubrics to score language model outputs. It includes tools for iterative prompt optimization and the generation of
Analyzes tabular datasets for missing values and descriptive statistics to ensure input data integrity.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast generates a statistical profile of a dataset, capturing metrics like column means and quantiles for later validation.
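The idea of capturing a profile once and validating later data against it can be sketched as follows. This is not Feast's API: `build_profile`, `validate`, the column name, and the acceptance rule (batch mean must fall inside the reference interquartile range) are assumptions chosen for illustration.

```python
from statistics import mean, quantiles


def build_profile(rows, columns):
    """Capture the mean and quartiles of each numeric column from a reference dataset."""
    profile = {}
    for col in columns:
        values = [r[col] for r in rows]
        q1, median, q3 = quantiles(values, n=4)
        profile[col] = {"mean": mean(values), "q1": q1, "median": median, "q3": q3}
    return profile


def validate(rows, profile):
    """Return the columns whose batch mean falls outside the reference [q1, q3] range."""
    failures = []
    for col, ref in profile.items():
        batch_mean = mean([r[col] for r in rows])
        if not ref["q1"] <= batch_mean <= ref["q3"]:
            failures.append(col)
    return failures


# Illustrative data only.
reference = [{"latency_ms": v} for v in [10, 12, 14, 16, 18, 20, 22, 24]]
saved_profile = build_profile(reference, ["latency_ms"])

good_batch = [{"latency_ms": v} for v in [15, 17, 19]]
bad_batch = [{"latency_ms": v} for v in [40, 45, 50]]
print(validate(good_batch, saved_profile))
print(validate(bad_batch, saved_profile))
```

Storing the profile separately from the data means later batches can be checked without re-reading the reference dataset.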
RedPajama-Data is a toolkit for preprocessing the large-scale text datasets used to train large language models. It provides a preprocessing pipeline focused on cleaning, deduplicating, and scoring massive text collections to ensure data quality and diversity. The project uses a document quality scoring framework that applies machine learning and statistical heuristics to assess whether documents are suitable for training. It includes a dataset filtering pipeline that uses classifiers and blocklists to remove unwanted words or URLs. The system also has a text deduplication toolkit that eliminates redundant content using exact and fuzzy matching techniques. Together, these capabilities identify and remove duplicate or near-identical documents across a corpus.
Generates quality metrics and unique signatures to identify nearly identical content across a dataset.
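One common way to build signatures for near-duplicate detection is MinHash over word shingles. The listing does not say which technique the project uses, so the sketch below is an assumption throughout: the function names, the 3-word shingle size, and the 64-hash signature length are illustrative choices.

```python
import hashlib


def shingles(text, size=3):
    """Split text into a set of overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def signature(text, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    sig = []
    for seed in range(num_hashes):
        sig.append(
            min(
                int.from_bytes(
                    hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big"
                )
                for g in grams
            )
        )
    return sig


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Illustrative documents only.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river side"
doc_c = "feature stores serve precomputed values to online models at low latency"

sim_near = estimated_similarity(signature(doc_a), signature(doc_b))
sim_far = estimated_similarity(signature(doc_a), signature(doc_c))
print(sim_near, sim_far)
```

Signatures are fixed-length, so documents can be compared without holding their full text in memory.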
Amundsen is a data catalog and discovery platform that provides a centralized directory for indexing tables and dashboards. It functions as a metadata management system and search engine, allowing users to locate and understand available data assets across diverse distributed sources. The platform includes capabilities for data lineage tracking to map the origin and movement of datasets between systems. It also serves as a data profiling tool, calculating distribution and quality statistics for individual table columns to provide automated insights into the nature of the data. The system man
Calculates distribution and quality statistics for table columns to provide automated data quality insights.
Visual Insights is an automated exploratory data analysis platform and causal inference tool designed to discover patterns and cause-and-effect relationships within datasets. It functions as an interactive data visualization library that uses a grammar-of-graphics approach to generate multidimensional charts and dashboards. The project distinguishes itself through a natural language interface that translates plain-text questions into data answers and visualizations via a language model. It provides a specialized framework for causal discovery and inference, allowing users to identify links between variables through interactive causal graphs and to run what-if analyses to validate hypotheses. The platform covers a broad range of capabilities, including visual data cleaning, statistical profiling, and automated dataset transformation. It supports integrating diverse data from local files and remote databases, and has a high-performance processing engine for handling large datasets locally. In addition, the system allows interactive analysis components to be embedded in web applications and notebooks.
Generates summaries and statistical views of data sources to understand distribution and quality.
This project is a collection of reference materials and guidelines for implementing data auditing frameworks. It serves as a reference guide to data quality and a dataset validation handbook for identifying common structural and statistical errors in datasets. The project provides a structured knowledge base for data cleaning, presenting a catalog of real-world data errors and practical strategies for detecting and resolving them. It includes specific frameworks for assessing data provenance and the reliability of aggregated information. The material covers a broad range of data analysis capabilities, including statistical integrity validation to detect manipulation, sampling validity assessments to identify population bias, and methods for detecting structural errors such as encoding problems. It also describes processes for recovering tabular information from visual documents via optical character recognition (OCR).
References common real-world data errors and applies methods to resolve or mitigate those issues.
qsv is a high-performance command line toolkit for querying, transforming, and analyzing comma-separated value files. It functions as a data wrangling interface and a tabular data profiler, featuring a query engine capable of executing SQL statements and joins directly on flat files without requiring a database. The project is distinguished by its ability to process massive datasets that exceed available system memory. This is achieved through disk-based external memory processing, including multithreaded merge sorting, on-disk hash tables for deduplication, and lightweight file indexing for
Analyzes tabular datasets to calculate summary statistics, frequency distributions, and infer data schemas.
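The three outputs named here (summary statistics, frequency distributions, inferred schema) can be sketched for a small CSV with the Python standard library. This does not reproduce qsv's commands or output format: `infer_type`, `profile_csv`, the type labels, and the sample data are all hypothetical.

```python
import csv
import io
from collections import Counter
from statistics import mean


def infer_type(values):
    """Infer the narrowest type ("integer", "float", "string") that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "null"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return per-column type, empty-cell count, and either numeric stats or top frequencies."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    result = {}
    for col in reader.fieldnames:
        values = [r[col] for r in rows]
        col_type = infer_type(values)
        info = {"type": col_type, "empty": sum(1 for v in values if v == "")}
        if col_type in ("integer", "float"):
            nums = [float(v) for v in values if v != ""]
            info.update(min=min(nums), max=max(nums), mean=mean(nums))
        else:
            info["top"] = Counter(v for v in values if v != "").most_common(3)
        result[col] = info
    return result


# Illustrative data only.
sample = "color,count,weight\nred,3,1.5\nblue,5,2.0\nred,,0.5\n"
stats = profile_csv(sample)
print(stats)
```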