# ydataai/ydata-profiling

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/ydataai-ydata-profiling).**

13,388 stars · 1,766 forks · Python · MIT

## Links

- GitHub: https://github.com/ydataai/ydata-profiling
- Homepage: https://docs.sdk.ydata.ai
- awesome-repositories: https://awesome-repositories.com/repository/ydataai-ydata-profiling.md

## Topics

`big-data-analytics` `data-analysis` `data-exploration` `data-profiling` `data-quality` `data-science` `deep-learning` `eda` `exploration` `exploratory-data-analysis` `hacktoberfest` `html-report` `jupyter` `jupyter-notebook` `machine-learning` `pandas` `pandas-dataframe` `pandas-profiling` `python` `statistics`

## Description

Ydata-profiling is an automated exploratory data analysis framework that generates comprehensive statistical reports and visual summaries from dataframes. It serves as a diagnostic tool for assessing data quality by identifying missing values, duplicates, and outliers, and it provides a scalable engine for profiling massive datasets across distributed enterprise environments.
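
As a minimal sketch of typical usage, assuming the `ydata-profiling` and `pandas` packages are installed: the helper `build_profile` and its parameters are hypothetical names introduced here for illustration, while `ProfileReport`, its `title` and `minimal` arguments, and `to_file` belong to the library's public API.

```python
from typing import Any


def build_profile(rows: list[dict[str, Any]], title: str = "Profiling report", minimal: bool = True):
    """Build a ydata-profiling report object from a list of row dictionaries.

    `build_profile` is a hypothetical helper, not part of the library.
    """
    # Imported lazily so this module still loads where the optional
    # dependencies are not installed.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.DataFrame(rows)
    # minimal=True switches off the more expensive computations,
    # which is usually a sensible starting point for larger inputs.
    return ProfileReport(df, title=title, minimal=minimal)
```

Calling `build_profile(rows).to_file("report.html")` would then write a standalone HTML report; the output file name is only an example.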

The project distinguishes itself by handling large-scale data with distributed task orchestration and lazy stream processing, which together minimize memory overhead during complex computations. It supports sensitive data governance by identifying and masking personally identifiable information, which helps generated reports stay compliant with security standards. The framework also supports dataset drift detection: comparing multiple versions of a data collection pinpoints statistical shifts over time.
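
The idea of processing data in chunks to keep memory use low can be illustrated with a small standard-library sketch. It is not the library's implementation: the function names and the default chunk size are assumptions made for this example, and the running statistics use Welford's online update so that only one chunk is held in memory at a time.

```python
from typing import Iterable, Iterator


def chunks(values: Iterable[float], size: int) -> Iterator[list[float]]:
    """Yield successive lists of at most `size` values without materialising the input."""
    chunk: list[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def streaming_mean_variance(values: Iterable[float], chunk_size: int = 1000) -> tuple[int, float, float]:
    """Return (count, mean, population variance) computed one chunk at a time."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks(values, chunk_size):
        for x in chunk:
            # Welford's update: adjust the running mean and the running
            # sum of squared deviations without storing earlier values.
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance
```

Because the input can be any iterable, a generator that reads rows from disk could be passed in directly, so the full dataset never needs to fit in memory.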

Beyond its core profiling capabilities, the library offers a modular architecture that allows for schema-driven metadata enrichment and pluggable report rendering. Its data quality monitoring features include the analysis of temporal trends and the export of metrics into standard formats for integration with other analytical tools.

## Tags

### Data & Databases

- [Data Analysis & Visualization](https://awesome-repositories.com/f/data-databases/data-analysis-visualization.md) — Automates the generation of comprehensive statistical reports and visual summaries from tabular data to facilitate exploratory analysis.
- [Automated Exploratory Analysis](https://awesome-repositories.com/f/data-databases/data-analysis/automated-exploratory-analysis.md) — Provides an automated framework for discovering data distributions, correlations, and quality issues within large datasets.
- [Data Quality Frameworks](https://awesome-repositories.com/f/data-databases/data-quality-frameworks.md) — Monitors data quality by identifying missing values, duplicates, and outliers.
- [Distributed Data Processing](https://awesome-repositories.com/f/data-databases/distributed-data-processing.md) — Scales heavy computational analysis across multiple machines to profile massive datasets.
- [Data Quality Monitors](https://awesome-repositories.com/f/data-databases/data-pipelines/data-quality-monitors.md) — Detects missing values, duplicates, and outliers to ensure data quality. ([source](https://ydata-profiling.ydata.ai/docs/master/))
- [Dataset Comparators](https://awesome-repositories.com/f/data-databases/data-collections-datasets/dataset-comparators.md) — Pinpoints statistical shifts and inconsistencies by comparing multiple versions of data collections. ([source](https://ydata-profiling.ydata.ai/docs/master/))
- [Dataframe Visualizers](https://awesome-repositories.com/f/data-databases/data-engineering/data-visualization-libraries/dataframe-visualizers.md) — Generates comprehensive statistical reports and visual summaries directly from dataframes to identify patterns and quality issues.
- [Data Management & Governance](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-management-governance.md) — Identifies and masks personally identifiable information within datasets to ensure compliance with security and privacy governance standards.
- [Distributed Task Schedulers](https://awesome-repositories.com/f/data-databases/distributed-task-schedulers.md) — Orchestrates heavy data profiling workloads across distributed computing clusters to handle massive datasets.
- [Distributed Computing Engines](https://awesome-repositories.com/f/data-databases/data-engineering/distributed-compute-frameworks/distributed-computing-engines.md) — Scales data profiling tasks across distributed enterprise environments to handle massive datasets efficiently.
- [Distributed Computing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/distributed-processing-frameworks/distributed-computing.md) — Distributes heavy computational tasks across multiple machines to profile massive datasets. ([source](https://ydata-profiling.ydata.ai/docs/master/))
- [Stream Processing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/stream-processing-systems/stream-processing.md) — Processes large datasets in chunks to minimize memory overhead during complex statistical operations.
- [Data Enrichment](https://awesome-repositories.com/f/data-databases/data-enrichment.md) — Enriches datasets with custom business context and descriptive labels to improve report interpretability.
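
The missing-value, duplicate, and outlier checks named in the entries above can be illustrated with a short standard-library sketch. This is an assumption-laden illustration rather than the library's own logic: `quality_summary` is a hypothetical helper, duplicates are counted as exact repeated rows, and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics


def quality_summary(rows: list[dict[str, object]], column: str) -> dict[str, object]:
    """Count duplicate rows, missing values, and IQR outliers for one numeric column."""
    # Rows that are exact copies of another row count as duplicates.
    unique_rows = {tuple(sorted(row.items())) for row in rows}
    n_duplicate_rows = len(rows) - len(unique_rows)

    # A value is treated as missing when the key is absent or set to None.
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)

    outliers = []
    if len(present) >= 4:
        # Quartiles via the standard library; fences at 1.5 * IQR.
        q1, _, q3 = statistics.quantiles(present, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = [v for v in present if v < low or v > high]

    return {"duplicate_rows": n_duplicate_rows, "missing": n_missing, "outliers": outliers}
```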

### Business & Productivity Software

- [Statistics Report Exports](https://awesome-repositories.com/f/business-productivity-software/reporting-analytics-tools/statistics-report-exports.md) — Generates comprehensive statistical reports and visual summaries from raw data. ([source](https://ydata-profiling.ydata.ai/docs/master/))

### DevOps & Infrastructure

- [Data Drift Detectors](https://awesome-repositories.com/f/devops-infrastructure/infrastructure-as-code-alerting/drift-detection/data-drift-detectors.md) — Pinpoints statistical shifts and inconsistencies between data versions over time.
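
One common way to quantify a statistical shift between two versions of a numeric column is the population stability index (PSI). The sketch below is a swapped-in illustration of the general idea, not the method the library uses; the function name, the equal-width binning, and the small floor applied to empty bins are all assumptions of this example.

```python
import math


def population_stability_index(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Compare two samples using equal-width bins derived from the reference sample."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 when all reference values are identical.
    width = (hi - lo) / bins or 1.0

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor the proportions to avoid log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of zero, and the value grows as the two distributions move apart.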

### Security & Cryptography

- [Data Masking](https://awesome-repositories.com/f/security-cryptography/data-masking.md) — Automatically identifies and obscures sensitive information to ensure compliance with security standards.
- [Sensitive Content Obscuration](https://awesome-repositories.com/f/security-cryptography/sensitive-data-access-controls/sensitive-content-obscuration.md) — Protects personally identifiable information within datasets to ensure security compliance. ([source](https://ydata-profiling.ydata.ai/docs/master/))
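
As a rough illustration of identifying and obscuring sensitive values, the following sketch masks email addresses with a regular expression. It is a deliberately simple stand-in, not the library's detection logic: the pattern, the `mask_emails` helper, and the `[EMAIL]` placeholder are assumptions of this example, and real PII detection would need to cover many more categories.

```python
import re

# A pragmatic email pattern; it is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every email-like substring in `text` with `placeholder`."""
    return EMAIL_RE.sub(placeholder, text)


def mask_column(values: list[str]) -> list[str]:
    """Apply the mask to each value of a text column."""
    return [mask_emails(value) for value in values]
```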

### Software Engineering & Architecture

- [Large Dataset Optimizations](https://awesome-repositories.com/f/software-engineering-architecture/performance-reliability/performance-optimization/data-handling-throughput/large-dataset-optimizations.md) — Provides consistent quality insights across massive datasets in enterprise storage environments. ([source](https://ydata-profiling.ydata.ai/docs/master/))
- [Report Renderers](https://awesome-repositories.com/f/software-engineering-architecture/pluggable-backends/report-renderers.md) — Decouples calculation logic from visual presentation to support pluggable report rendering.
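
Decoupling calculation from presentation can be sketched with a small renderer registry: statistics are computed once as plain data, and the output format is chosen by name. Everything here, including the `register` decorator, the `RENDERERS` mapping, and the two formats, is a hypothetical design for illustration and does not mirror the library's internal classes.

```python
import json
from typing import Callable

Renderer = Callable[[dict[str, float]], str]

# Maps a format name to the function that renders it.
RENDERERS: dict[str, Renderer] = {}


def register(name: str) -> Callable[[Renderer], Renderer]:
    """Decorator that adds a renderer to the registry under `name`."""
    def decorator(func: Renderer) -> Renderer:
        RENDERERS[name] = func
        return func
    return decorator


@register("json")
def render_json(stats: dict[str, float]) -> str:
    return json.dumps(stats, sort_keys=True)


@register("text")
def render_text(stats: dict[str, float]) -> str:
    return "\n".join(f"{key}: {value}" for key, value in sorted(stats.items()))


def render(stats: dict[str, float], fmt: str) -> str:
    """Render precomputed statistics with the renderer registered as `fmt`."""
    return RENDERERS[fmt](stats)
```

Adding a new output format then only requires registering another function; the code that computes the statistics stays untouched.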

### Scientific & Mathematical Computing

- [Modular Analyzers](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/statistics-probability/statistical-analysis-libraries/statistical-metric-calculators/modular-analyzers.md) — Calculates descriptive metrics through a decoupled pipeline of independent analyzers.
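
A decoupled pipeline of independent analyzers can be pictured as a mapping from metric names to functions, each of which sees only the column values. The sketch below is an assumption about what such a design might look like, not the library's actual code; `ANALYZERS` and `describe` are hypothetical names.

```python
import statistics
from typing import Callable, Sequence

Analyzer = Callable[[Sequence[float]], float]

# Each analyzer is independent and can be added or removed without
# touching the others.
ANALYZERS: dict[str, Analyzer] = {
    "count": len,
    "mean": statistics.fmean,
    "min": min,
    "max": max,
}


def describe(values: Sequence[float], analyzers: dict[str, Analyzer] = ANALYZERS) -> dict[str, float]:
    """Run every analyzer over `values` and collect the results by metric name."""
    return {name: analyzer(values) for name, analyzer in analyzers.items()}
```

A custom metric can be supplied by passing a different mapping, for example `describe(values, {"range": lambda v: max(v) - min(v)})`.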

### System Administration & Monitoring

- [Metrics Exporters](https://awesome-repositories.com/f/system-administration-monitoring/metrics-exporters.md) — Exports calculated data statistics into standard formats for integration with other analytical tools. ([source](https://ydata-profiling.ydata.ai/docs/master/))
