# dedupeio/dedupe

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/dedupeio-dedupe).**

4,442 stars · 570 forks · Python · MIT

## Links

- GitHub: https://github.com/dedupeio/dedupe
- Homepage: https://docs.dedupe.io
- awesome-repositories: https://awesome-repositories.com/repository/dedupeio-dedupe.md

## Topics

`clustering` `datamade` `de-duplicating` `dedupe` `dedupe-library` `entity-resolution` `python` `python-library` `record-linkage`

## Description

Dedupe is a machine learning tool for entity resolution that identifies and merges duplicate records in structured datasets. It uses active learning to train a matching model from human-labeled examples, learning which field-level similarities matter most for detecting duplicates, so no matching rules have to be written by hand. To keep matching efficient on large datasets, it uses fingerprint-based blocking: only records that share a blocking fingerprint are compared, which sharply reduces the number of pairwise comparisons. Scored record pairs are then grouped into clusters using a configurable similarity threshold.
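The blocking and clustering steps described above can be illustrated with a small standard-library sketch. This is not dedupe's code: the `fingerprint` rule, the `difflib`-based `score` function, and the connected-components grouping are all assumptions chosen for brevity. The library learns its blocking rules and field weights from labeled examples, and its own clustering method may differ.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = {
    1: {"name": "Acme Corp", "city": "Chicago"},
    2: {"name": "Acme Corporation", "city": "Chicago"},
    3: {"name": "Bolt Supplies", "city": "Denver"},
    4: {"name": "Bolt Supply", "city": "Denver"},
    5: {"name": "Acme Bakery", "city": "Boston"},
}


def fingerprint(record):
    # Hypothetical blocking key: first three letters of the name plus the city.
    return (record["name"][:3].lower(), record["city"].lower())


def candidate_pairs(records):
    # Only records that share a fingerprint are ever compared.
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[fingerprint(record)].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def score(a, b):
    # Stand-in for a learned model: plain string similarity on the name field.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster(records, threshold):
    # Union-find: merge any two records whose pair score reaches the threshold.
    parent = {record_id: record_id for record_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        if score(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for record_id in records:
        groups[find(record_id)].append(record_id)
    return sorted(sorted(group) for group in groups.values())


pairs = list(candidate_pairs(records))
clusters = cluster(records, threshold=0.6)
print(pairs)     # 2 candidate pairs instead of the 10 possible pairs
print(clusters)
```

With five records there are ten possible pairs, but blocking leaves only two to score. Raising `threshold` makes the grouping stricter; at a high enough value every record stays in its own cluster.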

The tool provides multiple interfaces for different workflows. A command-line tool allows deduplicating or linking records in CSV files without writing any Python code, while a web-based service supports uploading data, training models, and reviewing results without local setup. Trained matching configurations and blocking rules can be saved to files for reuse across sessions without retraining. The system also supports cross-dataset record linking, matching records from separate data sources that refer to the same entity without requiring a shared unique identifier.
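Linking two sources that share no unique identifier comes down to scoring cross-dataset pairs and keeping the best ones. The sketch below shows that idea with a one-to-one constraint. The datasets, the token-overlap `similarity` function, and the greedy `link_one_to_one` helper are hypothetical stand-ins, not part of dedupe, which learns its comparison from labeled pairs.

```python
def similarity(a, b):
    # Jaccard overlap of lowercase tokens; a fixed stand-in for a learned score.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def link_one_to_one(left, right, threshold):
    # Score every cross-dataset pair, then accept the best pairs first,
    # never reusing a record from either side.
    scored = sorted(
        ((similarity(lv, rv), lk, rk)
         for lk, lv in left.items()
         for rk, rv in right.items()),
        reverse=True,
    )
    links, used_right = {}, set()
    for pair_score, lk, rk in scored:
        if pair_score < threshold:
            break
        if lk not in links and rk not in used_right:
            links[lk] = rk
            used_right.add(rk)
    return links


crm = {
    "a1": "jane doe 12 oak st",
    "a2": "raj patel 4 elm ave",
}
billing = {
    "b1": "patel raj 4 elm ave",
    "b2": "jane m doe 12 oak st",
    "b3": "li wei 9 pine rd",
}

links = link_one_to_one(crm, billing, threshold=0.5)
print(links)
```

Here `b3` has no counterpart in `crm`, so it stays unlinked. A many-to-one variant would simply drop the `used_right` check.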

Beyond core deduplication, the tool can construct canonical records from clusters by selecting the most common value for each field, match messy records against a clean reference dataset, and generate training data from datasets that have already been deduplicated. Custom comparators, data types, and blocking rules can be added for domain-specific matching, and the record comparisons are designed to scale to large datasets without requiring a powerful server.
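As a rough illustration of the "most common value per field" rule for canonical records, the following sketch builds one representative record from a cluster. The `canonical_record` helper and the sample data are assumptions for this example only; the library's own canonicalization may choose values differently.

```python
from collections import Counter


def canonical_record(cluster):
    # For each field, keep the most frequent non-empty value in the cluster.
    canonical = {}
    for field in cluster[0]:
        values = [record[field] for record in cluster if record.get(field)]
        canonical[field] = Counter(values).most_common(1)[0][0] if values else None
    return canonical


duplicates = [
    {"name": "Acme Corp", "city": "Chicago", "phone": ""},
    {"name": "Acme Corp", "city": "chicago", "phone": "555-0100"},
    {"name": "ACME Corporation", "city": "Chicago", "phone": "555-0100"},
]

merged = canonical_record(duplicates)
print(merged)
```

Empty strings are skipped so that a missing phone number cannot outvote a real one. When values tie, `Counter.most_common` returns the value that appeared first in the cluster.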

## Tags

### Artificial Intelligence & ML

- [Entity Deduplication Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-tooling/entity-deduplication-tools.md) — Ships a machine learning tool that identifies and merges duplicate records in structured datasets using active learning.
- [Active Learning Training Workflows](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/active-learning-training-workflows.md) — Trains a machine learning model from human-labeled examples to identify and merge duplicate records in structured datasets. ([source](https://cdn.jsdelivr.net/gh/dedupeio/dedupe@main/README.md))
- [Entity Clustering and Canonicalization](https://awesome-repositories.com/f/artificial-intelligence-ml/clustering-tools/entity-clustering-and-canonicalization.md) — Uses learned similarity metrics to compare record pairs, group them into clusters, and construct canonical records.
- [Custom Deduplication Rule Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-deduplication-rule-trainers.md) — Learns optimal matching weights and blocking rules from human-labeled example pairs. ([source](https://docs.dedupe.io/))
- [Matching Model Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/training-algorithms/machine-learning-optimization/matching-model-optimization.md) — Presents the most uncertain record pairs for a user to label as match or distinct, incrementally improving the model. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Entity Resolution Pair Scorers](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-embeddings/sentence-embeddings/sentence-pair-scoring/entity-resolution-pair-scorers.md) — Computes a probability score for each pair of records indicating how likely they are to refer to the same entity. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Model Persistence](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/data-and-checkpointing/model-loading/model-persistence.md) — Saves trained matching configurations and blocking rules to a settings file for consistent reuse across sessions.
- [Blocking-Based Pair Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/paired-image-translation/training-pair-generation/blocking-based-pair-generators.md) — Yields pairs of records that share blocking fingerprints, reducing the number of comparisons needed for large datasets. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Training Data Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/training-data-generation.md) — Builds labeled training examples from an already deduplicated or linked dataset that has a common key identifying matches. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))

### Data & Databases

- [Entity Resolution](https://awesome-repositories.com/f/data-databases/entity-resolution.md) — Merges or links matched records into a single canonical representation, removing redundancy from a dataset. ([source](https://docs.dedupe.io/Examples.html))
- [Blocking-Based Record Matchers](https://awesome-repositories.com/f/data-databases/blocking-based-record-matchers.md) — Reduces the number of record comparisons in large datasets by using fingerprint-based blocking to improve performance.
- [Fuzzy Matching](https://awesome-repositories.com/f/data-databases/fuzzy-matching.md) — Compares pairs of records using learned fuzzy rules and groups those that likely refer to the same real-world entity. ([source](https://docs.dedupe.io/Examples.html))
- [Machine Learning Deduplicators](https://awesome-repositories.com/f/data-databases/fuzzy-matching/machine-learning-deduplicators.md) — Uses machine learning to identify and merge duplicate entries in structured data based on learned rules. ([source](https://docs.dedupe.io/))
- [Inter-Dataset Record Linkage](https://awesome-repositories.com/f/data-databases/inter-dataset-record-linkage.md) — Matches records from separate data sources that refer to the same real-world entity without shared unique identifiers.
- [Intra-Dataset Deduplication](https://awesome-repositories.com/f/data-databases/intra-dataset-deduplication.md) — Identifies records that refer to the same entity within one dataset and groups them into clusters with confidence scores. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Cross-Dataset Record Matchers](https://awesome-repositories.com/f/data-databases/large-scale-data-computation/scalable-record-matching/cross-dataset-record-matchers.md) — Finds pairs of records that refer to the same entity across two separate datasets, supporting one-to-one, many-to-one, and many-to-many constraints. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Duplicate Record Groupers](https://awesome-repositories.com/f/data-databases/large-scale-data-computation/scalable-record-matching/duplicate-record-groupers.md) — Identifies and groups records across a dataset that likely represent the same entity based on learned matching rules. ([source](https://docs.dedupe.io/_sources/index.rst.txt))
- [Record Clustering](https://awesome-repositories.com/f/data-databases/record-clustering.md) — Groups scored record pairs into clusters of records that all refer to the same entity. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Canonical Data Matching](https://awesome-repositories.com/f/data-databases/canonical-data-matching.md) — Indexes a clean reference dataset and searches messy records against it to find the best matching canonical records. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Canonical Record Generation](https://awesome-repositories.com/f/data-databases/canonical-record-generation.md) — Creates a single representative record from a cluster of duplicates by selecting the most common field values.
- [Canonical Record Synthesis](https://awesome-repositories.com/f/data-databases/canonical-record-synthesis.md) — Creates a single representative record from a cluster by selecting the most common value for each field.
- [Laptop-Scale Deduplication](https://awesome-repositories.com/f/data-databases/intra-dataset-deduplication/laptop-scale-deduplication.md) — Performs intelligent record comparisons that scale to large datasets without requiring a powerful server. ([source](https://docs.dedupe.io/))
- [Custom Data Comparators](https://awesome-repositories.com/f/data-databases/value-comparators/epsilon-based-type-comparators/custom-data-comparators.md) — Supports adding custom data types, string comparators, and blocking rules for domain-specific matching. ([source](https://docs.dedupe.io/))

### Part of an Awesome List

- [Iterative Human-in-the-Loop Labeling](https://awesome-repositories.com/f/awesome-lists/ai/active-learning/iterative-human-in-the-loop-labeling.md) — Presents uncertain record pairs in the console for user labeling to refine the matching model. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Terminal-Based Labeling Interfaces](https://awesome-repositories.com/f/awesome-lists/ai/active-learning/iterative-human-in-the-loop-labeling/terminal-based-labeling-interfaces.md) — Presents uncertain record pairs in the terminal for user labeling to incrementally improve the matching model.
- [Entity Resolution Model Checkpoints](https://awesome-repositories.com/f/awesome-lists/ai/pre-trained-models/entity-resolution-model-checkpoints.md) — Restores a previously trained matching model from a settings file for prediction without retraining. ([source](https://docs.dedupe.io/en/latest/API-documentation.html))
- [Natural Language Processing](https://awesome-repositories.com/f/awesome-lists/ai/natural-language-processing.md) — Library for fuzzy matching, record deduplication, and entity resolution.

### Development Tools & Productivity

- [CSV Deduplication Utilities](https://awesome-repositories.com/f/development-tools-productivity/csv-command-line-toolkits/csv-deduplication-utilities.md) — Runs a command-line tool that deduplicates or links records in CSV files without writing Python code. ([source](https://cdn.jsdelivr.net/gh/dedupeio/dedupe@main/README.md))
- [Active Learning Matchers](https://awesome-repositories.com/f/development-tools-productivity/example-based-matching/active-learning-matchers.md) — Trains matching rules from human-labeled examples to automate duplicate detection across datasets.
- [Field-Level Similarity Learners](https://awesome-repositories.com/f/development-tools-productivity/example-based-matching/active-learning-matchers/field-level-similarity-learners.md) — Learns from labeled examples which field-level similarities are most important for identifying duplicates in a specific dataset. ([source](https://docs.dedupe.io/how-it-works/How-it-works.html))

### Programming Languages & Runtimes

- [Learned Similarity Metrics](https://awesome-repositories.com/f/programming-languages-runtimes/programming-utilities/string-utilities/string-manipulators/edit-distance-calculators/string-similarity-metrics/learned-similarity-metrics.md) — Provides trainable similarity metrics that learn field-level weights from labeled examples for entity matching.

### Web Development

- [Fuzzy Entity Deduplicators](https://awesome-repositories.com/f/web-development/api-request-deduplication/rule-group-result-deduplications/profile-record-deduplicators/fuzzy-entity-deduplicators.md) — Identifies and merges entries that refer to the same real-world entity, even when names or addresses differ slightly. ([source](https://cdn.jsdelivr.net/gh/dedupeio/dedupe@main/README.md))
