Dedupe

Dedupe is a machine learning tool for entity resolution that identifies and merges duplicate records in structured datasets. It uses active learning to train a matching model from human-labeled examples, learning which field-level similarities matter most for detecting duplicates without requiring manual rule writing. To keep matching efficient on large datasets, it uses fingerprint-based blocking to reduce the number of pairwise comparisons, then groups the scored record pairs into clusters using a configurable similarity threshold.
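The blocking and clustering steps described above can be illustrated with a minimal standard-library sketch. This is not Dedupe's implementation: the `fingerprint` key, the `similarity` function (a fixed string-similarity average standing in for the learned model), the field names, and the connected-components grouping are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def fingerprint(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def similarity(a, b):
    # Stand-in for a trained model: mean string similarity over two fields.
    fields = ("name", "city")
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def cluster(records, threshold=0.8):
    # 1. Blocking: only records sharing a fingerprint are ever compared.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[fingerprint(rec)].append(rid)

    # 2. Union-find structure used to merge records that score above threshold.
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # 3. Score candidate pairs within each block and merge the matches.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if similarity(records[a], records[b]) >= threshold:
                parent[find(a)] = find(b)

    # 4. Collect the resulting groups of record ids.
    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())


records = {
    1: {"name": "Acme Corp", "city": "Chicago"},
    2: {"name": "Acme Corp.", "city": "Chicago"},
    3: {"name": "Zenith Labs", "city": "Boston"},
}
clusters = cluster(records)
```

In this toy data, records 1 and 2 share a block and score above the threshold, while record 3 lands in a different block and is never compared with them. Raising `threshold` makes clusters smaller and more conservative; lowering it merges more aggressively.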

The tool provides multiple interfaces for different workflows. A command-line tool allows deduplicating or linking records in CSV files without writing any Python code, while a web-based service supports uploading data, training models, and reviewing results without local setup. Trained matching configurations and blocking rules can be saved to files for reuse across sessions without retraining. The system also supports cross-dataset record linking, matching records from separate data sources that refer to the same entity without requiring a shared unique identifier.
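Cross-dataset linking differs from deduplication in that candidate pairs are drawn only across the two sources, never within one. The sketch below illustrates that idea with the standard library alone; the `score` function, the field names, and the best-match-above-threshold rule are illustrative assumptions, not the tool's actual linking procedure.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in similarity: mean string similarity over two assumed fields.
    keys = ("name", "address")
    total = sum(
        SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio() for k in keys
    )
    return total / len(keys)


def link(left, right, threshold=0.85):
    # For each left record, keep its best-scoring right record if it clears
    # the threshold. No shared identifier between the sources is needed.
    if not right:
        return []
    links = []
    for lid, lrec in left.items():
        best_id, best = max(
            ((rid, score(lrec, rrec)) for rid, rrec in right.items()),
            key=lambda pair: pair[1],
        )
        if best >= threshold:
            links.append((lid, best_id))
    return links


left = {
    "a1": {"name": "Jane Doe", "address": "12 Oak St"},
    "a2": {"name": "Raj Patel", "address": "9 Elm Ave"},
}
right = {
    "b7": {"name": "Jane Doe", "address": "12 Oak Street"},
    "b8": {"name": "Maria Lopez", "address": "400 Pine Rd"},
}
links = link(left, right)
```

Here only `a1` finds a counterpart in the second source; `a2` has no sufficiently similar record and is left unlinked.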

Beyond core deduplication, the tool can construct canonical records from clusters by selecting the most common value for each field, match messy records against a clean reference dataset, and generate training data from already deduplicated datasets. Custom comparators, data types, and blocking rules can be added for domain-specific matching. The system is designed so that record comparisons scale to large datasets without requiring a powerful server.
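The "most common value per field" rule for canonical records can be sketched as follows. This is a minimal illustration of the rule as described above, not the tool's own code; the function name `canonical_record`, the handling of empty values, and the sample fields are assumptions.

```python
from collections import Counter


def canonical_record(cluster):
    # Build one representative record from a cluster of duplicates by taking
    # the most frequent non-empty value of each field.
    canonical = {}
    for field in cluster[0].keys():
        values = [rec[field] for rec in cluster if rec.get(field)]
        # An empty string marks a field with no usable value in the cluster.
        canonical[field] = Counter(values).most_common(1)[0][0] if values else ""
    return canonical


duplicates = [
    {"name": "Acme Corp", "phone": "555-0100", "city": "Chicago"},
    {"name": "Acme Corp", "phone": "", "city": "Chicago"},
    {"name": "ACME Corporation", "phone": "555-0100", "city": "Chicgo"},
]
merged = canonical_record(duplicates)
```

Each field is decided independently, so the canonical record may combine values from different source records, as it does here by taking the majority name and city while ignoring the blank phone entry.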

Features