6 repositories
Mapping and linking identities across different datasets to create a unified view of entities.
Distinct from Entity Linking: this topic covers resolving identity across datasets, not NLP entity linking or ORM identity maps.
Explore 6 GitHub repositories matching Data & Databases · Entity Resolution.
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Maps and links identities across different datasets to create a unified view of entities.
Tiny Universe is an educational monorepo that delivers multiple independent implementations of core AI subsystems as self-contained Jupyter notebooks. It provides from-scratch constructions of foundational architectures including a complete Transformer model built from the original paper specification, a denoising diffusion probabilistic model for image generation, and a ReAct-style autonomous agent framework that equips an LLM with tools for planning and multi-step task execution. The project distinguishes itself by covering the full lifecycle of modern AI systems through hands-on implementa
Identifies and merges multiple references to the same real-world entity using LLM comparison.
Dedupe is a machine learning tool for entity resolution that identifies and merges duplicate records in structured datasets. It uses active learning to train a matching model from human-labeled examples, learning which field-level similarities are most important for detecting duplicates without requiring manual rule writing. The system combines fingerprint-based blocking to reduce pairwise comparisons, enabling efficient matching on large datasets, and groups scored record pairs into clusters using a configurable similarity threshold. The tool provides multiple interfaces for different workfl
Merges or links matched records into a single canonical representation, removing redundancy from a dataset.
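The blocking, pairwise scoring, and threshold clustering steps described for Dedupe can be illustrated with a small standard-library sketch. This is not Dedupe's API: the `block_key` rule, the `similarity` function (plain string similarity from `difflib` rather than a trained model), and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Hypothetical blocking rule: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def similarity(a, b):
    # Stand-in for a learned matching model: plain string similarity in [0, 1].
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster(records, threshold=0.8):
    # Union-find over record indices; records scoring >= threshold are joined.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Blocking: only records sharing a block key are compared pairwise.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    # Collect indices by their root to form the clusters.
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


records = [
    {"name": "Acme Corporation"},
    {"name": "Acme Corporatoin"},
    {"name": "Globex Inc"},
    {"name": "Acme Corp"},
]
clusters = cluster(records)
print(clusters)
```

With these sample records the two spellings of "Acme Corporation" land in one cluster, while "Acme Corp" shares their block but scores below the 0.8 threshold and stays separate. Lowering the threshold trades precision for recall.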
apollo-ios is a GraphQL client library for iOS and Apple platforms that enables type-safe network communication. It transforms GraphQL operations into generated Swift models, ensuring that network responses are validated at compile time to eliminate manual mapping. The library features a normalized cache manager that stores entities in a flat structure to maintain data consistency across different application views. It also optimizes network performance using hash-based persisted queries to reduce payload sizes and supports real-time data streaming via WebSockets or HTTP subscriptions. The p
Resolves entity references by using foreign keys from one API to trigger data resolution in another.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
Resolves variations of the same entity into a single canonical name using embedding-based blocking and comparison.
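As a rough illustration of embedding-based blocking followed by a comparison step, the sketch below uses character-bigram counts as a toy stand-in for a real embedding model and a caller-supplied `compare` function in place of a language-model judgment. None of these names come from docetl; `embed`, `candidate_pairs`, `resolve`, and `same_last_name` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    # Toy stand-in for an embedding model: counts of character bigrams.
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def candidate_pairs(names, blocking_threshold=0.5):
    # Blocking: keep only pairs whose vectors are similar enough to compare.
    vectors = {n: embed(n) for n in names}
    return [(a, b) for a, b in combinations(names, 2)
            if cosine(vectors[a], vectors[b]) >= blocking_threshold]


def resolve(names, compare, blocking_threshold=0.5):
    # Map every name to a canonical name (here: the earliest matching name).
    canonical = {n: n for n in names}
    for a, b in candidate_pairs(names, blocking_threshold):
        if compare(a, b):
            target, old = canonical[a], canonical[b]
            for n in names:
                if canonical[n] == old:
                    canonical[n] = target
    return canonical


def same_last_name(a, b):
    # Hypothetical comparison step; a real pipeline might ask an LLM instead.
    return a.split()[-1].lower() == b.split()[-1].lower()


names = ["Jon Smith", "John Smith", "Jane Doe"]
mapping = resolve(names, same_last_name)
print(mapping)
```

Blocking keeps the expensive comparison from running on every pair: here only "Jon Smith" and "John Smith" pass the similarity cut, so the comparison function is called once rather than three times.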
FalkorDB is a high-performance graph database management system and vector graph database. It serves as a knowledge graph construction tool and a GraphRAG knowledge store, integrating structured property graphs with vector search to provide grounded context for large language models. The engine is designed as a multi-tenant graph engine, capable of hosting thousands of isolated datasets within a single instance. The system distinguishes itself by using linear algebra for query execution, treating relationship tensors as matrix multiplications to achieve low-latency multi-hop traversals. It ut
Merges duplicate nodes representing the same real-world object to maintain a unified source of truth.
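Merging duplicate nodes generally means combining their properties and rewiring their relationships onto the surviving node. The sketch below shows that idea on a plain in-memory graph. It does not use FalkorDB's API or query language; the `merge_nodes` helper, the data layout, and the sample data are assumptions for illustration.

```python
def merge_nodes(nodes, edges, keep, drop):
    """Fold node `drop` into node `keep`.

    nodes: dict mapping node id -> property dict
    edges: set of (source id, relationship type, target id) tuples
    """
    # Combine properties; values already on the kept node take precedence.
    merged_props = {**nodes[drop], **nodes[keep]}
    new_nodes = {k: v for k, v in nodes.items() if k != drop}
    new_nodes[keep] = merged_props

    rewired = set()
    for src, rel, dst in edges:
        new_src = keep if src == drop else src
        new_dst = keep if dst == drop else dst
        # Skip self-loops that exist only because the two nodes were merged.
        if new_src == new_dst and src != dst:
            continue
        rewired.add((new_src, rel, new_dst))
    return new_nodes, rewired


nodes = {
    "p1": {"name": "Ada Lovelace"},
    "p2": {"name": "A. Lovelace", "born": 1815},
    "c1": {"name": "Analytical Engine"},
}
edges = {
    ("p1", "WROTE_ABOUT", "c1"),
    ("p2", "WROTE_ABOUT", "c1"),
    ("p2", "SAME_AS", "p1"),
}

merged_nodes, merged_edges = merge_nodes(nodes, edges, keep="p1", drop="p2")
print(merged_nodes["p1"])
print(merged_edges)
```

Storing edges in a set collapses the two `WROTE_ABOUT` relationships into one after rewiring, and the `SAME_AS` link disappears because both of its endpoints are now the same node.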