9 repositories
Eliminating redundant data by identifying identical files through cryptographic content hashing.
Distinct from Asset Hashing and Deduplication: those closest categories focus on build artifacts or string matching, whereas this one covers file-level deduplication in an archive.
Explore 9 awesome GitHub repositories matching data & databases · Content-Based Deduplication.
Noms is a distributed version control database and content-addressable data store. It identifies data by cryptographic hashes to ensure integrity and deduplication, while tracking dataset state changes through a sequence of immutable commits to enable branching, forking, and historical recovery. The system functions as a peer-to-peer data synchronizer, reconciling state between disconnected database instances to ensure all nodes converge on the same data. It distinguishes itself as a schema-flexible document store that supports self-describing types, allowing schemas to evolve and widen as ne
Eliminates redundant data by identifying identical entries through cryptographic content hashing.
Horizon is an AI-powered news aggregation system designed for building custom pipelines that fetch, filter, and enrich information from diverse web sources. It uses large language models to automate information filtering and content scoring, removing noise and highlighting high-value stories. The system integrates the Model Context Protocol to expose pipeline stages as tools for external AI assistants. It uses a unified adapter to standardize diverse AI model providers for consistent content scoring and summarization tasks. The pipeline collects data from RSS feeds, social platforms, financial toolkits, and code repositories. It manages content through deduplication, quota-based category filtering, and contextual enrichment before delivering multilingual briefings via email, webhooks, or static site publishing. The workflow is orchestrated through recurring cloud automation that manages the scheduled collection and delivery of processed information.
Identifies and merges identical stories across multiple platforms using semantic content deduplication.
SD Maid SE is an Android storage optimization and system maintenance utility. It focuses on reclaiming disk space by analyzing storage usage and removing duplicate, orphaned, or unused files. The project distinguishes itself through the use of accessibility services to automate repetitive device tasks and manual file reviews by simulating user interactions. It also includes specialized tools for reducing the file size of images and videos through media compression. The system provides a broad range of capabilities including application lifecycle management to freeze or remove software, junk
Identifies and removes redundant data by generating unique file signatures via cryptographic hashing.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Removes boilerplate lines, templates, and copyright notices using global frequency analysis.
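The global frequency idea can be sketched as follows. This is a minimal illustration, not Data-Juicer's actual operator: it counts how many documents each line appears in and drops lines that occur in more than a chosen fraction of them. The function name `strip_frequent_lines` and the `max_doc_fraction` parameter are hypothetical.

```python
from collections import Counter

def strip_frequent_lines(docs, max_doc_fraction=0.5):
    """Drop lines that occur in more than max_doc_fraction of all documents."""
    # Count each distinct non-blank line at most once per document.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})

    limit = max_doc_fraction * len(docs)
    cleaned = []
    for doc in docs:
        # Keep non-blank lines that are rare enough across the corpus.
        kept = [line for line in doc.splitlines()
                if line.strip() and doc_freq[line.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned

docs = [
    "Copyright 2024 Example Corp\nAlpha article body",
    "Copyright 2024 Example Corp\nBeta article body",
    "Copyright 2024 Example Corp\nGamma article body",
]
cleaned = strip_frequent_lines(docs)
print(cleaned)
```

The copyright line appears in all three sample documents and is removed, while each body line appears once and survives. Note that this sketch also discards blank lines.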
go-fastdfs is a distributed file system and object storage server designed for building private cloud storage. It provides a FastDFS-compatible storage implementation that manages clusters of storage nodes to handle large-scale file uploads and downloads. The system focuses on high availability through a decentralized architecture that automatically synchronizes data and repairs failures across multiple machines without a central coordinator. It notably supports resumable file storage over HTTP, allowing large transfers to be paused and resumed from the last successful byte to cope with network instability. Core capabilities include optimizing storage resources through SHA1-based content deduplication and merging small files to reduce filesystem inode consumption. The project also includes an image processing pipeline that scales and dynamically resizes images at download time, and it secures file access using token-based authentication. The system can be deployed via Docker containers.
Uses SHA1 cryptographic hashing to identify and eliminate redundant identical files.
Papra is a self-hosted document management system designed for digital archiving, organization, and retrieval. It serves as a centralized platform for storing files with a focus on security, providing an encrypted file archive using AES-256-GCM and a programmatic interface for managing documents and metadata via a REST API, SDK, and command line tools. The system distinguishes itself through an automated document ingestion engine that imports files via email forwarding, monitored folders, and webhook listeners. It further enhances discoverability by acting as an OCR document indexer, extracti
Reduces storage waste by detecting identical files via content hashing and storing only one copy.
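Storing only one copy per distinct content can be sketched with a small content-addressed store. This is an illustration under stated assumptions, not Papra's implementation: the `DedupStore` class is hypothetical, it keeps everything in memory, and SHA-256 is an assumed choice because the description does not name the hash function.

```python
import hashlib

class DedupStore:
    """In-memory store that keeps one blob per distinct content hash."""

    def __init__(self):
        self.blobs = {}  # hex digest -> bytes, stored once per distinct content
        self.names = {}  # document name -> hex digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # setdefault leaves an existing blob untouched, so duplicates cost nothing.
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]

store = DedupStore()
store.put("invoice.pdf", b"same bytes")
store.put("invoice-copy.pdf", b"same bytes")
store.put("receipt.pdf", b"other bytes")
print(len(store.names), "names,", len(store.blobs), "stored blobs")
```

Three names end up pointing at two stored blobs, because the two invoice uploads share a digest.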
go-containerregistry is a Go library and toolkit for interacting with OCI and Docker registries. It provides a programmatic implementation of the Open Container Initiative distribution specification to fetch, upload, and manage container images, manifests, and layers. The library functions as a container image manipulation tool and a multi-platform image index manager. It enables the resolution and management of manifest lists that target various hardware architectures and operating systems without requiring a local daemon. The toolkit covers a broad range of registry interactions, including
Implements content-based deduplication using cryptographic hashes to identify identical image layers across registries.
This project is an automated content pipeline and AI article generator. It uses large language models to research topics from diverse web sources and academic repositories to generate evidence-based text and accompanying AI imagery for digital publishing. The system features a centralized social media management dashboard used to coordinate posting schedules, tone, and account positioning across multiple platforms. It employs a vector-based deduplicator to identify and remove redundant stories from the pipeline and uses topic clustering to rank content based on relevance. The work
Identifies and deletes duplicate stories using vector embeddings to prevent the same topic from appearing multiple times.
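Embedding-based deduplication of this kind is commonly done by comparing cosine similarity against a threshold. The sketch below assumes that approach; the project's actual method, embedding model, and threshold are not given in the description. The vectors are hand-written toy values standing in for real embeddings, and `dedupe_by_embedding` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedupe_by_embedding(items, threshold=0.9):
    """Keep an item only if it is not too similar to any already-kept item."""
    kept = []
    for title, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((title, vec))
    return [title for title, _ in kept]

# Toy vectors; a real pipeline would obtain these from an embedding model.
stories = [
    ("Rate cut announced", [0.9, 0.1, 0.0]),
    ("Central bank cuts rates", [0.88, 0.12, 0.01]),
    ("New phone released", [0.0, 0.2, 0.95]),
]
unique_titles = dedupe_by_embedding(stories)
print(unique_titles)
```

The first story seen in each near-duplicate cluster is kept, so the result depends on input order. Comparing every item against every kept item is quadratic; larger pipelines typically use an approximate nearest-neighbor index instead.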
fclones is a command-line tool designed to locate identical files across a filesystem by comparing file sizes and cryptographic hashes. It functions as a parallel filesystem scanner and a deduplication utility that identifies duplicate files to reclaim disk space. The tool distinguishes itself through a persistent hash cache system that stores hashes and metadata on disk to accelerate repeated scans. It employs a multi-phase scanning process and device-aware parallel I/O, which adjusts thread pools based on whether the storage is an SSD or HDD to maximize throughput. Beyond discovery, the pr
Eliminates redundant data by identifying identical files through cryptographic content hashing.
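Comparing sizes before hashes can be sketched as a two-step filter. This is a simplified illustration, not fclones' implementation: it omits the hash cache and device-aware parallel I/O described above, and SHA-256 is an assumed hash choice. The function name `find_duplicates` is hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def find_duplicates(paths):
    """Return lists of paths with identical content (size check, then SHA-256)."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in same_size:
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups

# Demo on throwaway files: a.txt and b.txt match; c.txt has the same size only.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.txt": b"hello", "b.txt": b"hello",
                "c.txt": b"world", "d.txt": b"longer text"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    duplicate_names = [[os.path.basename(p) for p in g]
                       for g in find_duplicates(paths)]
print(duplicate_names)
```

Grouping by size first means files with a unique size are never read, which is where most of the savings come from on a large tree.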