Why is attic-labs/noms a recommended Content-Based Deduplication repository?

Eliminates redundant data by identifying identical entries through cryptographic content hashing.

Why is thysrael/horizon a recommended Content-Based Deduplication repository?

Identifies and merges identical stories across multiple platforms using semantic content deduplication.

Why is d4rken-org/sdmaid-se a recommended Content-Based Deduplication repository?

Identifies and removes redundant data by generating unique file signatures via cryptographic hashing.

Why is datajuicer/data-juicer a recommended Content-Based Deduplication repository?

Removes boilerplate lines, templates, and copyright notices using global frequency analysis.

Why is sjqzhang/go-fastdfs a recommended Content-Based Deduplication repository?

Uses SHA1 cryptographic hashing to identify and eliminate redundant identical files.

Why is papra-hq/papra a recommended Content-Based Deduplication repository?

Reduces storage waste by detecting identical files via content hashing and storing only one copy.

Why is google/go-containerregistry a recommended Content-Based Deduplication repository?

Implements content-based deduplication using cryptographic hashes to identify identical image layers across registries.

Why is openaispace/ai-trend-publish a recommended Content-Based Deduplication repository?

Identifies and deletes duplicate stories using vector embeddings to prevent the same topic from appearing multiple times.

9 repositories

Awesome GitHub Repositories · Content-Based Deduplication

Eliminating redundant data by identifying identical files through cryptographic content hashing.

Distinct from Asset Hashing and Deduplication: the closest candidate tags focus on build artifacts or string matching, whereas this tag is specifically for file-level deduplication in an archive.

Explore 9 awesome GitHub repositories matching data & databases · Content-Based Deduplication.


attic-labs/noms
7,422 stars
Noms is a distributed version control database and content-addressable data store. It identifies data by cryptographic hashes to ensure integrity and deduplication, while tracking dataset state changes through a sequence of immutable commits to enable branching, forking, and historical recovery. The system functions as a peer-to-peer data synchronizer, reconciling state between disconnected database instances to ensure all nodes converge on the same data. It distinguishes itself as a schema-flexible document store that supports self-describing types, allowing schemas to evolve and widen as ne
Eliminates redundant data by identifying identical entries through cryptographic content hashing.
Go
Thysrael/Horizon
7,357 stars
Horizon is an AI-powered news aggregation system designed to build custom pipelines that fetch, filter, and enrich information from diverse web sources. It uses large language models to automate information filtering and content scoring, removing noise and highlighting high-value stories. The system integrates the Model Context Protocol to expose pipeline stages as tools for external AI assistants. It uses a unified adapter to standardize diverse AI model providers for consistent content scoring and summarization. The pipeline collects data from RSS feeds, social platforms, financial toolkits, and code repositories. It manages content through deduplication, quota-based category filtering, and contextual enrichment before delivering multilingual briefings via email, webhooks, or static site deployment. The workflow is orchestrated through recurring cloud automation that manages the scheduled collection and delivery of processed information.
Identifies and merges identical stories across multiple platforms using semantic content deduplication.
Python
d4rken-org/sdmaid-se
6,995 stars
SD Maid SE is an Android storage optimization and system maintenance utility. It focuses on reclaiming disk space by analyzing storage usage and removing duplicate, orphaned, or unused files. The project distinguishes itself through the use of accessibility services to automate repetitive device tasks and manual file reviews by simulating user interactions. It also includes specialized tools for reducing the file size of images and videos through media compression. The system provides a broad range of capabilities including application lifecycle management to freeze or remove software, junk
Identifies and removes redundant data by generating unique file signatures via cryptographic hashing.
Kotlin
datajuicer/data-juicer
6,574 stars
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Removes boilerplate lines, templates, and copyright notices using global frequency analysis.
Python · data · data-analysis · data-pipeline
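The one-line summary above mentions removing boilerplate lines through global frequency analysis. A minimal sketch of that general idea follows; it is not Data-Juicer's operator or API. The function name `remove_frequent_lines` and the `max_doc_fraction` threshold are hypothetical names chosen for illustration, and the rule used here (drop any line that appears in more than a given fraction of documents) is an assumption.

```python
from collections import Counter


def remove_frequent_lines(docs, max_doc_fraction=0.5):
    """Drop lines that occur in more than max_doc_fraction of all documents.

    Hypothetical helper: a line is treated as boilerplate purely because it
    is repeated across many documents in the corpus.
    """
    # Count, for each non-empty stripped line, how many documents contain it.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})

    limit = max_doc_fraction * len(docs)
    cleaned = []
    for doc in docs:
        kept = [line for line in doc.splitlines() if doc_freq[line.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned
```

For example, a copyright notice repeated at the top of every document exceeds the threshold and is removed, while lines unique to each document are kept.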
sjqzhang/go-fastdfs
4,138 stars
go-fastdfs is a distributed file system and object storage server designed for building private cloud storage. It provides a FastDFS-compatible storage implementation that manages clusters of storage nodes to handle large-scale file uploads and downloads. The system focuses on high availability through a decentralized architecture that automatically synchronizes data and repairs failures across multiple machines without a central coordinator. It notably supports resumable file storage over HTTP, allowing large transfers to be paused and resumed from the last successful byte to cope with network instability. Core capabilities include optimizing storage resources through SHA1-based content deduplication and merging small files to reduce filesystem inode consumption. The project also includes an image processing pipeline that scales and resizes images dynamically at download time, and it secures file access using token-based authentication. The system can be deployed via Docker containers.
Uses SHA1 cryptographic hashing to identify and eliminate redundant identical files.
Go · breakpoint-resume · cloud-storage · cloudnative
papra-hq/papra
3,838 stars
Papra is a self-hosted document management system designed for digital archiving, organization, and retrieval. It serves as a centralized platform for storing files with a focus on security, providing an encrypted file archive using AES-256-GCM and a programmatic interface for managing documents and metadata via a REST API, SDK, and command line tools. The system distinguishes itself through an automated document ingestion engine that imports files via email forwarding, monitored folders, and webhook listeners. It further enhances discoverability by acting as an OCR document indexer, extracti
Reduces storage waste by detecting identical files via content hashing and storing only one copy.
TypeScript · app · archive · document
google/go-containerregistry
3,747 stars
go-containerregistry is a Go library and toolkit for interacting with OCI and Docker registries. It provides a programmatic implementation of the Open Container Initiative distribution specification to fetch, upload, and manage container images, manifests, and layers. The library functions as a container image manipulation tool and a multi-platform image index manager. It enables the resolution and management of manifest lists that target various hardware architectures and operating systems without requiring a local daemon. The toolkit covers a broad range of registry interactions, including
Implements content-based deduplication using cryptographic hashes to identify identical image layers across registries.
Go · container · container-registry · docker
OpenAISpace/ai-trend-publish
2,781 stars
This project is an automated content automation pipeline and AI article generator. It uses large language models to research topics from diverse web sources and academic repositories to generate evidence-based text and accompanying AI imagery for digital publishing. The system features a centralized social media management dashboard used to coordinate posting schedules, tone, and account positioning across multiple platforms. It employs a vector-based deduplicator to identify and remove redundant stories from the pipeline and uses topic clustering to rank content based on relevance. The work
Identifies and deletes duplicate stories using vector embeddings to prevent the same topic from appearing multiple times.
TypeScript · ai · weixin
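The description above refers to a vector-based deduplicator for stories. One common way to do this, sketched here in Python under stated assumptions, is to keep a story only if its embedding is not too similar to any story already kept. This is not the project's own code (the repository is TypeScript); `dedupe_by_embedding`, the `threshold` value, and the toy two-dimensional vectors in the usage note are all hypothetical. In practice the vectors would come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe_by_embedding(items, threshold=0.9):
    """Greedy near-duplicate filter over (title, vector) pairs.

    Keeps the first item of each group of near-duplicates: an item is kept
    only if its similarity to every previously kept item is below threshold.
    """
    kept = []
    for title, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((title, vec))
    return [title for title, _ in kept]
```

Called with `[("A", [1.0, 0.0]), ("A again", [0.99, 0.05]), ("B", [0.0, 1.0])]`, the second item is dropped as a near-duplicate of the first and the result is `["A", "B"]`. The greedy pass compares each item against all kept items, so it is quadratic in the worst case; larger pipelines typically use an approximate nearest-neighbor index instead.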
pkolaczk/fclones
2,633 stars
fclones is a command-line tool designed to locate identical files across a filesystem by comparing file sizes and cryptographic hashes. It functions as a parallel filesystem scanner and a deduplication utility that identifies duplicate files to reclaim disk space. The tool distinguishes itself through a persistent hash cache system that stores hashes and metadata on disk to accelerate repeated scans. It employs a multi-phase scanning process and device-aware parallel I/O, which adjusts thread pools based on whether the storage is an SSD or HDD to maximize throughput. Beyond discovery, the pr
Eliminates redundant data by identifying identical files through cryptographic content hashing.
Rust
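The description above mentions locating identical files by comparing file sizes and cryptographic hashes. A minimal Python sketch of that two-phase idea follows. It is not fclones itself (which is written in Rust and adds caching and device-aware parallelism); the names `file_digest` and `find_duplicates` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (sorted lists of paths) of files with identical content."""
    # Phase 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Phase 2: hash only the files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means files with a unique size are never read, which is where most of the I/O savings come from in this kind of scan.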

