2 repositories
Generating hash values for serialized documents to detect changes or verify integrity.
Explore 2 GitHub repositories matching data & databases · Document Content Hashing.
ArduinoJson is a C++ library for parsing and manipulating JSON data and MessagePack binary streams on microcontrollers with limited memory and processing power. It provides the core primitives necessary for embedded data serialization and parsing, enabling devices to exchange structured data over serial or network interfaces. The library is distinguished by its focus on microcontroller memory management, employing strategies such as pool-based allocation, string deduplication, and non-owning string views to minimize RAM usage. It further optimizes for constrained environments by allowing cons
Generates a hash of a serialized JSON document for integrity checks or change detection.
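As an illustration of the idea only, here is a minimal Python sketch of hashing a serialized JSON document for change detection. It is not ArduinoJson's API: the function name `document_hash`, the choice of SHA-256, and the canonical serialization (sorted keys, compact separators) are all assumptions made for this example.

```python
import hashlib
import json


def document_hash(doc) -> str:
    """Return a hex SHA-256 digest of a canonical JSON serialization.

    Hypothetical helper: keys are sorted and whitespace is removed so that
    two documents with the same content serialize to the same bytes.
    """
    canonical = json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Sample documents (made up for the example).
before = {"sensor": "gps", "time": 1351824120}
reordered = {"time": 1351824120, "sensor": "gps"}
changed = {"sensor": "gps", "time": 1351824121}

unchanged_detected = document_hash(before) == document_hash(reordered)
change_detected = document_hash(before) != document_hash(changed)
```

Serializing canonically before hashing matters: without sorted keys, two documents with identical content but different key order would produce different digests and be reported as changed.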
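RedPajama-Data is a toolkit for preprocessing the large-scale text datasets used to train large language models. It provides a preprocessing pipeline focused on cleaning, deduplicating, and scoring massive text collections to ensure data quality and diversity. The project uses a document quality scoring framework that applies machine-learning and statistical heuristics to assess whether a document is suitable for training. It includes a dataset filtering pipeline that uses classifiers and blocklists to remove undesirable words or URLs. The system also has a text deduplication toolkit that uses exact and fuzzy matching to eliminate redundant content. These capabilities make it possible to identify and remove duplicate or near-identical documents from a corpus.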
Generates unique fingerprints for documents to detect redundancy and track content across different data sources.
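The exact-match case of document fingerprinting can be sketched as follows. This is an illustrative example, not RedPajama-Data's actual code: the helper names `fingerprint` and `group_by_fingerprint`, the normalization steps, and the use of SHA-1 are assumptions chosen for brevity. Fuzzy matching (near-duplicates) would need a different technique and is not covered here.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash a normalized form of the text (hypothetical helper).

    Normalization here: Unicode NFKC, lowercase, collapse whitespace.
    """
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def group_by_fingerprint(docs):
    """Map each fingerprint to the list of sources that contain it."""
    groups = defaultdict(list)
    for source, text in docs:
        groups[fingerprint(text)].append(source)
    return dict(groups)


# Sample corpus (made up): the first two entries differ only in case
# and whitespace, so they should share a fingerprint.
corpus = [
    ("crawl-a", "The quick brown fox."),
    ("crawl-b", "the  quick\nbrown fox. "),
    ("crawl-c", "A different document."),
]

groups = group_by_fingerprint(corpus)
duplicates = {fp: srcs for fp, srcs in groups.items() if len(srcs) > 1}
```

Keeping the source names alongside each fingerprint is what allows the same content to be tracked across data sources, rather than only dropped.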