5 repositories
Unified storage repositories for managing large-scale analytical datasets.
Distinguishing note: Focuses on cloud-native integration for analytical processing.
Explore 5 awesome GitHub repositories matching data & databases · Data Lakes.
SeaweedFS is a distributed object store and high-performance file system designed to manage massive volumes of unstructured data. It utilizes a decoupled architecture that separates metadata management from raw data storage, allowing for independent scalability and the efficient handling of billions of files. By providing a POSIX-compliant interface, it enables applications to interact with a unified namespace while maintaining the performance characteristics of a distributed object store. The system distinguishes itself through a multi-region data fabric that supports active-active replicati
Manages large-scale analytical datasets in a unified storage layer for modern data processing.
Hub is a multimodal AI data lake and vector database designed for storing and querying embeddings, text, audio, and images. It functions as a dataset version control system and a machine learning data streaming engine to support large-scale model training. The system utilizes a serverless PostgreSQL vector store to index high-dimensional embeddings for semantic search. It provides a visual interface for inspecting multimodal datasets and viewing annotations such as bounding boxes and masks. The platform handles cloud-agnostic storage synchronization and implements lazy, compressed data strea
Provides a scalable multimodal data lake for organizing and retrieving large datasets for AI training.
DeepLake is AI data infrastructure consisting of a multimodal data lake, a hybrid search engine, and a serverless vector database. It provides a PostgreSQL-based AI data runtime that combines multimodal storage with streaming pipelines to load and shuffle datasets from cloud storage directly into deep learning training pipelines. The system utilizes lazy indexing to store and slice images, audio, and video without loading entire files into memory. It enables retrieval-augmented generation by persisting high-dimensional embeddings in a serverless vector store and implementing hybrid search tha
Manages multimodal AI data types optimized for deep learning using lazy loading to prevent memory overflow.
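As a rough illustration of the lazy loading idea mentioned above, here is a minimal Python sketch. It is not DeepLake's API: `LazyDataset`, its `loader` callback, and the key names are all hypothetical, chosen only to show how slicing can stay cheap while the actual bytes are read on demand.

```python
from typing import Callable, List


class LazyDataset:
    """Hypothetical lazily loaded dataset: a sample is read only when indexed."""

    def __init__(self, keys: List[str], loader: Callable[[str], bytes]):
        self._keys = keys      # lightweight references (e.g. object names)
        self._loader = loader  # assumed callback that fetches one sample

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing only narrows the key list; no sample is loaded yet.
            return LazyDataset(self._keys[index], self._loader)
        # Integer indexing triggers exactly one load.
        return self._loader(self._keys[index])
```

Under these assumptions, `ds[10:20]` returns another lightweight view, and memory is only spent when an individual item such as `ds[10:20][0]` is accessed.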
Scorecard is an open-source security scanner and software supply chain analysis tool that evaluates the security posture of projects by computing risk metrics based on best practices. It acts as a security health dashboard, surfacing security weaknesses through scores and badges to help maintainers identify vulnerabilities. The project provides a way to monitor repository security through a GitHub Action security auditor that alerts maintainers when security scores drop. It also offers remediation guidance, mapping identified security weaknesses to prescriptive instructions for improving development practices. The tool covers a broad range of capabilities, including open-source security auditing, CI/CD security automation, and analysis of third-party repositories to assess risk before integration. It supports several interfaces, including a command-line interface for scanning and a REST interface for retrieving precomputed security metrics.
Stores aggregated security scan results in a public BigQuery dataset for large-scale analysis.
SlateDB is a cloud-native key-value store and distributed database engine that utilizes a log-structured merge-tree architecture. It serves as a transactional storage layer designed to persist data directly to cloud object storage. The engine differentiates itself by optimizing read performance for remote storage through the use of bloom filters and multi-level block caching. It employs a single-writer multi-reader model and provides the ability to create zero-copy clones via copy-on-write checkpointing. The system supports atomic transactions, range queries, and snapshot-based concurrency c
Serves as a high-performance indexing layer for large-scale analytical datasets stored in cloud data lakes.
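To make the role of bloom filters in front of remote storage more concrete, the following is a generic Python sketch of the technique. It is not SlateDB's implementation or API: the `BloomFilter` class, its sizing defaults, and the `lookup` helper with its `fetch_remote` callback are assumptions introduced purely for illustration.

```python
import hashlib
from typing import Callable, Iterator, Optional


class BloomFilter:
    """Illustrative bloom filter; sizes and hash scheme are arbitrary choices."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def lookup(key: bytes, bloom: BloomFilter,
           fetch_remote: Callable[[bytes], Optional[bytes]]) -> Optional[bytes]:
    """Skip the (hypothetical) remote read when the filter rules the key out."""
    if not bloom.might_contain(key):
        return None
    return fetch_remote(key)
```

The point of the design is that a negative answer from the filter is always correct, so a lookup for a missing key can often avoid a round trip to object storage entirely, at the cost of occasional unnecessary reads on false positives.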