7 repositories
Mechanisms for linking to external cloud storage without duplicating data.
Distinct from External Data Integrations: focuses on referencing external storage via checksums rather than ingestion.
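As a rough illustration of referencing by checksum rather than ingesting, the sketch below records only a URI, a SHA-256 digest, and a size, then later checks that the remote object still matches. It is not taken from any of the listed projects: `ExternalRef`, `make_ref`, `verify_ref`, the `fetch` callable, and the `s3://example-bucket/...` URI are all hypothetical names, and an in-memory dict stands in for a cloud bucket. Real tools may rely on a checksum or ETag reported by the storage service instead of re-reading the bytes.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExternalRef:
    """Pointer to externally stored data; the data itself is never copied."""
    uri: str
    sha256: str
    size: int


def make_ref(uri: str, fetch: Callable[[str], bytes]) -> ExternalRef:
    # Read the bytes once to compute a digest, but keep only the metadata.
    data = fetch(uri)
    return ExternalRef(uri=uri, sha256=hashlib.sha256(data).hexdigest(), size=len(data))


def verify_ref(ref: ExternalRef, fetch: Callable[[str], bytes]) -> bool:
    # The reference is valid only if the remote bytes still hash to the stored digest.
    data = fetch(ref.uri)
    return len(data) == ref.size and hashlib.sha256(data).hexdigest() == ref.sha256


# Hypothetical stand-in for a cloud bucket (assumption for this sketch only).
fake_bucket = {"s3://example-bucket/train.csv": b"id,label\n1,cat\n2,dog\n"}
fetch = fake_bucket.__getitem__

ref = make_ref("s3://example-bucket/train.csv", fetch)
still_valid = verify_ref(ref, fetch)
```

If the object in the bucket changes after the reference is created, `verify_ref` returns `False`, which is how integrity can be detected without ever holding a second copy of the data.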
Explore 7 GitHub repositories matching Data & Databases · External Data References.
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Links to files in external cloud buckets without uploading them, maintaining integrity through checksum validation.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Adds new columns to existing tables by joining with external data sources or using SQL expressions.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Registers metadata about external data sources to enable querying remote datasets through a unified interface.
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
Joins CSV files on common columns using command-line operations.
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
References and consumes external data by linking files or query results as managed artifacts.
Knock is an attack surface management tool and DNS reconnaissance framework used to discover and map an organization's external infrastructure. It functions as a subdomain enumeration tool and HTTP security scanner for identifying reachable hosts and organizational assets. The project distinguishes itself through a hybrid passive-active enumeration strategy that combines external API lookups with active wordlist brute-forcing and DNS zone transfers. It includes a multi-stage validation pipeline that detects wildcard DNS records and verifies host connectivity to filter out false positives. The framework covers attack surface mapping, DNS security auditing, and vulnerability reconnaissance, including detection of legacy TLS protocols. Findings are managed through a searchable database and can be exported as HTML or JSON reports. Performance tuning options allow adjustment of concurrency levels and network timeouts.
Uses configuration files to map external API services as data sources for discovery logic.
Xan is a command-line tool and data transformation engine for processing CSV, TSV, and JSONL datasets. It functions as a processor for compressed files, enabling random access and seeking within gzipped and Zstd files, and serves as a converter for specialized bioinformatics data formats. The tool handles large datasets without requiring full memory loads by utilizing stream-based processing. It provides capabilities for merging, sorting, and deduplicating massive files, as well as converting data between various tabular formats. The project covers a broad range of data wrangling and analysi
Combines rows from multiple CSV files using concatenation or join operations.