Why is wandb/wandb a recommended External Data References repository?

Links to files in external cloud buckets without uploading them, maintaining integrity through checksum validation.

Why is lancedb/lancedb a recommended External Data References repository?

Adds new columns to existing tables by joining with external data sources or using SQL expressions.

Why is hazelcast/hazelcast a recommended External Data References repository?

Registers metadata about external data sources to enable querying remote datasets through a unified interface.

Why is wireservice/csvkit a recommended External Data References repository?

Joins CSV files on common columns using command-line operations.

Why is maiot-io/zenml a recommended External Data References repository?

References and consumes external data by linking files or query results as managed artifacts.

Why is guelfoweb/knock a recommended External Data References repository?

Uses configuration files to map external API services as data sources for discovery logic.

Why is medialab/xan a recommended External Data References repository?

Combines rows from multiple CSV files using concatenation or join operations.
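
The difference between concatenating rows and joining on a common column, as mentioned for the CSV tools above, can be illustrated with a minimal standard-library sketch. This is not how csvkit or xan implement these operations; the helper names `read_rows`, `concat_rows`, and `inner_join` are hypothetical and exist only for this example.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def concat_rows(*tables):
    """Concatenation: stack the rows of several tables one after another."""
    out = []
    for table in tables:
        out.extend(table)
    return out


def inner_join(left, right, key):
    """Join: pair rows from both tables that share the same value in `key`."""
    # Index the right-hand table by the join column (hypothetical helper logic).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


people = read_rows("id,name\n1,ada\n2,bob\n")
scores = read_rows("id,score\n2,9\n3,7\n")

stacked = concat_rows(people, people)          # rows are appended
matched = inner_join(people, scores, "id")     # rows are paired on "id"
```

Concatenation grows the table in length, while a join widens matching rows with columns from the second file; rows without a partner are dropped in this inner-join variant.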

7 repositories

Awesome GitHub Repositories: External Data References

Mechanisms for linking to external cloud storage without duplicating data.

Distinct from External Data Integrations: focuses on referencing external storage via checksums rather than ingestion.
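
The idea of referencing data by checksum instead of ingesting it can be sketched with the Python standard library alone. This is an illustrative assumption about the general technique, not the API of any listed project; `make_reference` and `verify_reference` are hypothetical names, and a local file stands in for an object in a cloud bucket.

```python
import hashlib
from pathlib import Path


def make_reference(path):
    """Record where a file lives, plus its size and SHA-256 digest, without copying it."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return {
        "uri": path.resolve().as_uri(),
        "size": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }


def verify_reference(ref, path):
    """Return True if the file at `path` still matches the recorded reference."""
    current = make_reference(path)
    return current["size"] == ref["size"] and current["sha256"] == ref["sha256"]
```

Only the small reference record is stored by the tracking system; the data stays where it is, and a later `verify_reference` call detects whether the external file has changed since it was referenced.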

Explore 7 awesome GitHub repositories matching data & databases · External Data References. Refine with filters or upvote what's useful.

wandb/wandb
10,844 stars
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Links to files in external cloud buckets without uploading them, maintaining integrity through checksum validation.
Python, ai, collaboration, data-science
lancedb/lancedb
9,031 stars
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Adds new columns to existing tables by joining with external data sources or using SQL expressions.
HTML, approximate-nearest-neighbor-search, image-search, nearest-neighbor-search
hazelcast/hazelcast
6,570 stars
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Registers metadata about external data sources to enable querying remote datasets through a unified interface.
Java, big-data, caching, data-in-motion
wireservice/csvkit
6,390 stars
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
Joins CSV files on common columns using command-line operations.
Python
maiot-io/zenml
5,452 stars
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
References and consumes external data by linking files or query results as managed artifacts.
Python
guelfoweb/knock
4,163 stars
Knock is an attack surface management tool and DNS reconnaissance framework used to discover and map an organization's external infrastructure. It functions as a subdomain enumeration tool and HTTP security scanner for identifying reachable hosts and organizational assets. The project distinguishes itself through a hybrid passive-active enumeration strategy that combines external API lookups with active wordlist brute-forcing and DNS zone transfers. It includes a multi-stage validation pipeline that detects wildcard DNS records and verifies host connectivity to filter out false positives. The framework covers attack surface mapping, DNS security auditing, and vulnerability reconnaissance, including detection of legacy TLS protocols. Findings are managed through a searchable database and can be exported as HTML or JSON reports. Performance tuning options allow concurrency levels and network timeouts to be adjusted.
Uses configuration files to map external API services as data sources for discovery logic.
Python
medialab/xan
3,752 stars
Xan is a command-line tool and data transformation engine for processing CSV, TSV, and JSONL datasets. It functions as a processor for compressed files, enabling random access and seeking within gzipped and Zstd files, and serves as a converter for specialized bioinformatics data formats. The tool handles large datasets without requiring full memory loads by utilizing stream-based processing. It provides capabilities for merging, sorting, and deduplicating massive files, as well as converting data between various tabular formats. The project covers a broad range of data wrangling and analysi
Combines rows from multiple CSV files using concatenation or join operations.
Rust, cli, csv, rust
