16 repositories
Settings and strategies for handling data ingestion, including chunking and constraint management.
Distinguishing note: Focuses on the configuration of data ingestion pipelines rather than raw storage or database management.
Explore 16 awesome GitHub repositories matching data & databases · Data Processing Configurations.
CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations. The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
CrewAI manages how files are processed when they exceed provider constraints by selecting modes like strict, auto, or chunking.
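As a rough illustration of what mode-based handling of oversized files can look like, here is a minimal sketch. It is not CrewAI's API: the function `prepare_file`, its parameters, and the exact meaning given to each mode are assumptions made for this example (strict rejects oversized input, auto splits only when needed, chunking always splits).

```python
from typing import List, Optional


def prepare_file(data: bytes, max_bytes: int, mode: str = "auto",
                 chunk_size: Optional[int] = None) -> List[bytes]:
    """Return the pieces to send to a provider that accepts at most max_bytes each.

    Hypothetical helper; the mode semantics below are assumptions, not CrewAI behavior.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    fits = len(data) <= max_bytes

    if mode == "strict":
        # Assumed: strict mode refuses anything over the provider limit.
        if not fits:
            raise ValueError(f"file is {len(data)} bytes, limit is {max_bytes}")
        return [data]
    if mode == "auto":
        # Assumed: auto mode sends the file whole when it fits, else splits at the limit.
        if fits:
            return [data]
        size = max_bytes
    elif mode == "chunking":
        # Assumed: chunking mode always splits, never exceeding the provider limit.
        size = min(chunk_size or max_bytes, max_bytes)
        if size <= 0:
            raise ValueError("chunk_size must be positive")
    else:
        raise ValueError(f"unknown mode: {mode}")

    return [data[i:i + size] for i in range(0, len(data), size)]
```

A caller would pick the mode once per provider and pass every outgoing file through the same function, so the limit is enforced in one place.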
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Configures data ingestion and cleaning rules to prepare raw datasets for conversational interaction.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Outputs processed data streams to external systems such as message queues, databases, or files.
Logstash is a JVM-based event processor and extract, transform, load system designed for log data processing pipelines. It functions as a plugin-based data ingestor that collects, transforms, and delivers logs and event data from multiple sources to various destinations. The system utilizes a modular architecture of interchangeable input, filter, and output components to handle real-time data ingestion and enterprise log aggregation. Users can extend the pipeline's functionality by developing custom plugins to support unique data sources or specific transformation logic. The platform covers
Routes processed events to target indices or external storage systems via destination connectors.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Establishes connections to target storage systems or databases to enable automated delivery of processed data.
Rete is a framework for building interactive, node-based visual interfaces and dataflow programming environments. It provides a core engine that processes directed graphs, allowing developers to define modular logic where nodes represent operations and connections represent the flow of data or control. By decoupling the graph logic from the user interface, the framework enables the creation of custom visual editors that can be integrated into various frontend component libraries. The project distinguishes itself through a highly extensible, signal-driven architecture that supports complex req
Provides hybrid execution models for processing data and control flow through node graphs.
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Supports running data integration tasks across various processing backends to optimize performance.
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment (IDE). It is designed to build, schedule, and monitor data workflows using a block-based pipeline design and an interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to bring artificial intelligence into automated data flows. It also functions as an Apache Spark data processor, managing the kernels and infrastructure needed for high-volume analytics and large-scale data processing. The system covers a broad range of data engineering capabilities, including ETL workflow automation, dbt model management, and data flow discovery. It provides tools for version control integration through Git, containerized deployment, and role-based access control for managing pipelines across development and production environments. Monitoring is handled through system performance telemetry and pipeline execution debugging.
Provides configuration interfaces to push processed datasets into target databases, warehouses, or cloud storage.
CloudQuery is a cloud infrastructure ETL tool and multi-cloud data pipeline designed to collect, synchronize, and normalize resource metadata from various cloud providers and SaaS platforms. It functions as a centralized asset inventory manager and security posture manager, extracting configuration and state data into relational databases, data lakes, or data warehouses. The system distinguishes itself by transforming complex, nested cloud API responses into flat relational tables, enabling the use of standard SQL for asset querying and analysis. It employs a modular plugin system for data ex
Implements driver-based adapters to establish connections and push metadata into various target storage systems and databases.
Cocoindex is an incremental data processing engine that builds and maintains live indexes for AI agents, with a core focus on codebase indexing and knowledge graph extraction. The engine uses a function-graph execution model where user-defined Python functions are composed into a directed acyclic graph, and it processes data incrementally so only changed source records or code paths are re-computed, avoiding full recomputation at any scale. It supports automatic schema inference from transformation pipeline type annotations and provides full data lineage tracing, tagging every output record wi
Exports indexed data to any destination including local files, cloud storage, or REST APIs.
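The idea of recomputing only changed source records can be sketched with content fingerprints. This is a generic illustration under stated assumptions, not Cocoindex's API or internals: the class `IncrementalIndex`, its `sync` method, and the use of SHA-256 fingerprints are all hypothetical choices for this example.

```python
import hashlib
from typing import Callable, Dict


class IncrementalIndex:
    """Hypothetical index that re-runs a transform only for changed records."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self._transform = transform
        self._fingerprints: Dict[str, str] = {}  # record key -> content hash
        self.outputs: Dict[str, str] = {}        # record key -> derived value
        self.recomputed = 0                      # number of transform calls so far

    def sync(self, source: Dict[str, str]) -> None:
        # Drop outputs whose source record no longer exists.
        for key in list(self.outputs):
            if key not in source:
                del self.outputs[key]
                del self._fingerprints[key]

        for key, text in source.items():
            fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._fingerprints.get(key) == fingerprint:
                continue  # unchanged record: keep the cached output
            self.outputs[key] = self._transform(text)
            self._fingerprints[key] = fingerprint
            self.recomputed += 1
```

Calling `sync` repeatedly with the same source performs no transform calls after the first pass; editing one record triggers exactly one recomputation.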
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Supports running Hive queries on Apache Spark for accelerated performance.
dlt is a Python data ingestion tool and ETL pipeline framework designed to pull data from diverse sources and persist it in structured destinations. It functions as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources into lakehouses, data warehouses, or vector databases. The project stands out for AI-driven pipeline generation, using large language models to create extraction code and connectors for REST APIs. It also supports multimodal vector storage and specialized vector database population to serve AI and machine learning applications. The framework covers a broad range of capabilities, including automatic schema evolution, incremental data loading through state tracking, and data quality validation through the enforcement of data contracts. It provides tools for relational data normalization, pre- and post-load transformations, and a variety of destination adapters for SQL databases and cloud object stores. Observability is handled through pipeline execution dashboards, column lineage tracking, and schema version verification using content-based hashes.
Provides connectors to write extracted data into relational databases like Postgres, MySQL, and BigQuery.
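Flattening nested JSON into relational tables, as the description above mentions, can be pictured with a small standard-library sketch. It does not use dlt and is not dlt's implementation: the function `flatten`, the double-underscore naming, and the `_id` / `_parent_id` columns are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Optional

Tables = Dict[str, List[Dict[str, Any]]]


def flatten(record: Dict[str, Any], table: str, tables: Tables,
            parent_id: Optional[int] = None) -> Tables:
    """Split one nested record into flat rows, one list of rows per table.

    Hypothetical helper: nested dicts become prefixed columns, lists become
    child tables linked back to their parent row through _parent_id.
    """
    rows = tables.setdefault(table, [])
    row: Dict[str, Any] = {"_id": len(rows) + 1}
    if parent_id is not None:
        row["_parent_id"] = parent_id
    rows.append(row)

    def walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(f"{prefix}__{key}" if prefix else key, inner)
        elif isinstance(value, list):
            child_table = f"{table}__{prefix}"
            for item in value:
                # Scalars in a list get a single "value" column.
                child = item if isinstance(item, dict) else {"value": item}
                flatten(child, child_table, tables, parent_id=row["_id"])
        else:
            row[prefix] = value

    walk("", record)
    return tables


event = {"id": 7, "user": {"name": "Ana", "geo": {"city": "Lima"}}, "tags": ["a", "b"]}
tables = flatten(event, "events", {})
```

Here `tables["events"]` holds one row with the columns `id`, `user__name`, and `user__geo__city`, while `tables["events__tags"]` holds two child rows that point back to it.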
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed to collect, persist, and replicate real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and destinations. The platform distinguishes itself through its use of WebAssembly to run compiled modules for inline data transformation and filtering. This allows custom business logic to reshape data in motion without requiring a cluster restart. The system covers a broad range of capabilities, including connector-based data ingestion from external protocols, immutable log-structured storage with zero-copy I/O, and horizontal cluster scaling. It supports building complex event-driven pipelines that use stateful processing, windowed aggregations, and partition-based data distribution. The engine can be deployed as a lightweight binary on a variety of system architectures, including ARM64 IoT devices for edge data processing.
Ships configuration interfaces for establishing connections to external target storage systems and databases.
DevLake is a DevOps data platform and analytics tool designed to orchestrate data pipelines that ingest, transform, and sync metadata from external development tools into a unified database. It functions as a system for collecting and normalizing data from source control, CI/CD pipelines, and issue trackers into a standardized schema to enable consistent software delivery analytics. The platform distinguishes itself by transforming tool-specific data into a common domain model, allowing for the calculation of engineering metrics via SQL. It provides specialized frameworks for measuring DORA m
Offers a guided process for setting up ingestion parameters to automate how data is gathered from various sources.
Connector-X is a high-performance SQL data extraction library and bridge for transferring relational database records into memory-efficient data structures. It functions as a parallel database connector and federated query engine capable of executing and joining queries across multiple remote database connections to aggregate data locally. The project distinguishes itself through a zero-copy approach to data loading, which transfers SQL query results into memory structures without duplicating data. It maximizes throughput by partitioning SQL queries into threads, employing parallel columnar a
Allows the creation of new output formats by specifying memory allocation and data partitioning during the writing process.