16 repositories
Settings and strategies for handling data ingestion, including chunking and constraint management.
Distinguishing note: Focuses on the configuration of data ingestion pipelines rather than raw storage or database management.
CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations. The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
CrewAI manages how files are processed when they exceed provider constraints by selecting modes like strict, auto, or chunking.
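As a rough illustration of mode-based handling for oversized files, the sketch below uses hypothetical names (`Mode`, `prepare_file`, `chunk_size`) and is not CrewAI's API. The behavior assigned to each mode is an assumption: `strict` rejects the file, `auto` derives the chunk size from the provider limit, and `chunk` splits at a caller-supplied size.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    # Hypothetical mode names for illustration only.
    STRICT = "strict"
    AUTO = "auto"
    CHUNK = "chunk"


def prepare_file(
    data: bytes, max_bytes: int, mode: Mode, chunk_size: Optional[int] = None
) -> list[bytes]:
    """Return the pieces to send to a provider that accepts at most max_bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    if len(data) <= max_bytes:
        return [data]  # fits as-is, no handling needed
    if mode is Mode.STRICT:
        raise ValueError(f"file is {len(data)} bytes, limit is {max_bytes}")
    if mode is Mode.AUTO:
        size = max_bytes  # assumed: use the largest piece the provider allows
    else:
        size = max_bytes if chunk_size is None else chunk_size
        if not 0 < size <= max_bytes:
            raise ValueError("chunk_size must be between 1 and max_bytes")
    return [data[i:i + size] for i in range(0, len(data), size)]
```

For example, `prepare_file(b"abcdefghij", 4, Mode.AUTO)` yields three pieces of at most four bytes, while `Mode.STRICT` raises instead of splitting.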
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Configures data ingestion and cleaning rules to prepare raw datasets for conversational interaction.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Outputs processed data streams to external systems such as message queues, databases, or files.
Logstash is a JVM-based event processor and extract, transform, load system designed for log data processing pipelines. It functions as a plugin-based data ingestor that collects, transforms, and delivers logs and event data from multiple sources to various destinations. The system utilizes a modular architecture of interchangeable input, filter, and output components to handle real-time data ingestion and enterprise log aggregation. Users can extend the pipeline's functionality by developing custom plugins to support unique data sources or specific transformation logic. The platform covers
Routes processed events to target indices or external storage systems via destination connectors.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Establishes connections to target storage systems or databases to enable automated delivery of processed data.
Rete is a framework for building interactive, node-based visual interfaces and dataflow programming environments. It provides a core engine that processes directed graphs, allowing developers to define modular logic where nodes represent operations and connections represent the flow of data or control. By decoupling the graph logic from the user interface, the framework enables the creation of custom visual editors that can be integrated into various frontend component libraries. The project distinguishes itself through a highly extensible, signal-driven architecture that supports complex req
Provides hybrid execution models for processing data and control flow through node graphs.
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Supports running data integration tasks across various processing backends to optimize performance.
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment (IDE). It is designed to build, schedule, and monitor data workflows using a block-based pipeline design and an interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data flows. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and large-scale data processing. The system covers a broad range of data engineering capabilities, including ETL workflow automation, dbt model management, and data stream discovery. It provides tools for version control integration via Git, containerized deployment, and role-based access control to manage pipelines across development and production environments. Monitoring is handled through system performance telemetry and pipeline execution debugging.
Provides configuration interfaces to push processed datasets into target databases, warehouses, or cloud storage.
CloudQuery is a cloud infrastructure ETL tool and multi-cloud data pipeline designed to collect, synchronize, and normalize resource metadata from various cloud providers and SaaS platforms. It functions as a centralized asset inventory manager and security posture manager, extracting configuration and state data into relational databases, data lakes, or data warehouses. The system distinguishes itself by transforming complex, nested cloud API responses into flat relational tables, enabling the use of standard SQL for asset querying and analysis. It employs a modular plugin system for data ex
Implements driver-based adapters to establish connections and push metadata into various target storage systems and databases.
Cocoindex is an incremental data processing engine that builds and maintains live indexes for AI agents, with a core focus on codebase indexing and knowledge graph extraction. The engine uses a function-graph execution model where user-defined Python functions are composed into a directed acyclic graph, and it processes data incrementally so only changed source records or code paths are re-computed, avoiding full recomputation at any scale. It supports automatic schema inference from transformation pipeline type annotations and provides full data lineage tracing, tagging every output record wi
Exports indexed data to any destination including local files, cloud storage, or REST APIs.
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Supports running Hive queries on Apache Spark for accelerated performance.
dlt is a Python data ingestion tool and ETL pipeline framework designed to retrieve data from various sources and persist it in structured destinations. It functions as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources into lakehouses, warehouses, or vector databases. The project distinguishes itself through AI-powered pipeline generation, using large language models to scaffold extraction code and connectors for REST APIs. It also supports multimodal vector storage and specialized vector database population for AI and machine learning applications. The framework covers a broad range of capabilities, including automatic schema evolution, incremental data loading via state tracking, and data quality validation through data contract enforcement. It provides tools for relational data normalization, pre- and post-load transformations, and a variety of destination adapters for SQL databases and cloud object stores. Observability is handled through pipeline execution dashboards, column lineage tracking, and schema version verification using content-based hashes.
Provides connectors to write extracted data into SQL destinations such as Postgres, MySQL, and BigQuery.
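The idea of flattening nested JSON into relational tables and writing them to a SQL destination can be sketched with the standard library alone. This is not dlt's implementation: the helpers `flatten` and `load`, the `_id` and `_parent_id` columns, and the double-underscore naming are illustrative assumptions, and SQLite stands in for the destination.

```python
import sqlite3


def flatten(record, table, rows, parent_id=None):
    """Turn one nested record into rows keyed by table name."""
    # Assumed id scheme: a running count across all tables.
    row = {"_id": sum(len(r) for r in rows.values()) + 1}
    if parent_id is not None:
        row["_parent_id"] = parent_id
    rows.setdefault(table, []).append(row)
    _fill(record, "", row, table, rows)
    return rows


def _fill(obj, prefix, row, table, rows):
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested objects become prefixed columns on the same row.
            _fill(value, name + "__", row, table, rows)
        elif isinstance(value, list):
            # Lists become rows in a child table linked by _parent_id.
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                flatten(child, f"{table}__{name}", rows, parent_id=row["_id"])
        else:
            row[name] = value


def load(rows, conn):
    """Create one untyped SQLite table per key and insert its rows."""
    for table, table_rows in rows.items():
        columns = sorted({c for r in table_rows for c in r})
        col_sql = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({marks})',
            [tuple(r.get(c) for c in columns) for r in table_rows],
        )


rows = flatten({"id": 7, "user": {"name": "ada"}, "tags": ["x", "y"]}, "events", {})
conn = sqlite3.connect(":memory:")
load(rows, conn)
```

Here the sample record produces an `events` table with a `user__name` column and an `events__tags` child table with one row per tag.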
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed to collect, persist, and replicate real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and destinations. The platform distinguishes itself through its use of WebAssembly to run compiled modules for inline data transformation and filtering. This allows custom business logic to reshape data in motion without requiring a cluster restart. The system covers a broad range of capabilities, including connector-based data ingestion from external protocols, immutable log-structured storage with zero-copy I/O, and horizontal cluster scaling. It supports the creation of complex event-driven pipelines that use stateful processing, windowed aggregations, and partition-based data distribution. The engine can be deployed as a lightweight binary on various system architectures, including ARM64 IoT devices for edge data processing.
Ships configuration interfaces for establishing connections to external target storage systems and databases.
DevLake is a DevOps data platform and analytics tool designed to orchestrate data pipelines that ingest, transform, and sync metadata from external development tools into a unified database. It functions as a system for collecting and normalizing data from source control, CI/CD pipelines, and issue trackers into a standardized schema to enable consistent software delivery analytics. The platform distinguishes itself by transforming tool-specific data into a common domain model, allowing for the calculation of engineering metrics via SQL. It provides specialized frameworks for measuring DORA m
Offers a guided process for setting up ingestion parameters to automate how data is gathered from various sources.
Connector-X is a high-performance SQL data extraction library and bridge for transferring relational database records into memory-efficient data structures. It functions as a parallel database connector and federated query engine capable of executing and joining queries across multiple remote database connections to aggregate data locally. The project distinguishes itself through a zero-copy approach to data loading, which transfers SQL query results into memory structures without duplicating data. It maximizes throughput by partitioning SQL queries into threads, employing parallel columnar a
Allows the creation of new output formats by specifying memory allocation and data partitioning during the writing process.
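Partitioning a SQL query so that each range can be fetched by its own thread might look like the sketch below. It is a simplified illustration, not Connector-X's code: `partition_query` and `fetch_parallel` are hypothetical helpers, the partition column is assumed to be an integer with known bounds, and `run` is any caller-supplied function that executes one query and returns its rows.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_query(query, column, lo, hi, n):
    """Split `query` into at most n sub-queries over [lo, hi] on `column`."""
    if n <= 0:
        raise ValueError("n must be positive")
    if hi < lo:
        raise ValueError("hi must be >= lo")
    step = -(-(hi - lo + 1) // n)  # ceiling division: width of each range
    parts = []
    start = lo
    while start <= hi:
        end = min(start + step - 1, hi)
        parts.append(
            f"SELECT * FROM ({query}) AS part "
            f"WHERE {column} BETWEEN {start} AND {end}"
        )
        start = end + 1
    return parts


def fetch_parallel(queries, run):
    """Run each sub-query in its own thread and concatenate rows in order."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return [row for rows in pool.map(run, queries) for row in rows]
```

Calling `partition_query("SELECT id, amount FROM orders", "id", 1, 10, 3)` produces three sub-queries covering ids 1 to 4, 5 to 8, and 9 to 10, which `fetch_parallel` can then dispatch to worker threads.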