16 مستودعات
Settings and strategies for handling data ingestion, including chunking and constraint management.
Distinguishing note: Focuses on the configuration of data ingestion pipelines rather than raw storage or database management.
Explore 16 awesome GitHub repositories matching data & databases · Data Processing Configurations. Refine with filters or upvote what's useful.
CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations. The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
CrewAI manages how files are processed when they exceed provider constraints by selecting modes like strict, auto, or chunking.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Configures data ingestion and cleaning rules to prepare raw datasets for conversational interaction.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Outputs processed data streams to external systems such as message queues, databases, or files.
Logstash is a JVM-based event processor and extract, transform, load system designed for log data processing pipelines. It functions as a plugin-based data ingestor that collects, transforms, and delivers logs and event data from multiple sources to various destinations. The system utilizes a modular architecture of interchangeable input, filter, and output components to handle real-time data ingestion and enterprise log aggregation. Users can extend the pipeline's functionality by developing custom plugins to support unique data sources or specific transformation logic. The platform covers
Routes processed events to target indices or external storage systems via destination connectors.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Establishes connections to target storage systems or databases to enable automated delivery of processed data.
Rete is a framework for building interactive, node-based visual interfaces and dataflow programming environments. It provides a core engine that processes directed graphs, allowing developers to define modular logic where nodes represent operations and connections represent the flow of data or control. By decoupling the graph logic from the user interface, the framework enables the creation of custom visual editors that can be integrated into various frontend component libraries. The project distinguishes itself through a highly extensible, signal-driven architecture that supports complex req
Provides hybrid execution models for processing data and control flow through node graphs.
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Supports running data integration tasks across various processing backends to optimize performance.
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
Provides configuration interfaces to push processed datasets into target databases, warehouses, or cloud storage.
CloudQuery is a cloud infrastructure ETL tool and multi-cloud data pipeline designed to collect, synchronize, and normalize resource metadata from various cloud providers and SaaS platforms. It functions as a centralized asset inventory manager and security posture manager, extracting configuration and state data into relational databases, data lakes, or data warehouses. The system distinguishes itself by transforming complex, nested cloud API responses into flat relational tables, enabling the use of standard SQL for asset querying and analysis. It employs a modular plugin system for data ex
Implements driver-based adapters to establish connections and push metadata into various target storage systems and databases.
Cocoindex is an incremental data processing engine that builds and maintains live indexes for AI agents, with a core focus on codebase indexing and knowledge graph extraction. The engine uses a function-graph execution model where user-defined Python functions are composed into a directed acyclic graph, and it processes data incrementally so only changed source records or code paths are re-computed, avoiding full recomputation at any scale. It supports automatic schema inference from transformation pipeline type annotations and provides full data lineage tracing, tagging every output record wi
Exports indexed data to any destination including local files, cloud storage, or REST APIs.
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Supports running Hive queries on Apache Spark for accelerated performance.
dlt هي أداة لاستيعاب البيانات بلغة Python وإطار عمل لخط أنابيب ETL مصمم لجلب البيانات من مصادر متنوعة وحفظها في وجهات مهيكلة. تعمل كمحرك لاستنتاج المخطط (schema inference) يكتشف تلقائياً أنواع البيانات ويسطح هياكل JSON المتداخلة في جداول علائقية، ناقلاً البيانات من المصادر إلى بحيرات البيانات، أو المستودعات، أو قواعد بيانات المتجهات. يتميز المشروع بتوليد خط أنابيب مدعوم بالذكاء الاصطناعي، باستخدام نماذج لغات كبيرة لسقالات كود الاستخراج والموصلات لـ REST APIs. كما يدعم تخزين المتجهات متعدد الوسائط والتعبئة المتخصصة لقواعد بيانات المتجهات لدعم تطبيقات الذكاء الاصطناعي والتعلم الآلي. يغطي إطار العمل مجموعة واسعة من القدرات بما في ذلك تطور المخطط المؤتمت، وتحميل البيانات التزايدي عبر تتبع الحالة، والتحقق من جودة البيانات من خلال فرض عقود البيانات. يوفر أدوات لتطبيع البيانات العلائقية، وتحويلات ما قبل وما بعد التحميل، ومجموعة متنوعة من محولات الوجهة لقواعد بيانات SQL ومخازن الكائنات السحابية. تتم إدارة المراقبة من خلال لوحات معلومات تنفيذ خط الأنابيب، وتتبع نسب الأعمدة، والتحقق من إصدار المخطط باستخدام التجزئات القائمة على المحتوى.
Provides connectors to write extracted data into relational databases like Postgres, MySQL, and BigQuery.
Fluvio هو منصة تدفق أحداث موزعة ومحرك تدفق سحابي أصلي مصمم لجمع وتخزين ونسخ تدفقات البيانات في الوقت الفعلي عبر مجموعة موزعة. يعمل كخط أنابيب بيانات في الوقت الفعلي لبناء سير عمل ذي حالة يقوم باستيعاب وإثراء وتصدير البيانات بين المصادر والمصارف الخارجية. تتميز المنصة باستخدام WebAssembly لتنفيذ وحدات مجمعة لتحويلات البيانات والفلترة المضمنة. يسمح هذا بتنفيذ منطق أعمال مخصص لإعادة تشكيل المعلومات أثناء الحركة دون الحاجة إلى إعادة تشغيل المجموعة. يغطي النظام مجموعة واسعة من القدرات بما في ذلك استيعاب البيانات القائم على الموصلات من بروتوكولات خارجية، وتخزين غير قابل للتغيير قائم على السجلات مع إدخال/إخراج بدون نسخ، وتوسيع المجموعة الأفقي. يدعم إنشاء خطوط أنابيب معقدة قائمة على الأحداث تستخدم المعالجة ذات الحالة، والتجميعات القائمة على النوافذ، وتوزيع البيانات القائم على التقسيم. يمكن نشر المحرك كثنائي خفيف الوزن على معماريات نظام متنوعة، بما في ذلك أجهزة ARM64 IoT لمعالجة بيانات الحافة.
Ships configuration interfaces for establishing connections to external target storage systems and databases.
DevLake is a DevOps data platform and analytics tool designed to orchestrate data pipelines that ingest, transform, and sync metadata from external development tools into a unified database. It functions as a system for collecting and normalizing data from source control, CI/CD pipelines, and issue trackers into a standardized schema to enable consistent software delivery analytics. The platform distinguishes itself by transforming tool-specific data into a common domain model, allowing for the calculation of engineering metrics via SQL. It provides specialized frameworks for measuring DORA m
Offers a guided process for setting up ingestion parameters to automate how data is gathered from various sources.
Connector-X is a high-performance SQL data extraction library and bridge for transferring relational database records into memory-efficient data structures. It functions as a parallel database connector and federated query engine capable of executing and joining queries across multiple remote database connections to aggregate data locally. The project distinguishes itself through a zero-copy approach to data loading, which transfers SQL query results into memory structures without duplicating data. It maximizes throughput by partitioning SQL queries into threads, employing parallel columnar a
Allows the creation of new output formats by specifying memory allocation and data partitioning during the writing process.