16 repositories

Awesome GitHub Repositories: Data Processing Configurations

Settings and strategies for handling data ingestion, including chunking and constraint management.

Distinguishing note: Focuses on the configuration of data ingestion pipelines rather than raw storage or database management.

Explore 16 awesome GitHub repositories matching Data & Databases · Data Processing Configurations.


crewAIInc/crewAI
Stars: 53,687
CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations. The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
CrewAI manages how files are processed when they exceed provider constraints by selecting modes like strict, auto, or chunking.
Language: Python. Topics: agents, ai, ai-agents.
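The strict, auto, and chunking modes mentioned for CrewAI can be illustrated with a small, library-independent sketch. This is not CrewAI's API: the `Mode` enum, the `prepare` function, and the `max_bytes` limit are hypothetical names, and the behavior shown is only one plausible reading of the mode names.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical handling modes for input that may exceed a size limit."""
    STRICT = "strict"      # refuse oversized input
    AUTO = "auto"          # pass through if it fits, otherwise chunk
    CHUNKING = "chunking"  # always split into limit-sized pieces


def prepare(data: bytes, max_bytes: int, mode: Mode) -> list[bytes]:
    """Return the pieces to send, honoring an assumed per-request size limit."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    fits = len(data) <= max_bytes
    if mode is Mode.STRICT:
        if not fits:
            raise ValueError(f"input is {len(data)} bytes, limit is {max_bytes}")
        return [data]
    if mode is Mode.AUTO and fits:
        return [data]
    # AUTO with oversized input, or CHUNKING: split into fixed-size pieces.
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]
```

A real framework would likely chunk on semantic boundaries (pages, paragraphs, tokens) rather than raw bytes; fixed-size slicing is used here only to keep the sketch short.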

ray-project/ray
Stars: 42,895
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
Language: Python. Topics: data-science, deep-learning, deployment.

sinaptik-ai/pandas-ai
Stars: 23,197
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Configures data ingestion and cleaning rules to prepare raw datasets for conversational interaction.
Language: Python. Topics: ai, csv, data.

zhisheng17/flink-learning
Stars: 15,071
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Outputs processed data streams to external systems such as message queues, databases, or files.
Language: Java. Topics: clickhouse, elasticsearch, flink.

elastic/logstash
Stars: 14,884
Logstash is a JVM-based event processor and extract, transform, load system designed for log data processing pipelines. It functions as a plugin-based data ingestor that collects, transforms, and delivers logs and event data from multiple sources to various destinations. The system utilizes a modular architecture of interchangeable input, filter, and output components to handle real-time data ingestion and enterprise log aggregation. Users can extend the pipeline's functionality by developing custom plugins to support unique data sources or specific transformation logic. The platform covers
Routes processed events to target indices or external storage systems via destination connectors.
Language: Java.

Unstructured-IO/unstructured
Stars: 14,019
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Establishes connections to target storage systems or databases to enable automated delivery of processed data.
Language: HTML. Topics: data-pipelines, deep-learning, document-image-analysis.

retejs/rete
Stars: 12,077
Rete is a framework for building interactive, node-based visual interfaces and dataflow programming environments. It provides a core engine that processes directed graphs, allowing developers to define modular logic where nodes represent operations and connections represent the flow of data or control. By decoupling the graph logic from the user interface, the framework enables the creation of custom visual editors that can be integrated into various frontend component libraries. The project distinguishes itself through a highly extensible, signal-driven architecture that supports complex req
Provides hybrid execution models for processing data and control flow through node graphs.
Language: TypeScript. Topics: dataflow-programming, flow-based-programming, graph-editor.

apache/seatunnel
Stars: 9,427
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Supports running data integration tasks across various processing backends to optimize performance.
Language: Java. Topics: apache, batch, cdc.

mage-ai/mage-ai
Stars: 8,759
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
Provides configuration interfaces to push processed datasets into target databases, warehouses, or cloud storage.
Language: Python.

cloudquery/cloudquery
Stars: 6,438
CloudQuery is a cloud infrastructure ETL tool and multi-cloud data pipeline designed to collect, synchronize, and normalize resource metadata from various cloud providers and SaaS platforms. It functions as a centralized asset inventory manager and security posture manager, extracting configuration and state data into relational databases, data lakes, or data warehouses. The system distinguishes itself by transforming complex, nested cloud API responses into flat relational tables, enabling the use of standard SQL for asset querying and analysis. It employs a modular plugin system for data ex
Implements driver-based adapters to establish connections and push metadata into various target storage systems and databases.
Language: Go. Topics: airbyte, attack-surface-management, aws.
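The idea of turning nested API responses into flat relational rows, as described for CloudQuery, can be sketched in a few lines. This is a generic illustration, not CloudQuery's implementation: the `flatten` function, the `__` column separator, and the sample record are all assumptions.

```python
def flatten(record: dict, parent: str = "", sep: str = "__") -> dict:
    """Flatten nested dicts into a single-level dict of column -> value.

    Nested keys are joined with `sep` (an assumed convention). Lists are
    left untouched; real tools often move them into child tables instead.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical nested response describing a cloud resource.
resource = {
    "id": "i-1",
    "tags": {"env": "prod"},
    "state": {"name": "running", "code": 16},
}
row = flatten(resource)
```

Each key of `row` can then serve as a column name in a relational table, which is what makes plain SQL queries over the collected metadata possible.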

cocoindex-io/cocoindex
Stars: 6,117
Cocoindex is an incremental data processing engine that builds and maintains live indexes for AI agents, with a core focus on codebase indexing and knowledge graph extraction. The engine uses a function-graph execution model where user-defined Python functions are composed into a directed acyclic graph, and it processes data incrementally so only changed source records or code paths are re-computed, avoiding full recomputation at any scale. It supports automatic schema inference from transformation pipeline type annotations and provides full data lineage tracing, tagging every output record wi
Exports indexed data to any destination including local files, cloud storage, or REST APIs.
Language: Rust. Topics: agentic-data-framework, ai, ai-agents.
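Incremental processing, where only changed source records are recomputed, can be approximated with content fingerprints. The sketch below is a simplified stand-in and not Cocoindex's engine or API: `fingerprint`, `incremental_update`, and the in-memory `cache` layout are hypothetical.

```python
import hashlib
from typing import Callable


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a source record has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_update(
    sources: dict[str, str],
    cache: dict[str, tuple[str, str]],
    transform: Callable[[str], str],
) -> list[str]:
    """Recompute outputs only for new or changed records.

    `cache` maps record key -> (fingerprint, output) and is updated in place.
    Returns the keys that were actually recomputed.
    """
    recomputed = []
    for key, text in sources.items():
        digest = fingerprint(text)
        cached = cache.get(key)
        if cached is None or cached[0] != digest:
            cache[key] = (digest, transform(text))
            recomputed.append(key)
    # Drop outputs whose source record no longer exists.
    for key in list(cache):
        if key not in sources:
            del cache[key]
    return recomputed
```

On a second run with one edited record, only that record's key is returned and its cached output replaced; unchanged records are skipped entirely.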

apache/hive
Stars: 6,012
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Supports running Hive queries on Apache Spark for accelerated performance.
Language: Java. Topics: apache, big-data, database.

dlt-hub/dlt
Stars: 5,472
dlt is a Python data ingestion tool and ETL pipeline framework designed to fetch data from diverse sources and store it in structured destinations. It works as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources into data lakes, warehouses, or vector databases. The project is distinguished by AI-assisted pipeline generation, using large language models to scaffold extraction code and connectors for REST APIs. It also supports multimodal vector storage and specialized loading for vector databases to support AI and machine learning applications. The framework covers a wide range of capabilities, including automated schema evolution, incremental data loading via state tracking, and data quality validation through data contract enforcement. It provides tools for relational data normalization, pre-load and post-load transformations, and a variety of destination adapters for SQL databases and cloud object stores. Monitoring is handled through pipeline execution dashboards, column lineage tracking, and schema version verification using content-based hashes.
Provides connectors to write extracted data into relational databases like Postgres, MySQL, and BigQuery.
Language: Python. Topics: data, data-engineering, data-lake.

infinyon/fluvio
Stars: 5,231
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed to collect, store, and replicate real-time data streams across a distributed cluster. It works as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and sinks. The platform is distinguished by its use of WebAssembly to run compiled modules for inline data transformations and filtering. This allows custom business logic to reshape information in motion without restarting the cluster. The system covers a wide range of capabilities, including connector-based data ingestion from external protocols, immutable log-based storage with zero-copy I/O, and horizontal cluster scaling. It supports building complex event-driven pipelines that use stateful processing, window-based aggregations, and partition-based data distribution. The engine can be deployed as a lightweight binary on a variety of system architectures, including ARM64 IoT devices for edge data processing.
Ships configuration interfaces for establishing connections to external target storage systems and databases.
Language: Rust.

apache/incubator-devlake
Stars: 2,940
DevLake is a DevOps data platform and analytics tool designed to orchestrate data pipelines that ingest, transform, and sync metadata from external development tools into a unified database. It functions as a system for collecting and normalizing data from source control, CI/CD pipelines, and issue trackers into a standardized schema to enable consistent software delivery analytics. The platform distinguishes itself by transforming tool-specific data into a common domain model, allowing for the calculation of engineering metrics via SQL. It provides specialized frameworks for measuring DORA m
Offers a guided process for setting up ingestion parameters to automate how data is gathered from various sources.
Language: Go. Topics: dashboard-friendly, data, data-analysis.

sfu-db/connector-x
Stars: 2,561
Connector-X is a high-performance SQL data extraction library and bridge for transferring relational database records into memory-efficient data structures. It functions as a parallel database connector and federated query engine capable of executing and joining queries across multiple remote database connections to aggregate data locally. The project distinguishes itself through a zero-copy approach to data loading, which transfers SQL query results into memory structures without duplicating data. It maximizes throughput by partitioning SQL queries into threads, employing parallel columnar a
Allows the creation of new output formats by specifying memory allocation and data partitioning during the writing process.
Language: Rust. Topics: cpp, database, dataframe.
