9 repositories
Compute-intensive operations for file, API, and database interactions.
Distinguishing note: Focuses on data-heavy operations rather than general task orchestration.
Explore 9 awesome GitHub repositories matching data & databases · Data Processing Tasks.
Kestra is a declarative workflow orchestrator designed to manage complex task dependencies and automated processes through versioned configuration files. It functions as a distributed platform that decouples task scheduling from execution by offloading computational workloads to a fleet of worker nodes. The system uses a reactive, event-driven engine to initiate workflows automatically in response to external signals, webhooks, schedules, or file system changes. The platform distinguishes itself through a modular plugin architecture that allows for the integration of custom tasks and external
Offloads compute-intensive data tasks like file operations and API requests to dedicated worker components.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Encapsulates units of computation by specifying input requirements and output targets.
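The idea of a task that declares its requirements and an output target, and is skipped once that target exists, can be sketched in plain Python. This is not Luigi's API: the `Task` base class, the `build` helper, and the two example tasks are hypothetical names that use only the standard library and mimic the behavior described above.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical base class: a unit of work with requirements and one output file."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one runs.
        return []

    def output(self):
        # Path of the file this task produces.
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output target already exists.
        return self.output().exists()


class WriteNumbers(Task):
    def output(self):
        return self.workdir / "numbers.txt"

    def run(self):
        self.output().write_text("1\n2\n3\n")


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return self.workdir / "sum.txt"

    def run(self):
        numbers = self.requires()[0].output().read_text().split()
        self.output().write_text(str(sum(int(n) for n in numbers)))


def build(task, executed):
    """Run dependencies first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)


workdir = Path(tempfile.mkdtemp())

first_pass = []
build(SumNumbers(workdir), first_pass)   # runs both tasks, dependency first

second_pass = []
build(SumNumbers(workdir), second_pass)  # nothing runs: outputs already exist

result = (workdir / "sum.txt").read_text()
```

Checking for the output file before running is what makes a repeated build a no-op, which is the redundancy check the description refers to.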
CVAT is an open-source, web-based platform designed for annotating images, videos, and 3D point clouds to create high-quality training datasets for machine learning. It functions as a containerized server that orchestrates the entire lifecycle of computer vision data, from initial task creation and manual labeling to quality assurance and final dataset export. The platform distinguishes itself through deep integration with machine learning models, allowing users to deploy custom AI models as serverless functions for automated object detection, tracking, and skeleton annotation. It supports co
Enables bulk labeling and data organization across multiple files or frames using automated scripts.
FileCentipede is a comprehensive file management and transfer application designed to handle diverse network protocols and data operations. It functions as a multi-protocol download manager, a full-featured BitTorrent client, and a remote filesystem manager, providing a unified interface for moving and organizing data across local and remote environments. The application distinguishes itself through deep browser integration, which allows for the direct capture of media streams, video, and bulk download links from web pages. It also includes a modular utility suite that enables users to perfor
Executes compute-intensive data processing tasks including file merging, checksums, and HTTP request construction.
Records is a SQL database client designed for executing raw queries and managing result sets through a simplified interface. It provides a parameterized SQL executor to bind values to placeholders, ensuring safe data handling and preventing injection attacks, alongside a database transaction manager for grouping operations into atomic units. The project includes a dedicated command-line interface for running database statements and exporting query results directly to local files. This tooling allows for the conversion of SQL result sets into multiple serialization formats, including CSV, JSON
Executes the same SQL query multiple times with different parameters to handle large datasets efficiently.
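Running one parameterized statement against many parameter sets can be illustrated with Python's standard-library `sqlite3` module. This is a stand-in, not the Records API: the `items` table and sample rows are made up for the example, and `executemany` plays the role of the bulk executor.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, quantity INTEGER)")

# Hypothetical rows; the apostrophe shows that placeholders handle quoting safely.
rows = [("lamp", 2), ("o'brien desk", 1), ("chair", 4)]

# One statement, many parameter sets, wrapped in a single transaction:
# the `with` block commits on success and rolls back on error.
with conn:
    conn.executemany("INSERT INTO items (name, quantity) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
total = conn.execute("SELECT SUM(quantity) FROM items").fetchone()[0]
```

Binding values through `?` placeholders rather than string formatting is what keeps the repeated statement safe from injection, and grouping the inserts in one transaction makes the batch atomic.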
HomeBox is a self-hosted home inventory manager designed for tracking physical belongings and household assets. It functions as a digital catalog for creating structured databases of objects, including records of locations, categories, and purchase history. The system distinguishes itself through the use of QR code generation to link physical objects to digital records and the support of hierarchical location mapping to track assets across nested environments. It further enables automation via a REST API and centralizes access management through OpenID Connect integration for user authenticat
Supports large-scale imports and exports of inventory records using comma-separated values for efficient batch updates.
StreamPark is a centralized management platform designed to coordinate the deployment, monitoring, and operational lifecycle of distributed stream-processing and batch applications. It functions as a control plane and data pipeline orchestrator, specifically providing management capabilities for Apache Flink and Hadoop YARN environments. The platform distinguishes itself through a low-code approach to job deployment and a multi-engine execution adapter that supports diverse processing runtimes. It facilitates real-time data pipeline management by combining streaming SQL analytics with a resource-based deployment pipeline that handles versioning, binary uploads, and savepoint-based state recovery. The system covers a broad range of capabilities including distributed job orchestration, real-time data integration via prebuilt connectors, and identity integration via LDAP or SSO. It also provides tooling for second-level application monitoring and automated notifications of operational failures.
Handles compute-intensive operations including startup, savepoints, and performance analysis of stream jobs.
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
Performs bulk updates and exports across Google Workspace services using data sourced from CSV files.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Transfers bulk email accounts between different hosting providers using administrative or user credentials.