15 repositories
Systems designed to automate and manage the execution of recurring data processing jobs.
Distinguishing note: Specifically targets batch-oriented workflow scheduling rather than general-purpose task automation.
Explore 15 GitHub repositories matching Data & Databases · Batch Processing Schedulers.
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Define and monitor complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Automates and manages the execution of complex batch data processing pipelines across distributed environments.
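The output-target check described for Luigi above can be illustrated with a minimal sketch. This is not Luigi's API: the `Task` class, the `build` function, and the file names are hypothetical, and the sketch assumes that a task counts as complete when its output file exists.

```python
import os
import tempfile


class Task:
    """Hypothetical task: a name, upstream dependencies, and one output file."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        # Assumption: a task is done when its output target exists.
        return os.path.exists(self.output_path)

    def run(self):
        with open(self.output_path, "w") as f:
            f.write(self.name)


def build(task, executed):
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, executed)
    task.run()
    executed.append(task.name)


workdir = tempfile.mkdtemp()
extract = Task("extract", os.path.join(workdir, "extract.txt"))
transform = Task("transform", os.path.join(workdir, "transform.txt"), [extract])
load = Task("load", os.path.join(workdir, "load.txt"), [transform])

first_run = []
build(load, first_run)   # executes the three tasks in dependency order

second_run = []
build(load, second_run)  # executes nothing: every output target already exists
```

Rerunning the same pipeline is then cheap, because only tasks with missing outputs do any work.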
Argo is a cloud native CI/CD platform and Kubernetes workflow engine. It functions as a container pipeline orchestrator and job scheduler, managing multi-step sequences of containers as jobs using directed acyclic graphs within a cluster. The system acts as a progressive delivery controller, reducing release risk through automated Canary and Blue-Green deployment strategies. It provides declarative GitOps synchronization to mirror the state of a git repository directly into the cluster environment for continuous delivery automation. The platform covers a broad range of capabilities including
Runs recurring jobs on a fixed timetable using cron-based schedules for routine maintenance and data tasks.
Argo Workflows is a container-native workflow engine that functions as a Kubernetes custom resource controller. It orchestrates complex sequences of containerized tasks by executing them as directed acyclic graphs, allowing for dependency management and parallel processing within a cluster. The system extends the native Kubernetes control plane to manage the full lifecycle of automated processes, from initial triggering to final resource cleanup. The platform distinguishes itself through its controller-pattern reconciliation, which continuously monitors workflow states to align them with desi
Runs periodic data processing jobs and routine infrastructure maintenance tasks on a fixed schedule or triggered by external events.
Nomad is a distributed workload orchestrator and infrastructure automation platform designed to manage the lifecycle of applications across large-scale, heterogeneous environments. It functions as a multi-cloud orchestration engine, providing a unified control plane to deploy, scale, and govern containers, virtual machines, and legacy applications. By utilizing declarative job specifications, the system ensures infrastructure convergence and maintains the desired state across distributed data centers and geographic regions. The platform distinguishes itself through a flexible, plugin-based ar
Schedules high-throughput concurrent tasks and parameterized workloads for data analytics and background processing.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Manages asynchronous document transformation jobs by queuing requests, tracking job status, and retrieving processed output files upon completion.
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that coordinates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It includes memory-aware data spilling to prevent system crashes when processing datasets that exceed available memory, and uses task graph fusion to merge sequences of operations into single execution steps, reducing scheduling overhead and inter-node communication. The platform provides a comprehensive set of capabilities for large-scale data analytics, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across diverse infrastructure, including local machines, cloud providers, containerized systems, and high-performance computing clusters.
Distributes inference workloads across multiple processing units to apply trained models to large volumes of data.
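The lazy evaluation described for Dask above can be sketched without the library itself. The `Lazy` class and the `add` and `double` helpers below are hypothetical and are not part of Dask; the sketch only shows the general idea of recording a task graph first and computing it on explicit request.

```python
class Lazy:
    """Hypothetical deferred node: stores a function and its inputs, computes on demand."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

    def compute(self):
        # Resolve upstream nodes first, then apply this node's function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.fn(*resolved)


log = []


def add(x, y):
    log.append("add")
    return x + y


def double(x):
    log.append("double")
    return 2 * x


# Building the graph runs nothing; it only records the dependencies.
graph = Lazy(double, Lazy(add, 1, 2))
ran_before_compute = list(log)

# Work happens only when the result is explicitly requested.
result = graph.compute()
```

Because the whole graph is known before anything runs, a real scheduler has the chance to reorder, fuse, or distribute the recorded steps, which this toy version does not attempt.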
DataLoader is a utility that collects individual data loads into a single batch and caches results to minimize redundant backend requests. It operates on a batch-and-cache architecture, where multiple data lookups within a single execution frame are grouped together and dispatched as one request, with the results stored in memory for instant retrieval on subsequent calls. The utility distinguishes itself through several key capabilities. It supports per-key error handling, allowing partial failures within a batch without rejecting the entire operation. A cache priming mechanism lets developer
Controls when a batch of collected loads is dispatched, enabling manual triggering or delayed execution.
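The batch-and-cache behavior described for DataLoader above, together with manually triggered dispatch, might look roughly like the following. This is a sketch under stated assumptions, not the DataLoader API: `BatchLoader`, `fetch_users`, and the key values are all hypothetical names chosen for illustration.

```python
class BatchLoader:
    """Hypothetical batch-and-cache loader with manual dispatch."""

    def __init__(self, batch_fn):
        # Assumption: batch_fn takes a list of keys and returns values in the same order.
        self.batch_fn = batch_fn
        self.cache = {}
        self.pending = []

    def load(self, key):
        # Queue a key unless it is already cached or already queued.
        if key not in self.cache and key not in self.pending:
            self.pending.append(key)

    def dispatch(self):
        # Send every queued key to the backend in a single call.
        if not self.pending:
            return
        keys, self.pending = self.pending, []
        for key, value in zip(keys, self.batch_fn(keys)):
            self.cache[key] = value

    def get(self, key):
        return self.cache[key]


calls = []


def fetch_users(ids):
    calls.append(list(ids))  # record each backend round trip
    return [f"user-{i}" for i in ids]


loader = BatchLoader(fetch_users)
for user_id in (1, 2, 1, 3):
    loader.load(user_id)
loader.dispatch()  # one backend request for keys 1, 2, 3

loader.load(2)     # already cached, so nothing is queued
loader.dispatch()  # no additional backend request
```

Four lookups collapse into one backend call here, and the repeated key is served from memory.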
Banana-slides is a generative AI workflow engine designed to automate the creation and refinement of professional slide decks. By leveraging large language models, the platform transforms raw text, structured outlines, and existing documents into visual presentations. It functions as an automated tool that orchestrates the entire lifecycle of a presentation, from initial content generation and layout design to final export. The system distinguishes itself through a modular provider abstraction that allows users to integrate various artificial intelligence services for content and image synthe
Manages large-scale generation tasks with support for error handling, progress tracking, and state persistence.
This tool is a command-line utility designed to synchronize and archive media from cloud storage to local directories. It functions as an automated backup service that maintains a local mirror of remote photo libraries, ensuring that local storage remains current with remote changes through periodic monitoring and incremental updates. The project distinguishes itself through its support for persistent, containerized background execution, which allows for continuous, automated management of media collections. It provides robust multi-account isolation, enabling users to manage multiple indepen
Executes recurring data transfer jobs at regular intervals to keep local storage synchronized.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Runs a batch engine on a recurring schedule to materialize features.
Qor is a Go admin framework and backend toolkit used for building administrative interfaces, headless content management systems, and REST API generators. It provides a structured environment for implementing business application backends, specializing in the management of structured content and media assets. The project distinguishes itself through comprehensive multi-language content management, featuring locale-based data versioning and a dedicated system for internationalization and translation administration. It further differentiates its offering with a built-in state machine implementa
Provides a system for executing background tasks and jobs on a defined schedule.
This project is an automatic digital content claimer and automation bot for game stores. It works as a headless client that handles account authentication and request sequences to collect free digital games and downloadable content on a schedule. The tool provides specific automation for the Epic Games Store, GOG, and Amazon Prime Gaming. It uses storefront-specific adapter logic to secure time-limited offers and build a digital game library without manual browser intervention. The system integrates cron-based task scheduling for daily checks, automatic login flows using stored credentials, and headless browser automation. It also includes a notification system that sends claim status alerts via external webhooks.
Schedules recurring batch jobs to execute the content collection process on a fixed daily timetable.
Orchest is a data pipeline orchestrator and container-based workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform distinguishes itself by using containers to manage software dependencies, ensuring consistent execution across different environments. It features a multi-language task scheduler capable of running jobs written in multiple programming languages, and includes a version control system that tracks historical snapshots of project configurations and code. The system covers visual workflow design and graph-based dependency mapping, along with time-based job scheduling for recurring or immediate execution. It also supports deploying persistent background services that remain active for the duration of a pipeline run.
Automates and manages the execution of recurring data processing jobs on a scheduled basis.
Panda Factor is a quantitative trading infrastructure and alpha factor framework. It serves as a backend system for building, calculating, and managing mathematical signals designed to predict the price movements of financial assets. The project functions as a technical indicator engine that generates quantitative metrics from price and volume data. It utilizes a financial data pipeline to automate the synchronization of market data from multiple providers on a nightly schedule. The system provides capabilities for quantitative alpha generation and the construction of financial indicators us
Automates the recurring nightly synchronization of market data from external providers to maintain historical records.