Why is apache/airflow a recommended Batch Processing Schedulers repository?

Defines and monitors complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.

Why is spotify/luigi a recommended Batch Processing Schedulers repository?

Automates and manages the execution of complex batch data processing pipelines across distributed environments.

Why is argoproj/argo a recommended Batch Processing Schedulers repository?

Runs recurring jobs on a fixed timetable using cron-based schedules for routine maintenance and data tasks.

Why is argoproj/argo-workflows a recommended Batch Processing Schedulers repository?

Runs periodic data processing jobs and routine infrastructure maintenance tasks on a fixed schedule or triggered by external events.

Why is hashicorp/nomad a recommended Batch Processing Schedulers repository?

Schedules high-throughput concurrent tasks and parameterized workloads for data analytics and background processing.

Why is Unstructured-IO/unstructured a recommended Batch Processing Schedulers repository?

Manages asynchronous document transformation jobs by queuing requests, tracking job status, and retrieving processed output files upon completion.

Why is dask/dask a recommended Batch Processing Schedulers repository?

Distributes inference workloads across multiple processing units to apply trained models to large volumes of data.

Why is graphql/dataloader a recommended Batch Processing Schedulers repository?

Controls when a batch of collected loads is dispatched, enabling manual triggering or delayed execution.

Why is Anionex/banana-slides a recommended Batch Processing Schedulers repository?

Manages large-scale generation tasks with support for error handling, progress tracking, and state persistence.

Why is icloud-photos-downloader/icloud_photos_downloader a recommended Batch Processing Schedulers repository?

Executes recurring data transfer jobs at regular intervals to keep local storage synchronized.

15 repositories

Awesome GitHub Repositories: Batch Processing Schedulers

Systems designed to automate and manage the execution of recurring data processing jobs.

Distinguishing note: Specifically targets batch-oriented workflow scheduling rather than general-purpose task automation.

Explore 15 awesome GitHub repositories matching data & databases · Batch Processing Schedulers.


apache/airflow
45,902
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Defines and monitors complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.
Python · airflow, apache, apache-airflow
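The Airflow description above says workflows are represented as directed acyclic graphs so that task order is explicit. As a rough illustration of that idea only, and not of Airflow's own API, the sketch below orders a hypothetical pipeline by its dependencies using Python's standard library. The task names and the `run_order` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())

if __name__ == "__main__":
    # Dependencies always appear before the tasks that need them.
    print(run_order(PIPELINE))
```

A real scheduler would also track state, retries, and timing; this only shows how a dependency graph fixes the order of execution.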
spotify/luigi
18,676
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Automates and manages the execution of complex batch data processing pipelines across distributed environments.
Python · hadoop, luigi, orchestration-framework
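The Luigi description above mentions skipping work when an output target already exists. A minimal sketch of that pattern, assuming a plain file as the target, might look like the following. It does not use Luigi's API; `run_task` and its parameters are hypothetical names chosen for illustration.

```python
import os
import tempfile

def run_task(output_path, produce):
    """Run `produce` only if `output_path` does not exist yet.

    Returns True if work was done, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output target already present, nothing to do
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a partial file that looks like a finished target.
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(produce())
    os.replace(tmp_path, output_path)
    return True

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "report.txt")
        print(run_task(target, lambda: "rows=3\n"))  # first call does the work
        print(run_task(target, lambda: "rows=3\n"))  # second call is skipped
```

Treating the output file as the record of completion is what makes reruns cheap: a repeated run touches only the tasks whose targets are missing.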
argoproj/argo
16,770
Argo is a cloud native CI/CD platform and Kubernetes workflow engine. It functions as a container pipeline orchestrator and job scheduler, managing multi-step sequences of containers as jobs using directed acyclic graphs within a cluster. The system acts as a progressive delivery controller, reducing release risk through automated Canary and Blue-Green deployment strategies. It provides declarative GitOps synchronization to mirror the state of a git repository directly into the cluster environment for continuous delivery automation. The platform covers a broad range of capabilities including
Runs recurring jobs on a fixed timetable using cron-based schedules for routine maintenance and data tasks.
Go
argoproj/argo-workflows
16,466
Argo Workflows is a container-native workflow engine that functions as a Kubernetes custom resource controller. It orchestrates complex sequences of containerized tasks by executing them as directed acyclic graphs, allowing for dependency management and parallel processing within a cluster. The system extends the native Kubernetes control plane to manage the full lifecycle of automated processes, from initial triggering to final resource cleanup. The platform distinguishes itself through its controller-pattern reconciliation, which continuously monitors workflow states to align them with desi
Runs periodic data processing jobs and routine infrastructure maintenance tasks on a fixed schedule or triggered by external events.
Go · airflow, argo, argo-workflows
hashicorp/nomad
16,211
Nomad is a distributed workload orchestrator and infrastructure automation platform designed to manage the lifecycle of applications across large-scale, heterogeneous environments. It functions as a multi-cloud orchestration engine, providing a unified control plane to deploy, scale, and govern containers, virtual machines, and legacy applications. By utilizing declarative job specifications, the system ensures infrastructure convergence and maintains the desired state across distributed data centers and geographic regions. The platform distinguishes itself through a flexible, plugin-based ar
Schedules high-throughput concurrent tasks and parameterized workloads for data analytics and background processing.
Go
Unstructured-IO/unstructured
14,019
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Manages asynchronous document transformation jobs by queuing requests, tracking job status, and retrieving processed output files upon completion.
HTML · data-pipelines, deep-learning, document-image-analysis
dask/dask
13,746
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It acts as a cluster resource manager that coordinates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across available hardware while managing complex execution requirements. The project is distinguished by a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It includes memory-aware data spilling to prevent crashes when processing datasets that exceed available memory, and it uses task graph fusion to merge sequences of operations into single execution steps, reducing scheduling overhead and inter-node communication. The platform provides a comprehensive capability surface for large-scale data analytics, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It offers extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across diverse infrastructure, including local machines, cloud providers, containerized systems, and high-performance computing clusters.
Distributes inference workloads across multiple processing units to apply trained models to large volumes of data.
Python · dask, numpy, pandas
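The Dask description above refers to lazy evaluation, where operations are recorded as a graph and run only when a result is requested. The toy sketch below illustrates that idea in plain Python; it is not Dask's API, and the `Lazy` class and the helper functions are hypothetical names used only for this example.

```python
class Lazy:
    """A deferred call: stores a function and its arguments without running it."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

    def compute(self):
        # Resolve upstream Lazy arguments first, then call this node's function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.fn(*resolved)

def add(x, y):
    return x + y

def double(x):
    return 2 * x

if __name__ == "__main__":
    # Building the graph runs nothing; work happens only inside compute().
    graph = Lazy(add, Lazy(double, 3), Lazy(double, 4))
    print(graph.compute())
```

Because the whole graph exists before anything runs, a scheduler has the chance to reorder, fuse, or distribute the steps, which this sketch leaves out.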
graphql/dataloader
13,380
DataLoader is a utility that collects individual data loads into a single batch and caches results to minimize redundant backend requests. It operates on a batch-and-cache architecture, where multiple data lookups within a single execution frame are grouped together and dispatched as one request, with the results stored in memory for instant retrieval on subsequent calls. The utility distinguishes itself through several key capabilities. It supports per-key error handling, allowing partial failures within a batch without rejecting the entire operation. A cache priming mechanism lets developer
Controls when a batch of collected loads is dispatched, enabling manual triggering or delayed execution.
JavaScript · batch, dataloader, graphql
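The DataLoader description above outlines a batch-and-cache approach: lookups made in the same execution frame are grouped into one request, and results are kept in memory. The following is a small Python sketch of that approach using `asyncio`, not the DataLoader library itself (which is JavaScript). `BatchLoader` and the shape of its batch function are assumptions made for illustration.

```python
import asyncio

class BatchLoader:
    """Groups keys requested in the same event-loop tick into one batch call."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn   # async: list of keys -> list of values, same order
        self._cache = {}            # key -> Future that holds (or will hold) the value
        self._queue = []            # keys waiting for the next dispatch
        self._task = None           # reference to the pending dispatch task

    def load(self, key):
        """Return a Future for `key`; must be called while an event loop is running."""
        if key in self._cache:      # already loaded or already queued
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        if not self._queue:         # first key of a new batch: schedule one dispatch
            self._task = loop.create_task(self._dispatch())
        self._queue.append(key)
        return future

    async def _dispatch(self):
        # Runs on a later loop iteration, after the current callers have queued keys.
        keys, self._queue = self._queue, []
        try:
            values = await self._batch_fn(keys)
        except Exception as exc:
            for key in keys:        # do not cache failures
                self._cache.pop(key).set_exception(exc)
            return
        for key, value in zip(keys, values):
            self._cache[key].set_result(value)

async def demo():
    async def fetch_users(ids):     # hypothetical backend call
        return [f"user-{i}" for i in ids]
    loader = BatchLoader(fetch_users)
    return await asyncio.gather(loader.load(1), loader.load(2), loader.load(1))

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the three `load` calls produce a single call to the batch function with two distinct keys, and a later `load` for either key is answered from the cache.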
Anionex/banana-slides
12,060
Banana-slides is a generative AI workflow engine designed to automate the creation and refinement of professional slide decks. By leveraging large language models, the platform transforms raw text, structured outlines, and existing documents into visual presentations. It functions as an automated tool that orchestrates the entire lifecycle of a presentation, from initial content generation and layout design to final export. The system distinguishes itself through a modular provider abstraction that allows users to integrate various artificial intelligence services for content and image synthe
Manages large-scale generation tasks with support for error handling, progress tracking, and state persistence.
Python · ai-ppt-maker, ai-slide-builder, ai-slides
icloud-photos-downloader/icloud_photos_downloader
12,046
This tool is a command-line utility designed to synchronize and archive media from cloud storage to local directories. It functions as an automated backup service that maintains a local mirror of remote photo libraries, ensuring that local storage remains current with remote changes through periodic monitoring and incremental updates. The project distinguishes itself through its support for persistent, containerized background execution, which allows for continuous, automated management of media collections. It provides robust multi-account isolation, enabling users to manage multiple indepen
Executes recurring data transfer jobs at regular intervals to keep local storage synchronized.
Python
feast-dev/feast
6,727
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Runs a batch engine on a recurring schedule to materialize features.
Python · big-data, data-engineering, data-quality
qor/qor
5,345
Qor is a Go admin framework and backend toolkit used for building administrative interfaces, headless content management systems, and REST API generators. It provides a structured environment for implementing business application backends, specializing in the management of structured content and media assets. The project distinguishes itself through comprehensive multi-language content management, featuring locale-based data versioning and a dedicated system for internationalization and translation administration. It further differentiates its offering with a built-in state machine implementa
Provides a system for executing background tasks and jobs on a defined schedule.
Go · admin, api, cms
vogler/free-games-claimer
4,142
This project is an automated digital content claimer and game store automation bot. It works as a headless client that handles account authentication and request sequences to collect free digital games and downloadable content on a schedule. The tool provides dedicated automation for the Epic Games Store, GOG, and Amazon Prime Gaming. It uses storefront-specific adapter logic to secure time-limited offers and build a digital game library without manual browser interaction. The system integrates cron-based task scheduling for daily checks, automated login flows using stored credentials, and headless browser automation. It also includes a notification system that sends claim status alerts through external webhooks.
Schedules recurring batch jobs to execute the content collection process on a fixed daily timetable.
JavaScript · amazon-games, automation, claimer
orchest/orchest
4,138
Orchest is a data pipeline orchestrator and container-based workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform uses containers to manage software dependencies, ensuring consistent execution across different environments. It features a multi-language task scheduler capable of running jobs written in multiple programming languages and includes a version control system that tracks historical snapshots of project configurations and code. The system covers visual workflow design and graph-based dependency mapping, along with time-based task scheduling for recurring or immediate execution. It also supports deploying persistent background services that remain active for the duration of a pipeline run.
Automates and manages the execution of recurring data processing jobs on a scheduled basis.
TypeScript · airflow, cloud, dag
PandaAI-Tech/panda_factor
2,940
Panda Factor is a quantitative trading infrastructure and alpha factor framework. It serves as a backend system for building, calculating, and managing mathematical signals designed to predict the price movements of financial assets. The project functions as a technical indicator engine that generates quantitative metrics from price and volume data. It utilizes a financial data pipeline to automate the synchronization of market data from multiple providers on a nightly schedule. The system provides capabilities for quantitative alpha generation and the construction of financial indicators us
Automates the recurring nightly synchronization of market data from external providers to maintain historical records.
Python

