Why is apache/airflow a recommended batch processing scheduler repository?

Defines and monitors complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.

Why is spotify/luigi a recommended batch processing scheduler repository?

Automates and manages the execution of complex batch data processing pipelines across distributed environments.

Why is argoproj/argo a recommended batch processing scheduler repository?

Runs recurring jobs on a fixed timetable using cron-based schedules for routine maintenance and data tasks.

Why is argoproj/argo-workflows a recommended batch processing scheduler repository?

Runs periodic data processing jobs and routine infrastructure maintenance tasks on a fixed schedule or triggered by external events.

Why is hashicorp/nomad a recommended batch processing scheduler repository?

Schedules high-throughput concurrent tasks and parameterized workloads for data analytics and background processing.

Why is Unstructured-IO/unstructured a recommended batch processing scheduler repository?

Manages asynchronous document transformation jobs by queuing requests, tracking job status, and retrieving processed output files upon completion.

Why is dask/dask a recommended batch processing scheduler repository?

Distributes inference workloads across multiple processing units to apply trained models to large volumes of data.

Why is graphql/dataloader a recommended batch processing scheduler repository?

Controls when a batch of collected loads is dispatched, enabling manual triggering or delayed execution.

Why is Anionex/banana-slides a recommended batch processing scheduler repository?

Manages large-scale generation tasks with support for error handling, progress tracking, and state persistence.

Why is icloud-photos-downloader/icloud_photos_downloader a recommended batch processing scheduler repository?

Executes recurring data transfer jobs at regular intervals to keep local storage synchronized.

15 repositories

Awesome GitHub Repositories: Batch Processing Schedulers

Systems designed to automate and manage the execution of recurring data processing jobs.

Distinguishing note: Specifically targets batch-oriented workflow scheduling rather than general-purpose task automation.

Explore 15 awesome GitHub repositories matching Data & Databases · Batch Processing Schedulers.


apache/airflow
45,902 stars
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Defines and monitors complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.
Language: Python · Tags: airflow, apache, apache-airflow
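
The idea that a directed acyclic graph fixes task execution order can be illustrated with a minimal sketch. It uses only Python's standard library rather than Airflow's own API, and the task names (`extract`, `transform`, `validate`, `load`) and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order the sorter returns starts with `extract` and ends with `load`; the two middle tasks are independent of each other, so a scheduler would be free to run them in parallel.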

spotify/luigi
18,676 stars
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Automates and manages the execution of complex batch data processing pipelines across distributed environments.
Language: Python · Tags: hadoop, luigi, orchestration-framework
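
Skipping work whose output target already exists, as described above, can be sketched in plain Python. This is an illustration under stated assumptions, not Luigi's API: the `Task` class, the `build` function, and the file names are hypothetical stand-ins.

```python
import os
import tempfile


class Task:
    """Hypothetical stand-in for a pipeline task that writes one output file."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        # A task counts as done when its output target exists.
        return os.path.exists(self.output_path)

    def run(self):
        with open(self.output_path, "w") as handle:
            handle.write(f"{self.name} done\n")


def build(task, executed):
    """Run dependencies first, then the task, skipping anything already complete."""
    if task.complete():
        return
    for dependency in task.requires:
        build(dependency, executed)
    task.run()
    executed.append(task.name)


workdir = tempfile.mkdtemp()
extract = Task("extract", os.path.join(workdir, "extract.txt"))
report = Task("report", os.path.join(workdir, "report.txt"), requires=[extract])

first_run = []
build(report, first_run)    # both outputs are missing, so both tasks run

second_run = []
build(report, second_run)   # outputs exist now, so nothing is re-run
```

Running the same build twice shows the effect: the first call executes both tasks in dependency order, and the second call does no work.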

argoproj/argo
16,770 stars
Argo is a cloud native CI/CD platform and Kubernetes workflow engine. It functions as a container pipeline orchestrator and job scheduler, managing multi-step sequences of containers as jobs using directed acyclic graphs within a cluster. The system acts as a progressive delivery controller, reducing release risk through automated Canary and Blue-Green deployment strategies. It provides declarative GitOps synchronization to mirror the state of a git repository directly into the cluster environment for continuous delivery automation. The platform covers a broad range of capabilities including
Runs recurring jobs on a fixed timetable using cron-based schedules for routine maintenance and data tasks.
Language: Go

argoproj/argo-workflows
16,466 stars
Argo Workflows is a container-native workflow engine that functions as a Kubernetes custom resource controller. It orchestrates complex sequences of containerized tasks by executing them as directed acyclic graphs, allowing for dependency management and parallel processing within a cluster. The system extends the native Kubernetes control plane to manage the full lifecycle of automated processes, from initial triggering to final resource cleanup. The platform distinguishes itself through its controller-pattern reconciliation, which continuously monitors workflow states to align them with desi
Runs periodic data processing jobs and routine infrastructure maintenance tasks on a fixed schedule or triggered by external events.
Language: Go · Tags: airflow, argo, argo-workflows

hashicorp/nomad
16,211 stars
Nomad is a distributed workload orchestrator and infrastructure automation platform designed to manage the lifecycle of applications across large-scale, heterogeneous environments. It functions as a multi-cloud orchestration engine, providing a unified control plane to deploy, scale, and govern containers, virtual machines, and legacy applications. By utilizing declarative job specifications, the system ensures infrastructure convergence and maintains the desired state across distributed data centers and geographic regions. The platform distinguishes itself through a flexible, plugin-based ar
Schedules high-throughput concurrent tasks and parameterized workloads for data analytics and background processing.
Language: Go

Unstructured-IO/unstructured
14,019 stars
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Manages asynchronous document transformation jobs by queuing requests, tracking job status, and retrieving processed output files upon completion.
Language: HTML · Tags: data-pipelines, deep-learning, document-image-analysis

dask/dask
13,746 stars
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that organizes computational logic by representing tasks and their dependencies as directed acyclic graphs (DAGs). This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It includes memory-aware data spilling to prevent system crashes when processing datasets larger than available memory, and it uses task graph fusion to combine sequences of operations into single execution steps, reducing scheduling overhead and inter-node communication. The platform provides a broad capability surface for large-scale data analytics, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on diverse infrastructure, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Distributes inference workloads across multiple processing units to apply trained models to large volumes of data.
Language: Python · Tags: dask, numpy, pandas
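
Lazy evaluation, where operations are only recorded until a result is explicitly requested, can be sketched without Dask itself. The `Lazy` class and the `add` and `double` functions below are hypothetical, and this toy version performs no graph optimization or distribution.

```python
class Lazy:
    """Hypothetical deferred value: records a function and its inputs, runs nothing yet."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any deferred inputs first, then apply the recorded function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which functions actually executed, in order


def add(x, y):
    calls.append("add")
    return x + y


def double(x):
    calls.append("double")
    return 2 * x


graph = Lazy(double, Lazy(add, 1, 2))  # only builds the task graph
calls_before = list(calls)             # still empty: nothing has run

result = graph.compute()               # execution happens here
```

Because the whole graph exists before anything runs, a real scheduler has the chance to inspect and rearrange it first; this sketch simply walks it depth-first.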

graphql/dataloader
13,380 stars
DataLoader is a utility that collects individual data loads into a single batch and caches results to minimize redundant backend requests. It operates on a batch-and-cache architecture, where multiple data lookups within a single execution frame are grouped together and dispatched as one request, with the results stored in memory for instant retrieval on subsequent calls. The utility distinguishes itself through several key capabilities. It supports per-key error handling, allowing partial failures within a batch without rejecting the entire operation. A cache priming mechanism lets developer
Controls when a batch of collected loads is dispatched, enabling manual triggering or delayed execution.
Language: JavaScript · Tags: batch, dataloader, graphql
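
The batch-and-cache pattern with per-key errors and manually triggered dispatch can be sketched as follows. This is a synchronous Python illustration under stated assumptions, not the DataLoader API, which is JavaScript and promise-based; `BatchLoader`, `fetch_users`, and the keys are hypothetical.

```python
class BatchLoader:
    """Hypothetical synchronous sketch of batch-and-cache loading."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # takes a list of keys, returns results in the same order
        self.cache = {}
        self.queue = []

    def load(self, key):
        """Queue a key unless it is already cached or already queued."""
        if key not in self.cache and key not in self.queue:
            self.queue.append(key)

    def dispatch(self):
        """Send all queued keys to the backend as one batch and cache the results."""
        if not self.queue:
            return
        keys, self.queue = self.queue, []
        for key, value in zip(keys, self.batch_fn(keys)):
            self.cache[key] = value

    def get(self, key):
        """Return a cached result, raising if that particular key failed."""
        value = self.cache[key]
        if isinstance(value, Exception):
            raise value
        return value


batches = []  # records every batch the backend receives


def fetch_users(keys):
    batches.append(list(keys))
    # A failure for one key is returned in place rather than failing the whole batch.
    return [KeyError(k) if k < 0 else f"user-{k}" for k in keys]


loader = BatchLoader(fetch_users)
for key in (1, 2, 1, -5):
    loader.load(key)
loader.dispatch()   # one backend call covering keys 1, 2 and -5

loader.load(2)      # already cached, so nothing is queued
loader.dispatch()   # no second backend call
```

Returning an exception object in the failing key's slot is one way to let the other keys in the same batch succeed.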

Anionex/banana-slides
12,060 stars
Banana-slides is a generative AI workflow engine designed to automate the creation and refinement of professional slide decks. By leveraging large language models, the platform transforms raw text, structured outlines, and existing documents into visual presentations. It functions as an automated tool that orchestrates the entire lifecycle of a presentation, from initial content generation and layout design to final export. The system distinguishes itself through a modular provider abstraction that allows users to integrate various artificial intelligence services for content and image synthe
Manages large-scale generation tasks with support for error handling, progress tracking, and state persistence.
Language: Python · Tags: ai-ppt-maker, ai-slide-builder, ai-slides

icloud-photos-downloader/icloud_photos_downloader
12,046 stars
This tool is a command-line utility designed to synchronize and archive media from cloud storage to local directories. It functions as an automated backup service that maintains a local mirror of remote photo libraries, ensuring that local storage remains current with remote changes through periodic monitoring and incremental updates. The project distinguishes itself through its support for persistent, containerized background execution, which allows for continuous, automated management of media collections. It provides robust multi-account isolation, enabling users to manage multiple indepen
Executes recurring data transfer jobs at regular intervals to keep local storage synchronized.
Language: Python

feast-dev/feast
6,727 stars
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Runs a batch engine on a recurring schedule to materialize features.
Language: Python · Tags: big-data, data-engineering, data-quality

qor/qor
5,345 stars
Qor is a Go admin framework and backend toolkit used for building administrative interfaces, headless content management systems, and REST API generators. It provides a structured environment for implementing business application backends, specializing in the management of structured content and media assets. The project distinguishes itself through comprehensive multi-language content management, featuring locale-based data versioning and a dedicated system for internationalization and translation administration. It further differentiates its offering with a built-in state machine implementa
Provides a system for executing background tasks and jobs on a defined schedule.
Language: Go · Tags: admin, api, cms

vogler/free-games-claimer
4,142 stars
This project is an automated digital content claimer and game store automation bot. It functions as a headless client that handles account authentication and request sequences to collect free digital games and downloadable content on a schedule. The tool provides dedicated automation for the Epic Games Store, GOG, and Amazon Prime Gaming. It uses storefront-specific adapter logic to secure limited-time offers and build a digital game library without manual browser intervention. The system includes cron-based task scheduling for daily checks, automated login flows using stored credentials, and headless browser automation. It also has a notification system that sends claim status alerts through external webhooks.
Schedules recurring batch jobs to execute the content collection process on a fixed daily timetable.
Language: JavaScript · Tags: amazon-games, automation, claimer

orchest/orchest
4,138 stars
Orchest is a data pipeline orchestrator and containerized workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform distinguishes itself by using containers to manage software dependencies, which ensures consistent execution across different environments. It has a polyglot task scheduler capable of triggering jobs written in multiple programming languages and includes a version control system that tracks historical snapshots of project configuration and code. The system covers visual workflow design and graph-based dependency mapping, and supports time-triggered task scheduling for recurring or immediate execution. It also supports deploying persistent background services that remain active for the duration of a pipeline run.
Automates and manages the execution of recurring data processing jobs on a scheduled basis.
Language: TypeScript · Tags: airflow, cloud, dag

PandaAI-Tech/panda_factor
2,940 stars
Panda Factor is a quantitative trading infrastructure and alpha factor framework. It serves as a backend system for building, calculating, and managing mathematical signals designed to predict the price movements of financial assets. The project functions as a technical indicator engine that generates quantitative metrics from price and volume data. It utilizes a financial data pipeline to automate the synchronization of market data from multiple providers on a nightly schedule. The system provides capabilities for quantitative alpha generation and the construction of financial indicators us
Automates the recurring nightly synchronization of market data from external providers to maintain historical records.
Language: Python
