9 repositories
Compute-intensive operations for file, API, and database interactions.
Distinguishing note: Focuses on data-heavy operations rather than general task orchestration.
Explore 9 GitHub repositories matching Data & Databases · Data Processing Tasks.
Kestra is a declarative workflow orchestrator designed to manage complex task dependencies and automated processes through versioned configuration files. It functions as a distributed platform that decouples task scheduling from execution by offloading computational workloads to a fleet of worker nodes. The system uses a reactive, event-driven engine to initiate workflows automatically in response to external signals, webhooks, schedules, or file system changes. The platform distinguishes itself through a modular plugin architecture that allows for the integration of custom tasks and external
Offloads compute-intensive data tasks like file operations and API requests to dedicated worker components.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Encapsulates units of computation by specifying input requirements and output targets.
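The idea of skipping work whose output target already exists can be sketched in plain Python. This is an illustration only, not Luigi's API: the `Task` base class, the `build` runner, and the `Extract` and `Total` tasks are hypothetical stand-ins, and the file names are made up for the example.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a unit of work with dependencies and one output file."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output target already exists.
        return os.path.exists(self.output())


def build(task, log):
    """Run dependencies first, then the task, skipping anything already complete."""
    if task.complete():
        log.append(f"skip {type(task).__name__}")
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(f"run {type(task).__name__}")


class Extract(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n4\n5\n")


class Total(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "total.txt")

    def run(self):
        with open(self.requires()[0].output()) as src:
            total = sum(int(line) for line in src)
        with open(self.output(), "w") as dst:
            dst.write(str(total))


workdir = tempfile.mkdtemp()
first_log, second_log = [], []
build(Total(workdir), first_log)   # nothing exists yet, so both tasks run
build(Total(workdir), second_log)  # total.txt exists, so the work is skipped
```

On the first call the dependency runs before the task that needs it; on the second call the existing output file short-circuits the whole chain. A real scheduler would also handle failures, parallelism, and shared state, which this sketch leaves out.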
CVAT is an open-source, web-based platform designed for annotating images, videos, and 3D point clouds to create high-quality training datasets for machine learning. It functions as a containerized server that orchestrates the entire lifecycle of computer vision data, from initial task creation and manual labeling to quality assurance and final dataset export. The platform distinguishes itself through deep integration with machine learning models, allowing users to deploy custom AI models as serverless functions for automated object detection, tracking, and skeleton annotation. It supports co
Enables bulk labeling and data organization across multiple files or frames using automated scripts.
FileCentipede is a comprehensive file management and transfer application designed to handle diverse network protocols and data operations. It functions as a multi-protocol download manager, a full-featured BitTorrent client, and a remote filesystem manager, providing a unified interface for moving and organizing data across local and remote environments. The application distinguishes itself through deep browser integration, which allows for the direct capture of media streams, video, and bulk download links from web pages. It also includes a modular utility suite that enables users to perfor
Executes compute-intensive data processing tasks including file merging, checksums, and HTTP request construction.
Records is a SQL database client designed for executing raw queries and managing result sets through a simplified interface. It provides a parameterized SQL executor to bind values to placeholders, ensuring safe data handling and preventing injection attacks, alongside a database transaction manager for grouping operations into atomic units. The project includes a dedicated command-line interface for running database statements and exporting query results directly to local files. This tooling allows for the conversion of SQL result sets into multiple serialization formats, including CSV, JSON
Executes the same SQL query multiple times with different parameters to handle large datasets efficiently.
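Running one statement against many parameter sets, with values bound to placeholders, might look like the sketch below. It uses `sqlite3` from the Python standard library rather than Records itself, and the `items` table and its rows are assumptions made up for the example.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, qty INTEGER)")

rows = [("bolt", 40), ("nut", 75), ("washer", 120)]

# One statement, many parameter sets; values are bound, never interpolated.
with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany("INSERT INTO items (name, qty) VALUES (?, ?)", rows)

total = conn.execute("SELECT SUM(qty) FROM items").fetchone()[0]

# A hostile value is treated as plain data because it is bound to the placeholder.
hostile = "bolt'; DROP TABLE items; --"
matches = conn.execute(
    "SELECT qty FROM items WHERE name = ?", (hostile,)
).fetchall()
```

Because the hostile string is passed as a bound parameter, it is compared as an ordinary value and the table is left untouched.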
HomeBox is a self-hosted home inventory manager designed for tracking physical belongings and household assets. It functions as a digital catalog for creating structured databases of objects, including records of locations, categories, and purchase history. The system distinguishes itself through the use of QR code generation to link physical objects to digital records and the support of hierarchical location mapping to track assets across nested environments. It further enables automation via a REST API and centralizes access management through OpenID Connect integration for user authenticat
Supports large-scale imports and exports of inventory records using comma-separated values for efficient batch updates.
StreamPark is a centralized management platform designed to coordinate the deployment, monitoring, and operational lifecycle of distributed stream-processing and batch-processing applications. It functions as a control plane and orchestrator for data pipelines, specifically providing management capabilities for Apache Flink and Hadoop YARN environments. The platform distinguishes itself through a low-code approach to task deployment and a multi-engine execution adapter that supports various processing runtimes. It facilitates real-time data pipeline management by combining streaming SQL analysis with a resource-based deployment pipeline that handles versioning, binary uploads, and savepoint-based state recovery. The system covers a broad set of capabilities, including distributed job orchestration, real-time data integration through prebuilt connectors, and identity integration through LDAP or SSO. It also provides observability tools for second-level application monitoring and automated operational failure notifications.
Handles compute-intensive operations including startup, savepoints, and performance analysis of stream jobs.
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
Performs bulk updates and exports across Google Workspace services using data sourced from CSV files.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Transfers bulk email accounts between different hosting providers using administrative or user credentials.