67 Repos
Tools designed for high-throughput, non-real-time data operations, differing from streaming systems by focusing on discrete, chunked data execution.
Explore 67 GitHub repositories in Data & Databases · Batch Processing Systems.
Developer Roadmap is a community-driven platform that offers structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository in which technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project is distinguished by a collaborative ecosystem that lets users contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical competence, helping developers identify knowledge gaps and prepare for professional interviews through targeted learning sequences. Beyond its core mapping features, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a central space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
Provides sequential access to elements within large data collections during processing.
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It offers a structured curriculum for mastering the principles of scalability, reliability, and performance needed to design complex software systems. The repository is distinguished by a methodical approach to technical interview preparation that integrates design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-driven analysis and teaches users how to weigh competing requirements such as latency, consistency, and availability when designing architectures. The content covers a broad range of system design skills, including strategies for database scaling, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-layer caching, asynchronous communication, and service discovery, and also provides frameworks for resource estimation and capacity planning. The documentation is organized as a study guide and offers a systematic path through the fundamentals of backend engineering and large-scale system design.
Provides helper libraries and scripts that assist in the scheduling, monitoring, and management of batch processing jobs.
Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users to map facial identities between source and destination datasets while maintaining structural alignment and lighting consistency across video frames. The project distinguishes itself through a highly extensible plugin-based architecture that handles hardware-accelerated process
Performs batch operations on aligned data by adjusting matrices and extracting specific regions from source imagery.
LevelDB is an embedded database library and persistent storage engine that provides a sorted key-value store. It uses a log-structured merge-tree architecture to map byte arrays to values, running directly within a process to provide storage without the need for a separate server process. The system is distinguished by its use of custom comparison functions to define key ordering, enabling efficient range scans and sequenced lookups. It ensures data reliability through atomic batch execution, consistent snapshot generation, and log-based recovery after failures. The engine covers broad capab
Provides sequential iterators for traversing stored entries in forward or backward order.
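The ideas in the LevelDB entry (sorted keys, forward and backward traversal, all-or-nothing batches) can be illustrated with a small in-memory sketch. This is not LevelDB's API: the class `SortedStore` and its methods `write_batch`, `scan`, and `scan_reverse` are hypothetical names chosen for illustration, and the sketch has no persistence, snapshots, or log-based recovery.

```python
import bisect


class SortedStore:
    """Toy in-memory stand-in for a sorted key-value store (hypothetical, not LevelDB)."""

    def __init__(self):
        self._keys = []     # keys kept in sorted order
        self._values = {}   # key -> value

    def write_batch(self, operations):
        """Apply every operation or none: validate the whole batch, then mutate."""
        operations = list(operations)
        for operation in operations:
            expected_length = {"put": 3, "delete": 2}.get(operation[0])
            if expected_length is None or len(operation) != expected_length:
                raise ValueError(f"invalid operation: {operation!r}")
        for operation in operations:
            kind, key = operation[0], operation[1]
            if kind == "put":
                if key not in self._values:
                    bisect.insort(self._keys, key)
                self._values[key] = operation[2]
            elif key in self._values:
                del self._values[key]
                del self._keys[bisect.bisect_left(self._keys, key)]

    def scan(self, start=None):
        """Yield (key, value) pairs in ascending key order, optionally from `start`."""
        index = 0 if start is None else bisect.bisect_left(self._keys, start)
        for key in self._keys[index:]:
            yield key, self._values[key]

    def scan_reverse(self):
        """Yield (key, value) pairs in descending key order."""
        for key in reversed(self._keys):
            yield key, self._values[key]


store = SortedStore()
store.write_batch([("put", b"b", b"2"), ("put", b"a", b"1"), ("put", b"c", b"3")])
for key, value in store.scan(start=b"b"):
    print(key, value)
```

Validating the whole batch before touching any state is the simplest way to get all-or-nothing behaviour in memory; a real storage engine would rely on a write-ahead log instead.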
Immutable.js is a library of persistent data structures and a functional state management toolkit. It provides a collection of immutable objects and arrays that prevent direct mutation to ensure predictable state management in JavaScript applications. The library utilizes structural sharing to efficiently create new versions of data without full copying and implements lazy sequence processing to chain data transformations that execute only when values are requested. It also supports batch mutation processing, allowing multiple changes to be applied to a temporary mutable copy before returning
Implements memory-efficient lazy iterators that defer data transformations until values are explicitly requested.
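Lazy sequence processing of the kind described here can be sketched with Python generators. This is an analogue of the concept, not Immutable.js code: the helpers `lazy_map` and `lazy_filter` and the recording function `square` are hypothetical names used only to show that work happens when values are requested.

```python
from itertools import count, islice


def lazy_map(fn, items):
    """Yield fn(item) one at a time; nothing runs until a value is requested."""
    for item in items:
        yield fn(item)


def lazy_filter(predicate, items):
    """Yield only the items for which predicate(item) is true, on demand."""
    for item in items:
        if predicate(item):
            yield item


calls = []  # records which inputs were actually transformed


def square(number):
    calls.append(number)
    return number * number


# Chain transformations over an infinite source; building the chain does no work.
pipeline = lazy_filter(lambda value: value % 2 == 1, lazy_map(square, count(1)))

# Only now are values computed, and only as many as are needed.
first_three = list(islice(pipeline, 3))
print(first_three)  # [1, 9, 25]
print(calls)        # [1, 2, 3, 4, 5]
```

Because the source is infinite, an eager version of the same chain would never finish; the lazy version touches only the five inputs needed to produce three results.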
Prompt Optimizer is a framework designed for the iterative refinement and testing of text-based instructions for large language models. It functions as an automated evaluation pipeline that systematically adjusts prompt structure, constraints, and clarity to improve the accuracy and consistency of model outputs. The system distinguishes itself through a model-agnostic interface that standardizes communication across different artificial intelligence providers. It incorporates a versioned asset management system to track prompt history, enabling developers to maintain consistency and perform r
Executes multiple test cases in parallel to measure performance metrics and verify the reliability of prompt changes.
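Running a batch of test cases in parallel and collecting simple metrics might look like the sketch below. It does not use Prompt Optimizer's interface: `run_case` and `run_cases` are hypothetical helpers, and the "model" is a placeholder string transformation standing in for a real provider call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_case(case):
    """Hypothetical check; the upper-casing below stands in for a model call."""
    prompt, expected = case
    started = time.perf_counter()
    output = prompt.strip().upper()  # placeholder for the real model response
    return {
        "prompt": prompt,
        "passed": output == expected,
        "seconds": time.perf_counter() - started,
    }


def run_cases(cases, max_workers=4):
    """Run all cases concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))


cases = [(" hello ", "HELLO"), ("batch", "BATCH"), ("oops", "nope")]
results = run_cases(cases)
pass_rate = sum(result["passed"] for result in results) / len(results)
print(f"pass rate: {pass_rate:.2f}")
```

Threads suit this shape of work when each case mostly waits on network I/O; the per-case timing and the aggregate pass rate are the kind of metrics such a run could report.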
VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It functions as an AI voice cloning tool and a synthetic voice designer, capable of generating natural speech across global languages and regional dialects using a GPU-accelerated audio generator. The project features a speech model fine-tuning framework that supports both full parameter updates and low-rank adaptation for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and acoustic environment mimicry, as well as the crea
Converts text files into separate audio files by treating each line as an individual synthesis task.
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Executes prompt logic across multiple inputs simultaneously to improve throughput.
Scrapegraph-ai is a Python framework that uses large language models to automate the extraction of structured data from websites and documents. It functions as an AI-driven data extraction pipeline that converts unstructured web content into structured formats using natural language processing and graph-based logic. The project utilizes graph-based task orchestration to model scraping workflows as interconnected nodes. It features a pluggable model interface for connecting to cloud or local artificial intelligence providers and can generate executable Python code on the fly to handle site-spe
Transforms extracted website information into audio files for accessibility or alternative content consumption.
Ultimate Vocal Remover is a desktop application designed for AI-driven audio source separation. It utilizes deep learning models to isolate vocals, drums, and other individual instruments from mixed audio files, providing a utility for professional production and creative editing workflows. The software distinguishes itself by leveraging GPU-accelerated tensor computation to perform complex signal processing tasks, significantly reducing the time required for high-fidelity audio extraction. It incorporates a modular plugin architecture that integrates external utilities to support a wide rang
Automates the separation and conversion of large music libraries through sequential file queuing.
Lama Cleaner is an AI-powered image editing application focused on inpainting, object removal, and generative filling. It provides a suite of tools for erasing unwanted elements from photos and filling the resulting gaps using generative artificial intelligence. The project includes specialized capabilities for image outpainting to extend borders, background removal through object segmentation, and face restoration to fix visual defects. It also features an image upscaler to increase resolution and clarity via super-resolution AI, as well as a Stable Diffusion-based editor for replacing speci
Provides a command-line utility for executing generative filling and expansion tasks across entire image folders.
Rembg is a machine learning-based toolkit designed for automated image background removal and subject segmentation. It functions as a versatile engine that identifies and extracts subjects from images, supporting diverse input methods including individual files, directory-based batch processing, and live binary data streams. The project distinguishes itself through its flexible integration options, offering a command-line interface for local automation, a library for programmatic access, and an HTTP service for remote requests. It utilizes deep learning architectures to classify pixels and ge
The project supports automated background removal for entire directories of images, including watch-folder functionality for real-time processing of new or modified files.
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Implements lazy, memory-efficient iterators to process large datasets on demand without loading them into physical memory.
This library is a collection of generic utilities for the Go programming language designed to simplify the manipulation of slices and maps. It provides a functional toolkit that enables developers to perform data transformations, such as filtering, mapping, and reducing, while maintaining strict type safety through the use of language-level generics. The project distinguishes itself by offering a dual approach to data processing that balances functional programming patterns with performance-oriented execution. It supports both immutable functional pipelines for predictable state transitions a
Provides a comprehensive toolkit for memory-efficient, lazy data traversal and deferred computation of large or infinite sequences in Go.
Excelize is a library for reading and writing spreadsheet files in the Office Open XML format. It provides a comprehensive suite of tools for programmatically creating, modifying, and analyzing workbooks, worksheets, and cell data, ensuring compatibility across various office software suites through structured XML serialization. The library distinguishes itself with a built-in formula calculation engine that evaluates complex mathematical and logical expressions directly against workbook data. It also features a memory-mapped streaming architecture, which allows for the efficient processing o
Emits data iteratively to maintain low memory usage during large-scale file processing.
Wagtail is an open-source content management system built on the Django web framework. It provides a structured, tree-based approach to content modeling, allowing developers to define custom page types and reusable content components that are managed through a highly customizable administrative interface. The platform distinguishes itself through its flexible, block-based content composition system, which enables editors to assemble complex page layouts dynamically. It also offers robust support for multi-site and multi-lingual environments, allowing organizations to manage distinct websites
Generates multiple image renditions in a single batch operation to improve performance.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Ensures data integrity through atomic output handling and automated retry logic for batch processing.
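The two behaviours named in the Luigi entry, skipping work whose output already exists and writing outputs atomically, can be sketched with the standard library alone. This is not Luigi's API: `run_task` is a hypothetical helper, and the sketch assumes the temporary file and the final output live on the same filesystem so that `os.replace` is atomic.

```python
import os
import tempfile


def run_task(output_path, produce):
    """Run `produce` only if the output is missing, and publish it atomically.

    Returns True if the task ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output target already exists, so the work is done

    directory = os.path.dirname(os.path.abspath(output_path))
    # Write to a temporary file in the same directory as the final output.
    descriptor, temporary_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(descriptor, "w", encoding="utf-8") as handle:
            handle.write(produce())
        # Rename into place: readers see either no file or the complete file.
        os.replace(temporary_path, output_path)
    except BaseException:
        if os.path.exists(temporary_path):
            os.remove(temporary_path)
        raise
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "report.txt")
    print(run_task(target, lambda: "42 rows processed\n"))  # True: task ran
    print(run_task(target, lambda: "never written\n"))      # False: skipped
```

Because a failed run leaves no partial output behind, a scheduler can safely retry the task: the existence check will not mistake a half-written file for a finished one.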
This project is a collection of implementation guides, recipes, and developer resources for building applications with Llama models. It serves as a comprehensive kit for developing autonomous agents, establishing retrieval-augmented generation systems, and executing model fine-tuning. The resource provides specific patterns for multimodal workflows that process text, images, and audio. It includes specialized guidance on adapting pre-trained model weights for targeted tasks and implementing tool-calling orchestration to connect models with external APIs and functions. The codebase covers a b
Transforms PDF content into multi-speaker scripts and audio files using a sequence of specialized models.
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Improves throughput by executing large-scale reasoning tasks in parallel using dynamic batch sizing.
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
Implements data iterators to stream large text collections from disk, avoiding memory exhaustion.
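Streaming a corpus from disk one document at a time can be sketched as an iterable class. The example uses only the standard library and does not call Gensim itself: `LineCorpus` is a hypothetical name, and the one-document-per-line layout and whitespace tokenization are simplifying assumptions.

```python
import os
import tempfile


class LineCorpus:
    """Iterable that yields one tokenized document per line of a text file.

    Only the current line is held in memory, so the file may be far larger
    than available RAM. A new file handle is opened on every pass, which
    lets multi-pass training code iterate over the corpus repeatedly.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


with tempfile.TemporaryDirectory() as workdir:
    corpus_path = os.path.join(workdir, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("Batch jobs run nightly\n\nStreams run continuously\n")

    corpus = LineCorpus(corpus_path)
    for document in corpus:
        print(document)
```

Using a class with `__iter__` rather than a bare generator matters when an algorithm needs several passes over the data, since a generator is exhausted after the first pass.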