38 dépôts
Utilities for distributing computational workloads across multiple processor cores.
Distinguishing note: Focuses on data-parallel execution patterns rather than general concurrency primitives.
Explore 38 awesome GitHub repositories matching data & databases · Parallel Processing. Refine with filters or upvote what's useful.
Career-ops is an AI-driven job search automation system designed to manage the entire application lifecycle, from discovery to tracking. It functions as a career copilot that utilizes autonomous agents to identify vacancies, evaluate professional fit, and generate tailored application materials. The project distinguishes itself through a multi-archetype persona management system and writing style calibration, allowing users to maintain different professional identities and a consistent voice across documents. It employs a multi-dimensional weighted scoring system to evaluate job suitability a
Accelerates the screening process by using autonomous sub-agents to evaluate multiple job opportunities in parallel.
RxJava is a reactive stream processing framework and JVM reactive extensions library. It serves as an asynchronous dataflow orchestrator used to compose event-based programs by transforming, combining, and consuming real-time data flows on the Java Virtual Machine. The project distinguishes itself through integrated backpressure flow control, which manages the emission rate between producers and consumers to prevent memory exhaustion. It further provides mechanisms for concurrent thread management and parallel data processing to offload blocking operations and maintain application responsiven
Enables parallel processing of data streams across multiple processor cores for increased efficiency.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Distributes data processing tasks across available CPU cores to maximize throughput.
cloc is a codebase metrics tool and multi-language code analyzer designed to count blank lines, comment lines, and physical lines of code. It serves as a source code line counter and report generator that identifies file types to calculate source volume across a wide variety of programming languages. The tool distinguishes itself by providing codebase version comparison to measure relative changes in source and comment lines between two versions of a directory or archive. It also supports the definition of custom languages and the extension of language recognition by loading custom comment fi
Distributes file scanning and line counting workloads across multiple CPU cores to accelerate analysis of large codebases.
oxc is a high-performance JavaScript toolchain developed in Rust for parsing, transforming, and analyzing JavaScript and TypeScript source code. It provides a set of core utilities including a parser that converts code into an abstract syntax tree, a linter for identifying problematic patterns, a formatter for standardizing visual style, and a minifier for reducing production file sizes. The project focuses on high-performance execution through a system design that utilizes single-pass parsing, zero-copy string slicing, and parallel worker processing to handle large codebases. It further opti
Distributes linting and formatting tasks across multiple CPU cores to process large codebases concurrently.
This project is a comprehensive educational resource and technical documentation suite for learning and developing deep learning models. It serves as an open-source textbook, implementation manual, and framework tutorial designed to guide users through the mathematical foundations and practical application of neural networks. The resource provides detailed instructional content on building various model architectures, including convolutional and recurrent neural networks. It includes a dedicated distributed training guide and a learning path that covers the fundamentals of tensors, automatic
Instructs on distributing data loading and preprocessing workloads across multiple processors to increase throughput.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Distributes complex analytical workloads across a cluster of nodes for parallel execution.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Splits large PDF files into page batches for concurrent processing to reduce total execution time.
Dask est un framework de calcul parallèle et un planificateur de tâches distribué conçu pour mettre à l'échelle les flux de travail de science des données Python, des machines uniques aux grands clusters. Il fonctionne comme un gestionnaire de ressources de cluster qui orchestre la logique computationnelle en représentant les tâches et leurs dépendances sous forme de graphes acycliques dirigés. Cette architecture permet au système d'automatiser la distribution des charges de travail sur le matériel disponible tout en gérant des exigences d'exécution complexes. Le projet se distingue par un moteur d'évaluation paresseuse qui diffère les opérations sur les données jusqu'à ce qu'elles soient explicitement demandées, permettant une optimisation globale du graphe et une allocation efficace des ressources. Il intègre le déversement de données conscient de la mémoire pour éviter les plantages du système lors du traitement de jeux de données dépassant la mémoire disponible, et il utilise la fusion de graphes de tâches pour combiner des séquences d'opérations en étapes d'exécution uniques, minimisant la surcharge de planification et la communication entre nœuds. La plateforme fournit une surface de capacités complète pour l'analyse de données à grande échelle, incluant le support pour l'apprentissage automatique distribué, l'intégration du calcul haute performance et le traitement de données parallèle. Elle offre des outils étendus pour la gestion du cycle de vie des clusters, le profilage des performances et la surveillance en temps réel de l'exécution des tâches. Les utilisateurs peuvent déployer ces environnements sur diverses infrastructures, incluant le matériel local, les fournisseurs cloud, les systèmes conteneurisés et les clusters de calcul haute performance.
Scales data analysis workflows by distributing computational tasks across multiple cores and distributed cluster nodes.
Rayon is a data parallelism library for Rust that provides a framework for converting sequential computations into parallel operations. It enables the transformation of standard data structures and loops into parallel iterators, allowing workloads to be distributed across multiple processor cores. By utilizing a work-stealing scheduler, the library dynamically balances tasks to maximize throughput and minimize execution time. The library distinguishes itself through its focus on safe, scoped task synchronization, which ensures that all spawned operations complete before a scope exits to preve
Provides parallel iterators and work-stealing algorithms to distribute data processing workloads across multiple CPU cores.
Trino is a distributed SQL query engine designed for large-scale data analytics. It functions as a data federation platform, providing a unified interface that allows users to execute complex analytical queries across multiple heterogeneous data sources simultaneously without requiring data movement or transformation. The engine utilizes a massively parallel processing architecture to scale compute resources across clusters for high-speed data retrieval. It distinguishes itself through a cost-based query optimizer that analyzes metadata to determine efficient execution plans, alongside dynami
Utilizes a massively parallel processing engine to scale compute resources for high-speed data retrieval.
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
Runs multiple independent processing branches simultaneously and coordinates their output into a single stream.
Cpp-taskflow is a C++ task-parallelism framework and task graph scheduler designed to manage and execute complex dependency graphs of parallel tasks across CPU and GPU hardware. It provides a parallel algorithm library for high-performance implementations of reductions, sorts, pipelines, and iterations. The framework distinguishes itself through its ability to offload heavy computational workloads from a task graph to graphics processors for acceleration. It also includes a task profiling tool and a performance analysis interface for visualizing task execution flow and dependency structures t
Implements nested dependency structures within C++ workflows to handle recursive computations in parallel.
Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators. The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow
Combines two sorted sequences into a single sorted sequence using parallel processing to improve throughput.
Clojure is a general-purpose, functional programming language hosted on the Java Virtual Machine. It is a homoiconic S-expression language that represents programs as nested data structures, allowing code to be manipulated and evaluated as data. The project provides a framework for JVM interoperability, enabling the invocation of Java methods and integration with other JVM-based languages. It distinguishes itself through a persistent data structure library that uses bitmapped vector tries to manage immutable collections and a programmatic REPL for interactive software development and real-tim
Implements parallel data processing by splitting large tasks across available CPU cores using a fork-join model.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Distributes data manipulation workloads across multiple processor cores to increase throughput for large scale dataframes.
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
Resolves and propagates data artifacts from multiple parallel branches into a single join step.
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Distributes heavy statistical feature extraction workloads across multiple CPU cores to process large time series datasets efficiently.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Executes high-speed grouping and aggregation operations across multiple processor cores to accelerate analysis.
This is a scikit-learn automated machine learning framework designed to optimize model selection and hyperparameters. It functions as an automated model selector and hyperparameter optimization tool for classification and regression tasks, utilizing an automated ensemble builder to combine high-performing models for increased predictive accuracy. The system features a distributed search engine that uses Dask for parallel machine learning optimization across CPU cores or clusters. It implements a budget-based evaluation strategy through successive halving to prioritize promising model configur
Distributes automated searches and ensemble building across a cluster using Dask for large-scale parallel processing.