# Stream Processing Engines

> Search results for `stream processing engine for transforming data in flight` on awesome-repositories.com. 113 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/stream-processing-engine-for-transforming-data-in-flight

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/stream-processing-engine-for-transforming-data-in-flight).**

## Results

- [avelino/awesome-go](https://awesome-repositories.com/repository/avelino-awesome-go.md) (175,576 ⭐) — This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains.

The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,…
- [datatalksclub/data-engineering-zoomcamp](https://awesome-repositories.com/repository/datatalksclub-data-engineering-zoomcamp.md) (42,483 ⭐) — This project is an open-source educational curriculum designed to provide comprehensive training in data engineering. It focuses on building scalable data pipelines and managing cloud-native infrastructure through a structured, self-paced program that combines technical explanations with hands-on practical exercises.

The curriculum distinguishes itself by emphasizing industry-standard methodologies, specifically teaching students how to implement infrastructure as code and manage data workflows through orchestration tools. By utilizing container-based environment isolation and declarative con…
- [flightjs/flight](https://awesome-repositories.com/repository/flightjs-flight.md) (6,493 ⭐) — Flight is a JavaScript component framework and DOM interactivity library used to map behavioral logic to HTML nodes. It provides an event-driven architecture for building modular user interface elements and managing web application interactivity.

The library distinguishes itself through a mixin-based system for injecting reusable functions and properties into components, promoting code reuse without rigid inheritance. It further enables behavior modification via function hooking, allowing developers to wrap existing methods to inject custom logic without altering the original source code.

- [dataexpert-io/data-engineer-handbook](https://awesome-repositories.com/repository/dataexpert-io-data-engineer-handbook.md) (41,758 ⭐) — This project is a comprehensive, community-driven knowledge base designed to support individuals pursuing careers in data engineering. It functions as a centralized learning hub that aggregates industry best practices, technical documentation, and educational resources to assist with both professional development and the design of robust data pipeline architectures.

The repository distinguishes itself by providing a structured technical career roadmap that includes curated learning paths, interview preparation strategies, and practical project examples. By indexing a diverse range of media—in…
- [k88hudson/git-flight-rules](https://awesome-repositories.com/repository/k88hudson-git-flight-rules.md) (42,472 ⭐) — git-flight-rules is a collection of curated guidelines, operational resources, and a command reference for managing version control with Git. It provides a set of procedure-based rules and best practices designed to organize code history, branches, and collaborative development.

The project distinguishes itself by providing structured workflows for complex history manipulation and data recovery. This includes specific guidance on rewriting commit history to remove sensitive data, using the reference log to recover lost work, and employing binary searches to isolate regressions.

- [apache/pinot](https://awesome-repositories.com/repository/apache-pinot.md) (6,098 ⭐) — Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability.

The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer…
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and…
- [lucidrains/transformer-in-transformer](https://awesome-repositories.com/repository/lucidrains-transformer-in-transformer.md) (0 ⭐)
- [reactivex/rxkotlin](https://awesome-repositories.com/repository/reactivex-rxkotlin.md) (7,041 ⭐) — RxKotlin is a reactive programming library and asynchronous stream processor that provides Kotlin language extensions for composing event-based data streams. It serves as a set of Kotlin bindings for RxJava, allowing developers to transform, filter, and flatten sequences of data emitted over time.

The library focuses on integrating RxJava patterns into Kotlin projects by applying language-specific conventions and idioms. It utilizes extension functions to simplify reactive programming patterns, reduce boilerplate, and optimize workflows within the reactive ecosystem.

- [processing/processing](https://awesome-repositories.com/repository/processing-processing.md) (6,487 ⭐) — Processing is a creative coding environment and Java graphics library designed for writing visual sketches that produce interactive 2D and 3D graphics and animations. It runs on the Java Virtual Machine, using an OpenGL-based hardware-accelerated rendering pipeline, and operates on a sketch-based execution model where programs run as continuous loops of setup and draw functions with event-driven input handling for keyboard, mouse, and window interactions.

The environment distinguishes itself as a cross-platform sketch tool that runs visual programs unchanged on desktop, web, Android, and Rasp…
- [reactivex/rxpy](https://awesome-repositories.com/repository/reactivex-rxpy.md) (5,014 ⭐) — RxPY is a functional reactive programming library and a ReactiveX observable library for Python. It serves as an asynchronous stream processor and event-driven coordination framework used to build data pipelines that react to changes in state or streams of events over time.

The library provides a toolkit for composing asynchronous and event-based programs using observable sequences and operators. It distinguishes itself through the use of configurable schedulers to manage concurrency, timing, and subscription lifecycles.

The project covers a wide range of stream processing capabilities, incl…
- [hazelcast/hazelcast](https://awesome-repositories.com/repository/hazelcast-hazelcast.md) (6,570 ⭐) — Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources.

What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis…
- [hasbrain/data-engineer-roadmap](https://awesome-repositories.com/repository/hasbrain-data-engineer-roadmap.md) (898 ⭐) — Below you can find a chart demonstrating the paths that you can take and the milestones that you would want to achieve in order to become a data engineer. We spoke to senior data engineers and data engineering managers from top tech companies in the Silicon Valley, and consolidated learnings…
- [facebookresearch/map-anything](https://awesome-repositories.com/repository/facebookresearch-map-anything.md) (2,915 ⭐) — Map-anything is a 3D scene reconstruction framework and neural geometry estimator designed to transform two-dimensional images into metric three-dimensional spatial representations using feed-forward neural networks. It provides a specialized toolkit for predicting camera intrinsics and ray directions from single images without requiring external geometric metadata.

The project includes a 3D model benchmarking suite that utilizes a unified model wrapper to standardize outputs from diverse reconstruction models. This allows for consistent evaluation and accuracy measurement across various spat…
- [pathwaycom/pathway](https://awesome-repositories.com/repository/pathwaycom-pathway.md) (62,959 ⭐) — Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources.

The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in…
- [datastacktv/data-engineer-roadmap](https://awesome-repositories.com/repository/datastacktv-data-engineer-roadmap.md) (12,747 ⭐) — This project is a collection of specialized study guides and roadmaps centered on computer science, data engineering, and machine learning fundamentals. It provides a structured curriculum of technical competencies, tools, and skills required to transition into professional data engineering roles.

The project features a data engineering skill map that visually organizes databases, processing architectures, and infrastructure tools. It also includes a machine learning learning path covering supervised and unsupervised learning techniques alongside model operations.

- [fastapi/fastapi](https://awesome-repositories.com/repository/fastapi-fastapi.md) (99,260 ⭐) — FastAPI is a web framework for building APIs with Python. It leverages standard language type hints to provide automatic data validation, request parsing, and interactive API documentation generation. The framework supports asynchronous request handling and manages execution contexts to prevent blocking the main event loop.

The project includes a dependency injection system that allows for the resolution and injection of reusable components into request handlers. This system supports request-scoped caching, lifecycle management, and integration with security mechanisms like OAuth2 and JSON We…
- [gokumohandas/made-with-ml](https://awesome-repositories.com/repository/gokumohandas-made-with-ml.md) (48,343 ⭐) — Made-With-ML is an open-source educational resource that teaches how to design, develop, deploy, and iterate on production-grade machine learning applications.
- [apache/pulsar](https://awesome-repositories.com/repository/apache-pulsar.md) (15,276 ⭐) — Apache Pulsar is a cloud-native distributed pub-sub messaging system designed for high-performance data ingestion. It functions as a geo-replicated data streamer and a multi-tenant event streaming platform, providing a serverless stream processing engine and a tiered storage messaging broker.

The system distinguishes itself by separating serving layers from storage layers to allow independent scaling of compute and data retention. It features native geo-replication to synchronize messages across different geographical regions and employs a multi-layered tenant isolation model using authentica…
- [binance/binance-spot-api-docs](https://awesome-repositories.com/repository/binance-binance-spot-api-docs.md) (4,812 ⭐) — This project provides technical documentation and reference guides for spot trading, including specifications for REST, WebSocket, and FIX protocols. It serves as a comprehensive resource for integrating with spot trading endpoints to execute trades, query account data, and fetch market statistics.

The project distinguishes itself by supporting institutional-grade connectivity through the Financial Information eXchange standard and simple binary encoding to reduce latency and payload size. It also includes a dedicated sandbox environment for validating trading logic and strategies without fin…
- [apache/flink](https://awesome-repositories.com/repository/apache-flink.md) (26,086 ⭐) — Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations.

The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve…
- [nanosai/stream-ops-java](https://awesome-repositories.com/repository/nanosai-stream-ops-java.md) (50 ⭐) — Stream Ops is a fully embeddable data streaming engine and stream processing API for Java.
- [apache/incubator-storm](https://awesome-repositories.com/repository/apache-incubator-storm.md) (6,683 ⭐) — Apache Storm is a distributed stream processing framework and real-time data processing engine. It functions as a fault-tolerant distributed computing system designed to analyze data in motion across a cluster of machines for continuous stream computation.

The system enables the creation of fault-tolerant data pipelines and scalable event processing by distributing workloads across a network of computing nodes. This architecture ensures low latency and high throughput for live data while allowing the system to recover automatically from individual node failures.

- [fastai/fastai](https://awesome-repositories.com/repository/fastai-fastai.md) (27,862 ⭐) — Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models.

The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza…
- [o0morgan0o/gcode-generative-for-processing](https://awesome-repositories.com/repository/o0morgan0o-gcode-generative-for-processing.md) (33 ⭐) — Morgan Thibert -- 2019 -- Library for Processing 3
- [redpanda-data/connect](https://awesome-repositories.com/repository/redpanda-data-connect.md) (8,681 ⭐) — Connect is a Kafka data integration platform and stream processing engine used to build declarative pipelines that move and transform messages between Kafka topics and external sources. It functions as a Kafka Connect framework and a change data capture tool, streaming real-time database modifications to synchronize data across distributed environments.

The project differentiates itself through a dedicated mapping language for mutating and reshaping message payloads and the ability to execute custom processing logic within a sandboxed WebAssembly runtime. It also provides an observability pip…
- [fastai/course-v3](https://awesome-repositories.com/repository/fastai-course-v3.md) (4,914 ⭐) — This repository is a comprehensive educational program and deep learning framework designed to teach practical deep learning using PyTorch through notebooks and code examples. It serves as a high-level library for building, training, and deploying neural networks, acting as a model training orchestrator that coordinates PyTorch models, optimizers, and loss functions.

The project provides specialized toolkits for computer vision, natural language processing, and tabular data preprocessing. It distinguishes itself through advanced training controls such as discriminative learning rates, a two-w…
- [dagster-io/dagster](https://awesome-repositories.com/repository/dagster-io-dagster.md) (14,974 ⭐) — Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality.

The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
- [taskflow/taskflow](https://awesome-repositories.com/repository/taskflow-taskflow.md) (12,013 ⭐) — Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators.

The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow…
- [kimberlymunoz/empathy-in-engineering](https://awesome-repositories.com/repository/kimberlymunoz-empathy-in-engineering.md) (577 ⭐) — A curated list of resources for building and promoting more compassionate engineering cultures
- [apache/streampark](https://awesome-repositories.com/repository/apache-streampark.md) (4,312 ⭐) — StreamPark is a centralized management platform designed to coordinate the deployment, monitoring, and operational lifecycle of distributed stream processing and batch applications. It functions as a control plane and orchestrator for data pipelines, specifically providing management capabilities for Apache Flink and Hadoop YARN environments.

The platform distinguishes itself through a low-code approach to task deployment and a multi-engine execution adapter that supports diverse processing runtimes. It facilitates real-time data pipeline management by combining streaming SQL analytics with a…
- [functional-streams-for-scala/fs2](https://awesome-repositories.com/repository/functional-streams-for-scala-fs2.md) (2,447 ⭐) — Compositional, streaming I/O library for Scala
- [matz/streem](https://awesome-repositories.com/repository/matz-streem.md) (4,598 ⭐) — Streem is a stream-based programming language and data pipeline orchestrator. It provides a domain-specific language for defining concurrent data flows, allowing users to link data sources to destinations through a sequence of operations that transform and filter individual stream elements.

The system uses a custom script syntax to define data-flow connections and pipeline definitions. This allows for the orchestration of concurrent data processing where multiple pipeline stages execute simultaneously to move data elements through the system.

The platform covers functional data transformatio…
- [symfony/process](https://awesome-repositories.com/repository/symfony-process.md) (7,463 ⭐) — Symfony Process is a PHP library for executing external commands in separate operating-system processes with full lifecycle control. It provides a cross-platform command executor that handles OS-specific argument escaping and process management, enabling portable subprocess execution from PHP applications.

The library supports both synchronous and asynchronous process execution, allowing background subprocesses to run independently while the main PHP script continues. It includes executable path resolution to locate system commands across standard search directories, stream-based I/O pipes fo…
- [apache/spark](https://awesome-repositories.com/repository/apache-spark.md) (43,467 ⭐) — Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine.

The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets.

The engine incorporates relational query e…
- [hadley/r4ds](https://awesome-repositories.com/repository/hadley-r4ds.md) (5,070 ⭐) — r4ds is a data science curriculum and educational resource designed for mastering the R programming language. It provides a structured learning path for the end-to-end process of importing, tidying, transforming, and visualizing data.

The project emphasizes a reproducible data science guide and a comprehensive curriculum for data wrangling. It includes specialized tutorials on the grammar of graphics for layered data visualization and technical publications created with Quarto that blend executable code with narrative prose.

The material covers a broad range of analytical capabilities, inclu…
- [mosaicml/streaming](https://awesome-repositories.com/repository/mosaicml-streaming.md) (1,521 ⭐) — A Data Streaming Library for Efficient Neural Network Training
- [pentaho/pentaho-kettle](https://awesome-repositories.com/repository/pentaho-pentaho-kettle.md) (8,353 ⭐) — Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that utilizes a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines.

The system is distinguished by its distributed data processing engine, which executes workloads across clusters of server nodes to increase throughput. It employs a plugin-based architecture, allowing the platform to be extended via external JAR files to provide connectiv…
- [fahadshamshad/awesome-transformers-in-medical-imaging](https://awesome-repositories.com/repository/fahadshamshad-awesome-transformers-in-medical-imaging.md) (1,287 ⭐) — A collection of resources on applications of Transformers in Medical Imaging.
- [appsmithorg/appsmith](https://awesome-repositories.com/repository/appsmithorg-appsmith.md) (40,051 ⭐) — Appsmith is a low-code platform designed for building internal business tools, such as operational dashboards and administrative panels. It enables developers to construct dynamic user interfaces by dragging and dropping modular widgets onto a canvas and binding them directly to backend data sources. The platform utilizes a reactive framework that automatically updates interface elements and triggers functions whenever underlying data or widget properties change, eliminating the need for manual event handling.

The platform distinguishes itself through a server-side proxy architecture that exe…
- [spark-notebook/spark-notebook](https://awesome-repositories.com/repository/spark-notebook-spark-notebook.md) (3,144 ⭐) — This project is an interactive, web-based notebook environment designed for distributed data science and large-scale computing. It serves as a development tool for executing code and performing data analysis specifically within the Apache Spark framework, providing a browser-based interface that combines code execution with reactive data visualization.

The platform distinguishes itself through its deep integration with distributed infrastructure, allowing users to manage cluster resources, configure runtime dependencies, and isolate execution processes for individual notebooks. It supports co…
- [reactive-streams/reactive-streams-jvm](https://awesome-repositories.com/repository/reactive-streams-reactive-streams-jvm.md) (4,875 ⭐) — This project provides a formal specification and a set of standard Java interfaces for asynchronous stream processing. It defines a standardized protocol for passing sequences of elements between publishers and subscribers across different threads, centering on a reactive streams specification for the JVM.

The project focuses on interoperability by providing a common API that allows different asynchronous streaming libraries to work together. This is achieved through a standard set of interfaces and bridging mechanisms that translate between incompatible streaming specifications.

- [kamranahmedse/developer-roadmap](https://awesome-repositories.com/repository/kamranahmedse-developer-roadmap.md) (357,434 ⭐) — Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth.

The project distinguishes itself through a collaborative ecosystem that enables users to contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical proficiency, helping developers identify knowledge gaps…
- [lisadziuba/marketing-for-engineers](https://awesome-repositories.com/repository/lisadziuba-marketing-for-engineers.md) (13,153 ⭐) — Marketing-for-Engineers is a curated knowledge base and set of conceptual guides designed to help developers implement growth strategies, product marketing, and user acquisition methods. It serves as a structured resource for learning how to acquire initial users and scale digital products.

The project provides specific frameworks for content marketing, user acquisition strategies, and marketing automation. It includes guides for creating search engine optimized articles, executing cold outreach, and utilizing influencer partnerships to gain traction.

- [joelgrus/data-science-from-scratch](https://awesome-repositories.com/repository/joelgrus-data-science-from-scratch.md) (9,636 ⭐) — This project is a collection of foundational machine learning algorithms and data science tools implemented in Python. It focuses on building the logic of these tools using basic programming primitives rather than relying on specialized libraries.

The implementation covers several core domains, including a linear algebra library for matrix and vector operations, a statistical analysis toolkit for probability and hypothesis testing, and a framework for map-reduce distributed processing. It also includes implementations for natural language processing, graph theory for network analysis, and var…
- [microsoft/ai-agents-for-beginners](https://awesome-repositories.com/repository/microsoft-ai-agents-for-beginners.md) (67,369 ⭐) — This project is a structured educational resource and technical guide for designing and implementing autonomous systems using large language models. It provides a comprehensive curriculum and code samples focused on agentic design patterns, autonomous development, and the creation of systems capable of planning and executing multi-step tasks.

The resource details the implementation of agentic retrieval-augmented generation, where models autonomously plan and refine data searches. It covers a wide array of orchestrators and design patterns, including metacognitive reflection for self-correctin…
- [tencentmusic/cube-studio](https://awesome-repositories.com/repository/tencentmusic-cube-studio.md) (5,062 ⭐) — Cube Studio is a cloud-native MLOps platform and Kubernetes-based AI orchestrator designed for the entire machine learning lifecycle. It provides a distributed training framework for large-scale model fine-tuning, a GPU resource manager for hardware virtualization, and an ML pipeline orchestrator that uses visual directed acyclic graphs to manage end-to-end workflows.

The platform distinguishes itself through its specialized LLM inference server, which supports retrieval-augmented generation and the construction of private knowledge bases. It features a dedicated system for supervised fine-tu…
- [mafintosh/stream-each](https://awesome-repositories.com/repository/mafintosh-stream-each.md) (38 ⭐) — Iterate all the data in a stream
- [datajuicer/data-juicer](https://awesome-repositories.com/repository/datajuicer-data-juicer.md) (6,574 ⭐) — Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines.

The project distinguishes itself through a YAML-based data recipe sys…
- [leeoniya/uplot](https://awesome-repositories.com/repository/leeoniya-uplot.md) (10,266 ⭐) — uPlot is a high-performance canvas time series charting library designed to render millions of data points with high frame rates. It functions as a high-frequency data visualizer and a real-time data stream plotter, utilizing the HTML5 Canvas API to maintain responsiveness when plotting large temporal datasets.

The project distinguishes itself as a plugin-based visualization framework that allows for custom renderers to create specialized visuals such as heatmaps and box-and-whisker plots. It also serves as an interactive financial charting tool, specifically supporting OHLC charts, bars, and…
