Pathway
Pathway is a high-performance data processing framework for building unified batch and streaming pipelines. It orchestrates complex data transformations on a differential dataflow engine, which processes updates incrementally instead of recomputing results from scratch. Because static datasets and continuous event streams run through identical logic, the same pipeline yields consistent results across diverse data sources, and the platform provides exactly-once processing semantics.
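The incremental idea can be illustrated with a small plain-Python sketch. It does not use Pathway's API; the `IncrementalSum` class and the `(key, value, diff)` update format are assumptions made for this illustration. Each update either inserts a row (`diff=+1`) or retracts it (`diff=-1`), and the operator adjusts only the affected key instead of recomputing every total.

```python
from collections import defaultdict


class IncrementalSum:
    """Hypothetical operator: maintains per-key sums from (key, value, diff) updates."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, key, value, diff):
        """Apply one update and return the output changes as (key, total, diff)."""
        changes = []
        if self.counts[key] > 0:
            # Retract the total that was previously emitted for this key.
            changes.append((key, self.totals[key], -1))
        self.totals[key] += diff * value
        self.counts[key] += diff
        if self.counts[key] > 0:
            changes.append((key, self.totals[key], +1))
        else:
            # No rows remain for this key, so its state is dropped.
            del self.totals[key]
            del self.counts[key]
        return changes


def run(updates):
    """Feed a finite list of updates through the operator; return final totals."""
    op = IncrementalSum()
    for key, value, diff in updates:
        op.apply(key, value, diff)
    return dict(op.totals)


# Batch input: a finite collection of insertions.
batch = [("a", 1, +1), ("b", 5, +1), ("a", 2, +1)]
batch_result = run(batch)

# Streaming input: the same rows arrive one by one, then one row is retracted.
stream_op = IncrementalSum()
for update in batch:
    stream_op.apply(*update)
stream_before = dict(stream_op.totals)
changes = stream_op.apply("a", 1, -1)  # only key "a" is touched
```

In this sketch the batch run and the streaming run agree once they have seen the same rows, and the later retraction produces changes for key `"a"` only.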
The framework distinguishes itself through specialized support for real-time artificial intelligence and retrieval-augmented generation. Its vector-aware data ingestion automates the creation and maintenance of searchable document indexes, which are updated as new data arrives. Developers can connect language models directly to their pipelines and use built-in capabilities for document chunking, embedding generation, and result reranking to keep retrieval synchronized with the underlying data.
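The chunk, embed, index, and rerank steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not Pathway's API: `LiveIndex`, `chunk`, `embed`, and `rerank` are hypothetical names, the "embedding" is a bag-of-words count vector standing in for a real model, and the reranker scores simple term overlap.

```python
import math
from collections import Counter


def chunk(text, max_words=8):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts (a stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LiveIndex:
    """Hypothetical index that stays in sync as documents are added or replaced."""

    def __init__(self):
        self.entries = {}  # (doc_id, chunk_no) -> (chunk text, vector)

    def upsert(self, doc_id, text):
        """Replace all chunks of doc_id with chunks of the new text."""
        for key in [k for k in self.entries if k[0] == doc_id]:
            del self.entries[key]
        for n, piece in enumerate(chunk(text)):
            self.entries[(doc_id, n)] = (piece, embed(piece))

    def search(self, query, k=3):
        """Return up to k chunk texts with a positive similarity to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.entries.values()]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]


def rerank(query, candidates, top_n=1):
    """Toy reranker: order candidates by the share of query terms they contain."""
    terms = set(query.lower().split())

    def overlap(text):
        return len(terms & set(text.lower().split())) / max(len(terms), 1)

    return sorted(candidates, key=overlap, reverse=True)[:top_n]


index = LiveIndex()
index.upsert("pricing", "The basic plan costs ten dollars per month")
index.upsert("support", "Support is available by email on weekdays")

query = "how much does the basic plan cost"
first = rerank(query, index.search(query))

# The source document changes; the index reflects it on the next query.
index.upsert("pricing", "The basic plan costs twelve dollars per month")
second = rerank(query, index.search(query))
```

Because `upsert` removes a document's old chunks before adding the new ones, a query issued after the update never sees stale text.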
Beyond its core processing capabilities, the platform provides infrastructure for deploying data applications. Moving a workflow from batch to streaming requires only a change of input connectors, and the containerized deployment model allows services to scale across local and cloud environments. The system is designed for large-scale event-driven tasks and offers one consistent programming model for both analytics and automated content generation workflows.
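The connector swap can be pictured with a short plain-Python sketch. The names `static_source`, `queue_source`, and `pipeline` are hypothetical and are not Pathway functions; the point is only that the transformation logic takes its input as an argument, so changing the source does not change the logic.

```python
import queue


def static_source(rows):
    """Batch connector (hypothetical): yields a fixed, finite collection of rows."""
    yield from rows


def queue_source(q):
    """Streaming connector (hypothetical): yields rows until a None sentinel arrives."""
    while True:
        row = q.get()
        if row is None:
            return
        yield row


def pipeline(source):
    """Transformation logic, independent of where the rows come from."""
    for row in source:
        if row["status"] == "ok":
            yield {"user": row["user"], "amount_cents": round(row["amount"] * 100)}


rows = [
    {"user": "ann", "amount": 12.5, "status": "ok"},
    {"user": "bob", "amount": 3.25, "status": "failed"},
    {"user": "cy", "amount": 7.0, "status": "ok"},
]

# Batch run: the input is a static collection.
batch_out = list(pipeline(static_source(rows)))

# Streaming run: only the connector changes, the pipeline is untouched.
q = queue.Queue()
for row in rows:
    q.put(row)
q.put(None)  # end-of-stream marker used by this sketch
stream_out = list(pipeline(queue_source(q)))
```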
Features
- AI Pipeline Orchestrators - A development environment for building automated workflows that connect language models with live data sources for intelligent content generation.
- Declarative Pipeline Construction - Defines complex data transformation workflows as a static graph of operations that the engine optimizes before execution.
- Exactly-Once Processing Semantics - Guarantees that every input record is accounted for exactly once through robust checkpointing and deterministic operator execution logic.
- Data Processing Pipelines - Run high-performance data transformation tasks using a unified engine that handles both batch and streaming sources while ensuring every record is processed exactly once.
- Data Stream Processors - Run batch or real-time transformation tasks through a unified engine that stays compatible with standard programming environments for analytics and event processing.
- Differential Dataflow Engines - Processes data updates incrementally by tracking changes through a directed graph of operators to ensure consistent real-time results.
- Stream Processing Engines - A high-performance data processing framework that executes complex transformations on both batch and real-time streaming data sources with consistent logic.
- Unified Batch-Stream Processing Engines - Executes identical logic for both static datasets and continuous event streams by treating batch data as a finite stream.
- Unified Batch and Stream Processors - Supports data applications that handle both static historical records and live incoming events with a single, consistent programming model.
- Enterprise RAG Frameworks - Supports retrieval-augmented generation systems that process, chunk, and rerank documents to answer questions from private data stores.
- Vector-Aware Data Ingestion - Integrates embedding generation and document chunking directly into the pipeline to maintain synchronized searchable indexes for language models.
- Language Model Connectors - Connect data pipelines to external text generation and embedding services by applying model wrappers to specific columns containing prompts for automated content processing.
- Vector Data Ingestion Frameworks - A specialized toolset for automating the creation and real-time updating of searchable document indexes within large-scale data processing pipelines.
- Streaming Data Pipelines - Convert static batch processing pipelines into continuous streaming workflows by updating input connectors while preserving the underlying logic used to transform your data.
- Vector Search Indexes - Automates the creation and maintenance of searchable document indexes that are updated as new data arrives from external sources.
- Incremental State Management - Maintains intermediate computation results in memory to avoid recomputing entire pipelines when only a small portion of data changes.
- Real-Time AI Pipelines - Connects live data streams to language models in automated workflows for context-aware content generation and analysis.
- Distributed Data Platforms - A deployment-ready infrastructure for scaling containerized data services across local and cloud environments with reliable performance and network connectivity.
- Reranking Engines - Improve search accuracy by evaluating the relevance of retrieved documents against user queries using reranking models to filter and select the most significant information.
- Data Application Deployment - Deploy data processing services to local or cloud environments using containerized images or standard execution methods, so the same software runs across different infrastructure setups.
- Vector Document Indexing - Automate the creation of searchable document indexes that update in real-time by integrating external vector database clients directly into your data ingestion workflows.
- Document Chunking Utilities - Convert raw files into structured text and divide large documents into smaller, manageable segments using specialized parsers and token-based splitters for improved model performance.
- Web Service Deployments - Deploy containerized web applications to cloud hosting environments by linking your source code repository and defining the necessary network port configurations for public access.
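The exactly-once guarantee listed among the features rests on checkpointing and deterministic operators. The plain-Python sketch below shows the general technique only; it is not Pathway's persistence mechanism, and `CheckpointedCounter` and its methods are hypothetical. State and input offset are saved together, so after a crash the operator rolls back to the last checkpoint and ignores any record it has already counted.

```python
import copy


class CheckpointedCounter:
    """Hypothetical operator: counts keys and checkpoints (offset, state) together."""

    def __init__(self):
        self.state = {}
        self.next_offset = 0
        self._checkpoint = (0, {})

    def process(self, offset, key):
        """Count one record, skipping offsets already reflected in the state."""
        if offset < self.next_offset:
            return  # duplicate delivery after a replay
        self.state[key] = self.state.get(key, 0) + 1
        self.next_offset = offset + 1

    def checkpoint(self):
        """Save the offset and the state as one unit."""
        self._checkpoint = (self.next_offset, copy.deepcopy(self.state))

    def recover(self):
        """Roll back to the last checkpoint and return the offset to resume from."""
        self.next_offset, saved = self._checkpoint
        self.state = copy.deepcopy(saved)
        return self.next_offset


log = ["a", "b", "a", "c", "a"]  # input records, addressed by offset

op = CheckpointedCounter()
for offset in range(3):
    op.process(offset, log[offset])
op.checkpoint()
op.process(3, log[3])  # processed, but lost in the simulated crash below

resume_from = op.recover()  # simulated crash and restart

# The source replays from one record earlier than needed (at-least-once delivery).
for offset in range(resume_from - 1, len(log)):
    op.process(offset, log[offset])

final_counts = dict(op.state)
```

Despite the crash and the replayed record, each input contributes to the final counts exactly once, which matches a run with no failure.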