30 open-source projects similar to splware/esproc, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best esProc alternative.
Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that utilizes a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system is distinguished by its distributed data processing engine, which executes workloads across clusters of server nodes to increase throughput. It employs a plugin-based architecture, allowing the platform to be extended via external JAR files to provide connectiv
Otter is a distributed database synchronization system and change data capture tool designed to replicate data between databases across multiple geographic regions. It functions as a synchronization orchestrator and ETL data pipeline that mirrors records and associated files in real time. The system employs incremental log parsing to capture database changes and utilizes a consistency-based convergence algorithm and loop-avoidance logic to manage bi-directional replication. It processes data through a pipeline of selection, extraction, transformation, and loading to handle joins and format co
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through ad
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
JRuby is a Ruby language implementation that runs on the Java Virtual Machine. It serves as a cross-language runtime and execution environment, allowing Ruby code to run on the JVM and share memory with Java applications. The project functions as a bridge between Ruby and Java, enabling Ruby scripts to call Java classes and libraries directly. It also provides a mechanism to embed a Ruby interpreter into Java applications to allow for dynamic scripting. The runtime leverages the JVM for system scalability and ensures a consistent execution environment across different operating systems.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools and includes a Model Context Protocol server to connect AI agents to structured information retrieval. The system ensures data accuracy through a verification tool featuring dual-model verification and human-in-the-loop review with coordinate-based document highlighting. It utilizes natural language extraction schemas
OctoSQL is a federated SQL query engine, data transformer, and streaming SQL processor. It allows users to execute single SQL statements across multiple disparate data sources, including different database types and file formats, to merge and transform results into a unified set. The system distinguishes itself by treating CSV, JSONLines, and Parquet files as virtual tables and utilizing a plugin-based architecture to extend connectivity to external storage engines. It functions as a streaming processor for infinite data streams, using watermarks, retractions, and tumbling windows to maintain
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
dlt is a Python data ingestion tool and ETL pipeline framework designed to fetch data from diverse sources and persist it into structured destinations. It functions as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources to lakehouses, warehouses, or vector databases. The project distinguishes itself through AI-powered pipeline generation, using large language models to scaffold extraction code and connectors for REST APIs. It also supports multimodal vector storage and specialized population of ve
Mage AI is a Python-based data pipeline orchestrator and self-hosted integrated development environment for data work. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and an interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
This project is a collection of big data frameworks and pipelines, including an Apache Hive analysis framework, a behavioral data analytics platform, a predictive analytics engine, and real-time data pipelines. It provides the infrastructure for building Extract, Transform, Load (ETL) workflows to process large datasets for distributed storage and SQL-based analysis. The system supports diverse analytical implementations, such as a predictive engine using linear regression for value forecasting and a real-time architecture that moves data through message brokers for immediate reporting. It in
This project is an educational resource and technical manual for Apache Spark, focused on the architecture and practical application of large-scale data processing. It serves as a guide for big data engineering and distributed computing, covering the principles of parallel processing and fault-tolerant data distribution. The material provides instructional content on designing distributed ETL pipelines and implementing data analysis workflows. It includes tutorials for polyglot data processing, offering patterns and examples for using Python, Scala, and Java within a unified environment. The
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
ChatLab is a self-hosted chat database and data pipeline designed to normalize, store, and analyze large-scale social conversation histories. It functions as an analytics platform that uses large language models to extract patterns and insights from messaging data imported from multiple platforms. The system distinguishes itself through an AI-powered analysis engine that utilizes vector-based history analysis and agent-based function calling to summarize conversation trends. It further identifies behavioral patterns by generating visual analytics, including heatmaps, word clouds, and activity
Light Task Scheduler is a distributed job scheduling and workflow orchestration platform designed for managing background processing across scalable computing environments. It functions as a cluster management system that coordinates stateless nodes to execute recurring, cron-based, or one-time tasks with centralized control and high availability. The platform distinguishes itself through a leader-based coordination model that automatically elects a primary controller to manage task distribution and system state. It supports complex workflow dependencies, ensuring that prerequisite tasks comp
Nano is a distributed application framework designed for building systems using an actor-based messaging model. It functions as a distributed actor framework that decouples components through asynchronous messaging to maintain state isolation across a server cluster. The system acts as a cluster message dispatcher and session-aware request router, tracking client state to route incoming messages to the specific agent holding the session data. It utilizes a distributed agent registry to coordinate the dispatching of messages between multiple application instances acting as agents. The framewo
Perfetto is a platform for system-level performance tracing and analysis on Linux and Android. It combines a high-throughput trace recorder, a SQL-based query engine, and a browser-based visualizer into a single toolchain. The platform covers CPU scheduling and call-stack profiling, native and Java heap memory allocation tracking, GPU and graphics events, and system-wide counters such as CPU frequency and power consumption. The architecture decouples trace recording from offline analysis, using a compact protobuf format for event encoding and columnar storage for efficient SQL queries. The we
Lunatic is a WebAssembly runtime and concurrent process manager that implements an Erlang-inspired model of lightweight concurrency and fault tolerance. It functions as a distributed actor system where isolated processes communicate via message passing across a network of linked nodes. The system utilizes a WebAssembly sandbox environment to isolate memory and restrict system call permissions for each individual process. This capability-based security model ensures that processes are sandboxed to safely execute untrusted code. The platform provides a fault-tolerant supervision tree for hiera
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Emitter is a distributed pub-sub platform and message broker that provides real-time data routing between publishers and subscribers across a distributed cluster. It functions as an MQTT message broker for low-power devices and a WebSocket communication server for web-based clients, while acting as a secure channel orchestrator to manage encrypted data streams. The system distinguishes itself through a combination of distributed broker clustering for high availability and a persistence-backed message playback system. This allows the platform to store historical messages and deliver them to su
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It pairs a distributed file system, which manages large files across multiple nodes for fault tolerance and high throughput, with a programming model that processes large datasets in parallel across the cluster. It manages the underlying hardware and software environment required for distributed big dat
Cayley is a graph database and query engine designed to store and retrieve interconnected data. It functions as a quad store, persisting information as four-element tuples to maintain complex relationships and semantic linked data. The system features a backend-agnostic storage layer that decouples the graph API from the underlying data store. This allows for the integration of external backends through a modular adapter system, enabling the synchronization of data across different storage engines. The project provides a pattern-matching query engine for extracting specific nodes and relatio
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
LXD is a system container manager and virtual machine manager that provides a unified interface for running full Linux systems. It acts as a container cluster orchestrator, an image format converter, and an infrastructure manager that exposes control through a REST API and language-specific SDKs. The project distinguishes itself by providing a unified container and virtual machine abstraction, treating both as generic instances within a single management layer. It supports distributed cluster coordination to synchronize state and distribute workloads across multiple physical nodes. The syste
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
This project is a comprehensive pandas data analysis tutorial and instructional guide designed for learning data manipulation and analysis. It serves as a tabular data processing guide and a manual for time series analysis, providing a structured approach to cleaning, merging, and transforming datasets. The repository functions as a data feature engineering course, providing tutorials on constructing and selecting dataset features to improve machine learning model performance. It also includes a vectorized data operations guide for performing element-wise mathematical computations and matrix
xmall is a distributed e-commerce platform based on a service-oriented architecture. It separates business logic into independent services that communicate over a network to ensure scalability and fault tolerance, utilizing a decoupled storefront interface for customer transactions. The platform employs a distributed architecture using Dubbo for service orchestration and Zookeeper for cluster coordination and service discovery. It integrates a specialized set of components including an asynchronous message broker for background tasks, an indexed search system for product catalogs, and a centr