30 open-source projects similar to opendcai/dataflow, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best DataFlow alternative.
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Apache NiFi is a flow-based programming platform that enables the visual design, monitoring, and management of data pipelines. At its core, it provides a web-based visual dataflow designer where users build directed graphs of processors to route, transform, and mediate data movement between any source and destination without writing custom code. The system records fine-grained data provenance for every data item from ingestion to delivery, supporting audit, debugging, and replay of data lineage. The platform distinguishes itself through a zero-master cluster architecture that distributes proc
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
Beads is a versioned, dependency-aware graph database designed for distributed issue tracking and project management. It functions as an agentic workflow orchestrator, providing a structured environment where tasks, dependencies, and project metadata are linked through relational hierarchies. By maintaining a persistent, version-controlled record of project state, the system enables teams to manage complex work items across multiple repositories and environments. The platform distinguishes itself through its deep integration with automated coding agents, acting as a Model Context Protocol ser
This project is a platform that orchestrates multiple AI agents to automate data science workflows, covering data loading, cleaning, feature engineering, modeling, and querying. It also functions as a natural language database query interface, converting plain English questions into SQL, and as a visual data pipeline builder. Custom agents are generated on demand by filling prompt templates for tasks like data cleaning and feature engineering. Pipelines incorporate human-in-the-loop checkpoints that pause execution for review and approval. Intermediate results are saved as versioned files, ena
Orchest is a data pipeline orchestrator and containerized workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform distinguishes itself by using containers to manage software dependencies, ensuring consistent execution across different environments. It features a polyglot task scheduler capable of triggering jobs written in multiple programming languages and includes a version control system that tracks historical snapshots of project configurations and code.
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
ChatDev is an automated software engineering platform that orchestrates the end-to-end development lifecycle through a multi-agent framework. It functions as a programmable engine that coordinates specialized autonomous agents to handle design, coding, testing, and documentation tasks by transitioning through predefined phases of a software project. The system distinguishes itself by using role-based agent specialization to simulate a professional engineering team, assigning distinct personas and knowledge bases to individual agents. It employs prompt-driven task decomposition to break high-l
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
UltraRAG is an LLM RAG orchestration platform and AI agent research framework designed to coordinate complex retrieval-augmented generation workflows. It functions as a multimodal RAG engine capable of retrieving and generating responses using text, images, and diverse data types, while providing tools for vector database management and RAG performance evaluation. The platform features a visual RAG pipeline builder that uses a canvas interface to construct and debug data flows, synchronizing visual designs directly with underlying code. It distinguishes itself through an autonomous research s
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
This project is an AI-powered IDE extension and LLM coding assistant that provides a conversational interface for generating, refactoring, and debugging code. It functions as an AI agent framework and a Model Context Protocol client, connecting AI models to external data sources and tools to automate complex development tasks. The system is distinguished by its use of autonomous AI agents capable of multi-step task execution, including the ability to read files, modify code, and run terminal commands iteratively. It supports recursive agent orchestration through subagent delegation and employ
This project is a training pipeline and framework for developing Chinese language models based on the Llama 2 architecture. It functions as a distributed GPU trainer and dataset preprocessing toolkit designed for both the initial pre-training of baseline models and subsequent supervised fine-tuning. The system distinguishes itself through a specialized workflow for Chinese text, incorporating a data curation pipeline that uses similarity hashing for deduplication and a tokenization process that converts raw text into memory-mapped binary files for efficient disk access. It implements a superv
AdalFlow is an autonomous AI agent framework and LLM application library designed for building modular workflows. It serves as a model-agnostic interface and RAG pipeline orchestrator, allowing users to develop ReAct agents that utilize iterative reasoning and external tool execution to solve complex tasks. The project distinguishes itself through a prompt optimization system that uses textual gradient descent to automatically refine prompt templates and few-shot examples. It treats model feedback as a differentiable signal, enabling a form of LLM backpropagation to iteratively improve output
Kilocode is an autonomous engineering platform designed to orchestrate AI agents for complex software development tasks. It functions as a comprehensive system for automating coding, testing, and repository management by integrating directly with the user's codebase and terminal. The platform provides a unified gateway for model orchestration, allowing for the management of agentic workflows, event-driven automation, and persistent session state across distributed development environments. The platform distinguishes itself through its federated task management and policy-based access control, which
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
The synthetic data kit is an integrated framework designed to generate, curate, and format training datasets for language models. It provides an end-to-end pipeline that transforms raw source documents into structured data suitable for fine-tuning, reasoning, and tool-use model training. The framework distinguishes itself through a modular orchestration engine that manages the entire lifecycle of data preparation. It supports multimodal input by extracting both text and image content from various file formats, while employing context-aware chunking to maintain semantic coherence. The generati
This project is a collection of deep learning research implementations and a reproduction kit designed to translate theoretical AI papers into working code. It provides a library of neural network architectures and reference implementations for reproducing seminal research concepts through interactive notebooks. The repository distinguishes itself through the implementation of AI theory and scaling laws, covering complexity dynamics, information theory, and the simulation of universal AI agents. It also includes a benchmarking suite for synthetic reasoning, allowing for the evaluation of mode
Langroid is a multi-agent orchestration framework and tool integration suite designed for building complex AI applications. It serves as a multi-modal integration layer that connects diverse local and remote language models with an agentic retrieval-augmented generation system. The project distinguishes itself through a collaborative message-exchange paradigm, allowing specialized agents to delegate tasks hierarchically and coordinate via structured communication. It features an advanced state management system for conversational AI, including the ability to rewind and prune conversation hist
Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points. The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
This project serves as a comprehensive educational resource and technical handbook for engineers building applications powered by large language models. It provides a structured framework for mastering the principles of artificial intelligence engineering, covering the full lifecycle of model development from initial design to production deployment. The repository distinguishes itself by offering a deep dive into the practical implementation of advanced design patterns, including retrieval-augmented generation, agentic tool orchestration, and parameter-efficient model adaptation. It emphasize
UltraChat is a collection of large-scale conversational datasets and instruction-tuning data designed for training and evaluating generative AI models. It provides structured JSON data consisting of complex, multi-round dialogue sequences intended to refine the performance of large language models in chat tasks. The project focuses on improving reasoning and response quality through a diverse set of interactions across multiple sectors. These datasets are used for supervised fine-tuning and instruction tuning workflows to improve how models follow complex directions and maintain context acros
This project provides a comprehensive Chinese language corpus designed to support the training and fine-tuning of large language models. It serves as a structured natural language processing resource, offering a collection of text data that includes dialogue, customer service interactions, and creative writing. The dataset is organized into distinct thematic categories, allowing for targeted model development across specific conversational and narrative contexts. By providing information in standardized, schema-agnostic text formats, the collection ensures portability across various machine l
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Streem is a stream-based programming language and data pipeline orchestrator. It provides a domain-specific language for defining concurrent data flows, allowing users to link data sources to destinations through a sequence of operations that transform and filter individual stream elements. The system uses a custom script syntax to define data-flow connections and pipeline definitions. This allows for the orchestration of concurrent data processing where multiple pipeline stages execute simultaneously to move data elements through the system. The platform covers functional data transformatio
Archon is an artificial intelligence agent automation engine designed to orchestrate complex development workflows. It functions as a platform for chaining multi-step tasks into directed graphs, allowing developers to standardize and execute repeatable coding patterns through declarative configuration files. The system distinguishes itself by maintaining stateful context across long-running sessions and executing operations within isolated, containerized worktrees to prevent file interference. It integrates with external language models and provides a centralized registry for sharing and inst
This project provides a high-resolution face dataset consisting of 70,000 human face images in PNG format. It serves as a curated library of aligned images and facial landmark data designed for generative model training, facial recognition, and image synthesis research. The dataset includes machine-readable metadata that pairs images with precise facial coordinate points, source URLs, and copyright information. This coordinate data enables the transformation of raw photos into a standardized 1024x1024 pixel resolution through landmark-based alignment and cropping. The repository includes aut