30 open-source projects similar to nucleuscloud/neosync, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Neosync alternative.
Mimesis is a Python synthetic data generator used to create realistic fake datasets and mock data for software testing and development. It functions as a schema-based dataset generator capable of producing structured records and relational datasets, while also serving as a production data anonymizer to replace sensitive information with synthetic values. The library distinguishes itself through comprehensive multilingual support, allowing for the generation of locale-specific information to simulate regional user profiles. It ensures reproducibility through deterministic data generation using
Replibyte is a tool that automates the lifecycle of database snapshots for non-production environments, handling the export, anonymization, subsetting, and restoration of data. It is designed to support privacy-compliant development workflows by replacing sensitive production data with synthetic values and extracting consistent subsets of rows while preserving referential integrity. The tool operates through a configurable pipeline defined in a YAML file, orchestrating stages such as dump, anonymize, subset, and restore. Each operation runs as an isolated, ephemeral container job, and snapsho
Presidio is a PII detection and anonymization framework designed to identify and mask personally identifiable information in text. It functions as a PII recognition pipeline and a data masking engine, using a combination of machine learning, regular expressions, and rule-based logic to locate sensitive entities. The system acts as an NER model orchestrator, allowing for the integration of external named entity recognition models and PII detectors to support multi-language privacy scrubbing. It employs a plugin-based recognizer architecture that can be extended with custom recognizers, deny-li
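The combination of regular expressions and deny-lists described in the Presidio entry can be illustrated with a small standard-library sketch. This is not Presidio's API: `Finding`, `regex_recognizer`, `deny_list_recognizer`, and `anonymize` are hypothetical names, the email pattern is deliberately simplistic, and overlapping findings are not resolved.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A detected entity and its character span (hypothetical type)."""
    entity: str
    start: int
    end: int


def regex_recognizer(entity, pattern):
    """Build a recognizer that reports every match of `pattern`."""
    compiled = re.compile(pattern)

    def recognize(text):
        return [Finding(entity, m.start(), m.end()) for m in compiled.finditer(text)]

    return recognize


def deny_list_recognizer(entity, words):
    """Build a recognizer that flags any whole word from a fixed list."""
    alternatives = "|".join(re.escape(w) for w in words)
    return regex_recognizer(entity, r"\b(?:" + alternatives + r")\b")


def anonymize(text, recognizers):
    """Replace each finding with a <ENTITY> placeholder.

    Assumes findings do not overlap; a real engine needs conflict resolution.
    """
    findings = [f for recognize in recognizers for f in recognize(text)]
    # Replace from right to left so earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.entity}>" + text[f.end:]
    return text


recognizers = [
    regex_recognizer("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    deny_list_recognizer("TITLE", ["Mr", "Mrs", "Dr"]),
]
masked = anonymize("Dr Smith wrote to jane@example.com", recognizers)
```

Each recognizer is just a function from text to findings, which is one simple way to keep the set of detectors extensible.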
Jailer is a suite of specialized tools for AI-assisted SQL management, referential integrity preservation, and relational data browsing. It provides a system for generating referentially intact database subsets, allowing users to extract consistent slices of relational data while preserving foreign key constraints and dependencies. The project features an AI-driven SQL assistant that uses natural language to generate, optimize, and refactor queries based on database schemas. It also includes a data migration tool that analyzes SQL patterns to reverse engineer models and map associations betwe
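To make the idea of a referentially intact subset concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how Jailer works internally: the two-table schema, the sample rows, and the `make_source` and `subset_orders` helpers are all assumptions made for illustration. The point is only that selecting child rows must also pull in the parent rows they reference.

```python
import sqlite3

# Hypothetical schema: orders reference customers through a foreign key.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL NOT NULL
);
"""


def make_source():
    """Create an in-memory source database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "Ada"), (2, "Grace"), (3, "Linus")])
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(10, 1, 9.5), (11, 2, 20.0), (12, 1, 3.25), (13, 3, 7.0)])
    db.commit()
    return db


def subset_orders(src, order_ids):
    """Copy the chosen orders plus every customer they reference."""
    dst = sqlite3.connect(":memory:")
    dst.execute("PRAGMA foreign_keys = ON")  # enforce constraints in the copy
    dst.executescript(SCHEMA)

    marks = ",".join("?" for _ in order_ids)
    orders = src.execute(
        f"SELECT id, customer_id, total FROM orders WHERE id IN ({marks})",
        list(order_ids),
    ).fetchall()

    # Follow the foreign key: collect the parents of the selected rows.
    parent_ids = sorted({row[1] for row in orders})
    marks = ",".join("?" for _ in parent_ids)
    customers = src.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", parent_ids
    ).fetchall()

    # Insert parents before children so the constraint holds at every step.
    dst.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()
    return dst


subset = subset_orders(make_source(), [10, 13])
```

A real subsetting tool has to walk an arbitrary graph of foreign keys, including cycles and nullable references; this sketch follows a single parent link only.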
This is a generative AI model library containing a collection of PyTorch and TensorFlow implementations for creating synthetic data and modeling complex probability distributions. It serves as a multi-framework repository of deep learning models designed for learning and replicating data patterns. The project provides specialized implementation suites for several generative architectures. This includes Generative Adversarial Networks using competing generator and discriminator models, Variational Autoencoder frameworks that map data to a latent space, and Restricted Boltzmann Machine and Deep
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Apache NiFi is a flow-based programming platform that enables the visual design, monitoring, and management of data pipelines. At its core, it provides a web-based visual dataflow designer where users build directed graphs of processors to route, transform, and mediate data movement between any source and destination without writing custom code. The system records fine-grained data provenance for every data item from ingestion to delivery, supporting audit, debugging, and replay of data lineage. The platform distinguishes itself through a zero-master cluster architecture that distributes proc
Kedro is a data science pipeline framework and production toolbox designed to build reproducible, modular workflows using software engineering best practices. It functions as a data engineering orchestrator and catalog manager, bridging the gap between interactive analysis and maintainable production pipelines. The framework distinguishes itself by using a data catalog to decouple data access from processing logic and providing tools to transition analysis from interactive notebooks into structured workflows. It includes a workflow visualization tool that generates visual maps of data pipelin
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
Streem is a stream-based programming language and data pipeline orchestrator. It provides a domain-specific language for defining concurrent data flows, allowing users to link data sources to destinations through a sequence of operations that transform and filter individual stream elements. The system uses a custom script syntax to define data-flow connections and pipeline definitions. This allows for the orchestration of concurrent data processing where multiple pipeline stages execute simultaneously to move data elements through the system. The platform covers functional data transformatio
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Faker is a PHP library for creating realistic synthetic data used for testing, prototyping, and populating database entities. It serves as a test data generator and localized mocking tool capable of producing synthetic names, addresses, and identifiers specific to various countries and languages. The library provides mechanisms to ensure data consistency and quality, including deterministic seeding to produce identical data sequences across executions and stateful uniqueness tracking to prevent duplicate values. It also supports probability-weighted optionality to simulate missing data and cu
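The three mechanisms named in the Faker entry (deterministic seeding, uniqueness tracking, and probability-weighted optional values) can be sketched in a few lines of Python. This is an illustration of the concepts only, not Faker's PHP API: the `TinyFaker` class, its method names, and the name list are hypothetical.

```python
import random


class TinyFaker:
    """Hypothetical stand-in for a fake-data generator."""

    NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis", "Barbara"]

    def __init__(self, seed):
        # A private, seeded generator makes every sequence reproducible.
        self._rng = random.Random(seed)
        self._seen = set()

    def name(self):
        """Return a name; repeats are allowed."""
        return self._rng.choice(self.NAMES)

    def unique_name(self):
        """Return a name this instance has not handed out before."""
        remaining = [n for n in self.NAMES if n not in self._seen]
        if not remaining:
            raise OverflowError("no unique names left")
        value = self._rng.choice(remaining)
        self._seen.add(value)
        return value

    def optional(self, value, weight=0.5):
        """Return `value` with probability `weight`, otherwise None."""
        return value if self._rng.random() < weight else None


first = TinyFaker(seed=1234)
second = TinyFaker(seed=1234)
run_a = [first.name() for _ in range(5)]
run_b = [second.name() for _ in range(5)]  # same seed, same sequence
```

Keeping the random state and the set of seen values on the instance means two generators never interfere with each other, which is convenient when tests run in parallel.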
This project is a synthetic data generator designed to create realistic tabular and time-series datasets for machine learning and testing workflows. It functions as a privacy-preserving platform that models the underlying statistical distributions of source data to produce new records that maintain the original statistical properties and structural integrity. The tool distinguishes itself by utilizing CPU-optimized statistical sampling, allowing for high-performance data generation on standard hardware without the need for specialized graphics processing units. It employs a configuration-driv
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
Orchest is a data pipeline orchestrator and containerized workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform distinguishes itself by using containers to manage software dependencies, ensuring consistent execution across different environments. It features a polyglot task scheduler capable of triggering jobs written in multiple programming languages and includes a version control system that tracks historical snapshots of project configurations and code.
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
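Ordering transformations through a directed acyclic graph, as the dbt-core entry describes, amounts to a topological sort. The sketch below uses Python's standard `graphlib` module rather than anything from dbt itself; the model names and the `execution_order` helper are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
models = {
    "stg_customers": set(),
    "stg_orders": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_report": {"orders_enriched"},
}


def execution_order(dependencies):
    """Return the models in an order where dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(models)


def has_cycle(dependencies):
    """Report whether the dependency graph is not a valid DAG."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False
```

Several valid orders usually exist (the two staging models here can run in either order), so callers should rely only on the guarantee that a model never precedes its dependencies.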
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
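The idea in the Luigi entry of checking for an output target before doing any work can be shown with a small standard-library sketch. It does not use the Luigi package: `FileTask`, `build`, and `demo` are hypothetical, and the "work" is just writing a text file.

```python
import tempfile
from pathlib import Path


class FileTask:
    """A task that is considered done once its output file exists."""

    def __init__(self, name, output, requires=()):
        self.name = name
        self.output = Path(output)
        self.requires = list(requires)

    def complete(self):
        return self.output.exists()

    def run(self):
        # Placeholder work: a real task would compute something here.
        self.output.write_text(f"produced by {self.name}\n")


def build(task, ran):
    """Run `task` and its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dependency in task.requires:
        build(dependency, ran)
    task.run()
    ran.append(task.name)


def demo():
    """Build the same two-step pipeline twice and record what actually ran."""
    with tempfile.TemporaryDirectory() as tmp:
        extract = FileTask("extract", Path(tmp) / "raw.txt")
        report = FileTask("report", Path(tmp) / "report.txt", requires=[extract])
        first, second = [], []
        build(report, first)   # nothing exists yet, so both tasks run
        build(report, second)  # outputs exist, so nothing runs
        return first, second
```

Because completion is derived from the outputs rather than from a separate log, a pipeline that failed halfway can be re-run and will resume from the first missing target.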
Joyagent-jdgenie is an automated data orchestrator designed to centralize the retrieval and processing of information from disparate remote sources. It functions as a framework for building repeatable data pipelines that fetch, clean, and normalize raw input into consistent, structured formats. The system utilizes a schema-driven engine to apply validation rules and structural templates to incoming data, ensuring compatibility across enterprise systems. By employing configuration-based workflow definitions, it allows for the orchestration of modular tasks into automated execution flows, separ
Airbyte is a data integration platform designed to synchronize information between diverse applications, databases, and data warehouses. It functions as an extract, transform, and load orchestrator that manages automated data movement workflows across cloud, on-premise, and hybrid environments. The platform provides a standardized interface for connectors, enabling the movement of structured and unstructured data while maintaining stateful checkpoints for reliable incremental syncing. The platform distinguishes itself through a containerized architecture that isolates connectors to prevent de
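Stateful checkpoints for incremental syncing, mentioned in the Airbyte entry, are commonly built around a saved cursor. The following sketch shows that general pattern and is not Airbyte's connector protocol: the `updated_at` field, the `state` dictionary layout, and the `incremental_sync` function are assumptions for illustration.

```python
def incremental_sync(source_rows, state, sink):
    """Copy rows newer than the saved cursor and advance the checkpoint.

    `state` is a plain dict that the caller persists between runs.
    Returns the number of rows written to `sink`.
    """
    cursor = state.get("cursor", 0)
    pending = sorted(
        (row for row in source_rows if row["updated_at"] > cursor),
        key=lambda row: row["updated_at"],
    )
    for row in pending:
        sink.append(row)
        # Checkpoint after each row so an interrupted run can resume here.
        state["cursor"] = row["updated_at"]
    return len(pending)


# Hypothetical source table and an empty destination.
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 105},
    {"id": 3, "updated_at": 110},
]
state, sink = {}, []

copied_first = incremental_sync(source, state, sink)   # full initial load
copied_second = incremental_sync(source, state, sink)  # nothing new

source.append({"id": 4, "updated_at": 120})
copied_third = incremental_sync(source, state, sink)   # only the new row
```

A strict greater-than comparison can miss rows that share the cursor's timestamp, so production systems often pair the cursor with a tie-breaking key or re-read a small overlap window.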
DolphinScheduler is a distributed workflow orchestrator designed to manage and automate complex data processing pipelines. It functions as a data pipeline scheduler that coordinates multi-step tasks across distributed environments, ensuring reliable execution through defined dependencies and sequences. The platform utilizes a directed acyclic graph model to represent workflows, allowing users to define task relationships via a visual interface. It employs a master-worker architecture supported by a pluggable task plugin system, which enables the dynamic extension of task types without requiri
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Azkaban is a distributed workflow manager and DAG-based job orchestrator designed as an enterprise batch processor. It serves as a Java-based workflow engine that schedules and executes complex job sequences across a cluster of executor servers, with specific functionality for managing big data workloads on Hadoop clusters. The system distinguishes itself through a distributed executor model that coordinates state via a shared database to ensure high availability. It employs a plugin-based architecture that allows for custom job types and system functionality extensions, including the ability
Airflow is a workflow orchestration platform for authoring, scheduling, and monitoring complex data pipelines as code using Python. It employs a DAG-based task scheduler to manage execution timing and dependencies via directed acyclic graphs, utilizing a distributed task execution engine to run workloads across a cluster of worker nodes. The platform provides a data pipeline monitor for tracking the health and execution history of programmatic workflows. This includes a web interface for workflow progress visualization and health monitoring to identify and troubleshoot pipeline failures. The
TinyTroupe is a multi-agent simulation framework designed to create populations of persona-based agents that interact to generate synthetic behavioral data and business insights. It serves as a persona-based agent orchestrator and synthetic data generator, allowing for the definition of agents with specific personality traits and goals to coordinate their interactions through structured workflows. The project features an extensible plugin system for connecting simulated agents to external tools and servers to execute code and access remote data. It includes an agentic simulation dashboard tha
Java-faker is a synthetic data generator and mock data library for Java applications. It provides utilities to create randomized, believable fake records such as names and addresses to populate test environments and verify application logic without using real user information. The library specializes in localized data generation, producing synthetic content tailored to specific languages and regional formats. This allows for the verification of application accuracy across different global locales. The tool covers broad capabilities for mocking in automated tests, including the generation of m
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
This project is a streaming data integration framework that captures real-time database changes and synchronizes them with downstream systems. It operates as a distributed streaming ETL and database synchronizer, reading database logs and snapshots to propagate row-level modifications to target sinks. The system supports declarative data integration, allowing users to define source-to-sink data flows using SQL or YAML configurations. It distinguishes itself by automating schema evolution to maintain synchronization when source structures change and ensuring exactly-once delivery and processin
This project is a relational SQL sample database and synthetic testing dataset. It provides a standardized data model of a fictional digital media store, encompassing business entities such as artists, albums, tracks, customers, and invoices. The dataset is designed as a cross-dialect SQL collection, using compatible scripts to ensure consistent data seeding and environment parity across different database server engines. It combines imported metadata with fictitious personal details to create realistic records for software prototyping and demonstrations. The project covers capabilities for
Recommenders is a recommendation system framework designed for building, benchmarking, and deploying collaborative and content-based filtering models. It provides a machine learning model pipeline that standardizes the process of moving recommendation data from raw ingestion through training and evaluation. The project functions as a model benchmarking toolkit, utilizing standardized ranking and error metrics to compare the accuracy of different algorithms. It also serves as a hyperparameter tuning tool, allowing for the optimization of model behavior and performance via external configuratio