These open-source frameworks generate synthetic datasets that preserve the statistical properties of the original data while protecting its privacy.
This generative model library collects PyTorch and TensorFlow implementations of deep learning models for generating synthetic data and modeling complex probability distributions. It provides dedicated implementation suites for several architectures: Generative Adversarial Networks, which train competing generator and discriminator models; Variational Autoencoders, which map data to a latent space; and Restricted Boltzmann Machines and Deep Belief Networks. Together these cover probabilistic data modeling and unsupervised representation learning, including synthetic data generation and the use of energy-based networks to model binary data distributions.
This repository provides a collection of generative model implementations, such as GANs and VAEs, that can learn and replicate data distributions for synthetic data generation. It functions more as a library of model architectures than as a turnkey tool for tabular data pipelines.
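To make the energy-based modeling of binary data mentioned above more concrete, here is a minimal sketch of the two core quantities of a Restricted Boltzmann Machine: the energy of a visible/hidden configuration and the conditional activation probabilities of the hidden units. It uses only the Python standard library. The function names and the tiny weight matrix are illustrative assumptions and are not taken from the repository.

```python
import math

def sigmoid(x):
    """Logistic function used for RBM conditional probabilities."""
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, W, a, b):
    """Energy E(v, h) = -(a.v + b.h + v^T W h) for binary vectors v and h.

    W[i][j] connects visible unit i to hidden unit j;
    a and b are the visible and hidden biases.
    """
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(
        v[i] * W[i][j] * h[j]
        for i in range(len(v))
        for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)

def hidden_probs(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j])."""
    return [
        sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]

# Hypothetical toy model: two visible units, one hidden unit.
W = [[2.0], [3.0]]
a = [0.5, 0.5]
b = [1.0]
example_energy = energy([1, 0], [1], W, a, b)
```

Lower energy corresponds to higher probability under the model, which is why training pushes the energy of observed binary vectors down relative to sampled ones.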
Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points. The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side-by-side human testing and automated grading to ensure objective performance metrics. Users can orchestrate complex data pipelines that transform raw documents into structured formats through recursive segmentation, automated taxonomy classification, and customizable text refinement. Beyond core generation and management, the system supports a wide range of data processing tasks, including visual document extraction, content augmentation, and the creation of multi-turn conversational datasets. It offers flexible configuration for model connections and generation parameters, allowing for fine-grained control over output quality and consistency. The platform is designed for local deployment to maintain data privacy and security. It includes built-in tools for programmatic quality assessment and supports the export of processed datasets into standard formats compatible with various fine-tuning pipelines.
This platform provides end-to-end synthetic data generation and management for machine learning pipelines, though it focuses more on LLM-based text and document augmentation than on preserving the statistical distributions of tabular datasets.
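The recursive segmentation step described above can be sketched as follows. This is a minimal illustration of the general technique, assuming a fixed list of separators tried from coarsest to finest; the function names, the separator order, and the greedy merge step are assumptions for illustration and do not reflect Easy-dataset's actual implementation.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_len characters.

    Tries the coarsest separator first and recurses into oversized
    parts with the finer separators. Separators are dropped from the
    output, so this sketch is lossy with respect to whitespace.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
            return chunks
    # No separator applies: fall back to a hard cut.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]

def merge_small(chunks, max_len, joiner=" "):
    """Greedily rejoin adjacent pieces while they still fit in max_len."""
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) + len(joiner) + len(chunk) <= max_len:
            merged[-1] = merged[-1] + joiner + chunk
        else:
            merged.append(chunk)
    return merged

document = "para one.\n\npara two is longer. It has two sentences."
segments = merge_small(recursive_split(document, 20), 20)
```

Splitting on paragraph breaks before sentences and words keeps semantically related text together for as long as the size limit allows.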
DataFlow is an agent-based workflow orchestrator and data pipeline designed to synthesize, clean, and augment large-scale datasets for training large language models. It functions as a synthetic data generator and text curation tool, utilizing an intelligent assistant to assemble modular processing operators into functional pipelines based on user requirements. The project distinguishes itself through a low-code approach, providing a web-based visual interface for designing and monitoring multi-stage execution flows. It features an operator-based registry system that allows for the integration of third-party components and a guided command-line process to bootstrap distributable operator libraries. The platform covers a broad range of data engineering capabilities, including unstructured knowledge extraction, text deduplication via MinHash, and noise filtering. It specifically supports the generation of complex reasoning chains, multi-hop question-answer pairs, and synthetic SQL datasets with integrated validity filtering and difficulty evaluation. The system is implemented in Python.
DataFlow is a synthetic data generator focused on LLM training pipelines and reasoning chains, providing a visual interface to orchestrate the creation of complex synthetic datasets.
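The MinHash-based deduplication mentioned above can be illustrated with a short standard-library sketch. It estimates the Jaccard similarity of two texts from fixed-length signatures, so near-duplicates can be flagged without comparing full shingle sets. The function names, shingle size, and number of hash functions are illustrative assumptions, not DataFlow's operator API.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _seeded_hash(seed, item):
    """64-bit hash of an item; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_perm=128):
    """Keep, for each seeded hash function, the minimum hash over all items."""
    return [min(_seeded_hash(seed, item) for item in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "synthetic tabular records should preserve marginal distributions of every column"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_duplicate = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
```

A deduplication pass would drop one document from each pair whose estimated similarity exceeds a chosen threshold; large-scale pipelines typically add locality-sensitive hashing so that not every pair has to be compared.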
LimiX is a tabular foundation model and a suite of tools for structured data, providing a transformer-based system for classification, regression, and data generation. It includes a causal inference engine to determine cause-and-effect relationships, a synthetic data generator, and a framework for filling missing dataset values through feature context prediction. The project optimizes tabular inference through a high-performance system that uses ensemble-based sample retrieval to increase prediction speed and accuracy on high-specification hardware. It further distinguishes itself by using transformer-based encoding and masked-feature pretraining to learn data distributions. The system covers a broad range of analytical capabilities, including high-dimensional vector embedding for categorical separation and the creation of synthetic samples via causal-graph data generation. Its predictive surface extends to specific applications such as electricity market price forecasting and the analysis of molecular properties in organic molecules.
LimiX is a tabular foundation model that includes a dedicated synthetic data generator capable of creating samples while preserving statistical distributions and causal relationships within structured data.
This project provides an end-to-end framework for adapting large language models to follow user instructions through supervised fine-tuning. It functions as a comprehensive training pipeline that enables the creation of specialized assistant models by minimizing the difference between predicted outputs and target responses within structured instruction datasets. The framework distinguishes itself by integrating synthetic data generation with memory-efficient training techniques. It utilizes powerful language models to iteratively expand small sets of human-written seeds into diverse, high-quality instruction-response pairs, significantly reducing the cost of data acquisition. Furthermore, it employs parameter-efficient adaptation methods, such as low-rank matrix decomposition, to update model weights with minimal computational overhead. The toolkit also includes utilities for model weight reconstruction, allowing users to apply calculated parameter offsets to base model checkpoints. This approach enables the distribution and deployment of fully functional fine-tuned models without the need to share large, complete weight files. The repository provides the necessary scripts, data generation pipelines, and evaluation procedures to support the reproduction and development of instruction-following workflows.
This project is a framework for fine-tuning large language models using instruction-following datasets, rather than a tool designed to generate privacy-preserving synthetic datasets that mirror the statistical properties of tabular source data.
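Two of the ideas described above, low-rank weight updates and reconstructing fine-tuned weights from a base checkpoint plus an offset, can be sketched with plain Python lists. This is a toy illustration under stated assumptions: the function names are hypothetical, matrices are nested lists rather than tensors, and nothing here reflects the project's actual scripts.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def low_rank_update(base, B, A, scale=1.0):
    """Return base + scale * (B @ A), where B is d x r and A is r x k with small r."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(rw, rd)] for rw, rd in zip(base, delta)]

def weight_diff(tuned, base):
    """Offset that can be distributed instead of the full tuned weights."""
    return [[t - b for t, b in zip(rt, rb)] for rt, rb in zip(tuned, base)]

def recover(base, diff):
    """Rebuild the tuned weights by adding the offset to the base weights."""
    return [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(base, diff)]

def trainable_params(d, k, r):
    """Parameter count of the low-rank factors versus the full d x k matrix."""
    return r * (d + k), d * k

base = [[1, 0], [0, 1]]
B = [[1], [2]]      # d x r with r = 1
A = [[3, 4]]        # r x k
tuned = low_rank_update(base, B, A)
restored = recover(base, weight_diff(tuned, base))
low_rank_count, full_count = trainable_params(4096, 4096, 8)
```

For a 4096 x 4096 layer with rank 8, the low-rank factors hold 65,536 trainable values against 16,777,216 in the full matrix, which is where the memory savings come from.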