18 repositories
Logic for modifying the structure and values of specific data columns within a dataset.
Distinct from Field Manipulation APIs: the closest candidates are either UI-focused or narrow API methods; this is a general data processing capability.
Explore 18 awesome GitHub repositories matching data & databases · Field Transformations. Refine with filters or upvote what's useful.
Keystone Classic is a Node.js headless content management system and web application framework. It provides a database schema framework for defining structured data models and validation rules to organize information. The system automatically generates a responsive administrative dashboard based on predefined data models and database fields, allowing for content management and record editing without custom administration code. The framework covers identity and security through session state management and password encryption. It includes capabilities for request routing, form submission proc
Allows modifying or formatting data using specialized methods before it is saved to or retrieved from the database.
Miller is a command-line data processor used for filtering, transforming, and aggregating name-indexed tabular data. It functions as a tool for querying and reshaping records across multiple file formats, serving as a converter between CSV, JSON, and YAML. The tool distinguishes itself by using a name-indexed data model, allowing users to manipulate fields by name rather than numeric position. It utilizes single-pass streaming algorithms to compute statistics and summaries on large datasets that exceed available system memory. Its capabilities cover data transformation and analysis, includin
Modifies datasets by removing unwanted columns or calculating new fields using logical expressions.
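As a rough illustration of this kind of transformation (dropping unwanted columns and computing new fields from expressions over named fields), here is a minimal Python sketch. It is not Miller's syntax or API; the function `transform_records`, its parameters, and the sample data are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Set

Record = Dict[str, Any]


def transform_records(
    records: Iterable[Record],
    drop: Set[str],
    derive: Dict[str, Callable[[Record], Any]],
) -> Iterator[Record]:
    """Stream records, removing `drop` fields and adding derived fields.

    Hypothetical helper for illustration only; fields are addressed by name,
    and records are processed one at a time rather than loaded in bulk.
    """
    for record in records:
        # Keep every field whose name is not in the drop set.
        out = {name: value for name, value in record.items() if name not in drop}
        # Each derived field is computed from the original (unmodified) record.
        for name, expression in derive.items():
            out[name] = expression(record)
        yield out


# Made-up sample data.
rows = [
    {"item": "pen", "price": 2.0, "qty": 3, "note": "x"},
    {"item": "ink", "price": 5.0, "qty": 1, "note": "y"},
]

result = list(
    transform_records(
        rows,
        drop={"note"},
        derive={
            "total": lambda r: r["price"] * r["qty"],  # arithmetic expression
            "bulk": lambda r: r["qty"] >= 3,            # logical expression
        },
    )
)
```

Because the helper is a generator, it handles one record at a time, which loosely mirrors the single-pass streaming style described in the entry above.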
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Supports renaming or replacing specific fields within a record to align source schemas with destination requirements.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Applies user-defined mapping functions to modify, enrich, or clean individual dataset fields.
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
Displays column names, data types, and sample values to help understand a CSV file's structure.
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Applies per-column options such as date format parsing, null-value substitution, and whitespace trimming during CSV loading.
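The per-column options named here (date format parsing, null-value substitution, whitespace trimming) can be sketched in plain Python as follows. This is not pgloader's command language; the function `load_csv` and the option keys `trim`, `null_if`, and `date_format` are assumptions made up for this example.

```python
import csv
import io
from datetime import datetime
from typing import Any, Dict, List


def load_csv(text: str, column_options: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Parse CSV text and apply hypothetical per-column options to each value."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            opts = column_options.get(name, {})
            if opts.get("trim"):
                value = value.strip()  # whitespace trimming
            if "null_if" in opts and value == opts["null_if"]:
                value = None  # null-value substitution
            elif "date_format" in opts:
                # date format parsing with an explicit strptime pattern
                value = datetime.strptime(value, opts["date_format"]).date()
            row[name] = value
        rows.append(row)
    return rows


# Made-up sample input.
sample = "name,joined,score\n  Ada ,10/12/1815,NA\nAlan,23/06/1912,7\n"

loaded = load_csv(
    sample,
    {
        "name": {"trim": True},
        "joined": {"date_format": "%d/%m/%Y"},
        "score": {"null_if": "NA"},
    },
)
```

Columns without an entry in `column_options` pass through unchanged as strings, so options only need to be declared where the source format deviates from the target.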
RediSearch is a Redis module that adds secondary indexing, full-text search, aggregation, and vector similarity search directly into the in-memory data store. It operates as an in-process search engine, extending the core key-value store with capabilities for indexing hash and JSON documents, enabling fast field-level lookups beyond primary key access. The module provides a full-text search engine built on inverted indexes, supporting stemming, fuzzy matching, and relevance scoring via tf-idf. It also includes a vector similarity search engine using a Hierarchical Navigable Small World graph
Computes new field values from existing ones using arithmetic expressions and built-in functions in the aggregation pipeline.
attrs is a Python library that automatically generates initialization, representation, equality, hashing, and ordering methods from declarative class attribute definitions. At its core, it provides a class decorator metaprogramming framework that intercepts class creation to rewrite the class body, producing dunder methods without manual boilerplate. The library includes a comprehensive attribute validation toolkit with built-in validators for type checks, range constraints, regex matching, length limits, and logical composition of validation rules. The library distinguishes itself through it
Supports generator functions as field transformers during class creation.
GluonTS is a framework for probabilistic time series forecasting, designed to predict future values as probability distributions with confidence intervals. It supports both traditional model training and zero-shot forecasting, where pretrained models generate predictions for new series without additional training. The project distinguishes itself by integrating a wide variety of forecasting approaches into a unified workflow. This includes deep learning architectures such as recurrent neural networks and causal convolutions, as well as the integration of external statistical models, the Proph
Converts date-based start fields into standardized periods using specific observation frequencies.
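GluonTS is a probabilistic time series library and deep learning forecasting framework. It provides a toolkit for building, training, and evaluating neural network architectures that quantify uncertainty by predicting future values as probability distributions. The project is distinguished by its support for zero-shot forecasting and its integration of multiple modeling approaches, including deep probabilistic neural networks and wrappers around external statistical libraries such as Prophet and R forecast. It implements specialized architectural primitives such as causal convolutions and invertible residual networks to prevent information leakage and to map latent representations to valid probability distributions. The framework covers comprehensive data engineering features, including time series scaling, bijective transformations, and hierarchical modeling. It uses Apache Arrow and Parquet for high-performance dataset streaming and random-access management. For model evaluation, it includes an evaluation suite that measures forecast accuracy and probabilistic coverage with metrics such as quantile loss and the continuous ranked probability score (CRPS). The library supports model deployment through integration with Amazon SageMaker.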
Implements logic for modifying the structure and values of specific data columns within a dataset.
Vega-Lite is a high-level declarative language for specifying interactive, multi-view visualizations. It compiles a concise JSON specification into a full Vega visualization, automatically inferring scales, axes, and legends from encoding declarations. The grammar-of-graphics encoding maps data fields to visual channels such as position, color, size, and shape, while a multi-view composition grammar enables layered, faceted, concatenated, and repeated layouts. Reactive parameter binding links named parameters to input widgets, selections, and expressions for dynamic updates. The project suppo
Vega-Lite creates a new field in each data record by evaluating a formula expression against existing fields.
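Mimesis is a Python synthetic data generator for creating realistic fake datasets and mock data for software testing and development. It works as a schema-based dataset generator that produces structured records and relational datasets, and it can also serve as a production data masking tool that replaces sensitive information with synthetic values. The library is distinguished by comprehensive multilingual support, allowing locale-specific information to be generated to simulate regional user profiles. It ensures reproducibility through seeded, deterministic data generation, producing consistent datasets across runs. The tool covers a wide range of synthetic content, including personal identities, financial data, geographic addresses, network metadata, and scientific sequences. Its features extend to data transformation through conditional logic and pipelines, as well as integration with DataFrames and the factory pattern. It also supports generating standardized system codes, cryptographic tokens, and binary file mocks. The framework is extensible through custom data providers and field handlers, allowing users to integrate domain-specific logic and external JSON files for specialized data generation.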
Modifies synthetic data values using functions for case conversion, padding, truncation, and encoding.
Visual Insights is an automated exploratory data analysis platform and causal inference tool designed to discover patterns and cause-and-effect relationships within datasets. It functions as an interactive data visualization library using a grammar-of-graphics approach to generate multi-dimensional charts and dashboards. The project distinguishes itself through a natural language interface that translates plain-text questions into data answers and visualizations via a language model. It provides a specialized framework for causal discovery and inference, allowing users to identify variable li
Applies transformations to fields, including encoding categorical variables and grouping time units.
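This project is a change data capture (CDC) system and synchronization layer for moving data from MySQL databases into Elasticsearch indexes. It acts as a relational-to-document mapper that converts database tables into searchable documents for real-time data integration and full-text search. The synchronizer stands out by supporting relational data denormalization, converting one-to-many database joins into parent-child document structures. It also allows sharded table aggregation, using regular expression patterns to group multiple database tables into a single search index. The system covers comprehensive data mapping and transformation, including field type conversion, schema mapping, and filtering of synchronized fields. It uses a pipeline-based processing model to decode and merge fields, relying on a snapshot-based initial load as the baseline and on binary log streaming for real-time updates.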
Renames columns and converts data types to transform strings into arrays or integers into dates during synchronization.
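A minimal Python sketch of this kind of mapping, renaming columns and converting a delimited string into an array or an integer into a date, might look like the following. It does not reproduce the project's configuration format; `map_row`, the rule keys `rename`, `type`, and `sep`, and the assumption that integer dates are Unix epoch seconds in UTC are all illustrative choices.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def map_row(row: Dict[str, Any], rules: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Turn a database row into a document using hypothetical per-column rules."""
    doc = {}
    for column, value in row.items():
        rule = rules.get(column, {})
        target = rule.get("rename", column)  # keep the column name if no rename is given
        kind = rule.get("type")
        if value is not None and kind == "array":
            # Split a delimited string into a list, skipping empty parts.
            value = [part for part in value.split(rule.get("sep", ",")) if part]
        elif value is not None and kind == "date":
            # Assumption: the integer is a Unix timestamp in seconds (UTC).
            value = datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
        doc[target] = value
    return doc


# Made-up sample row and rules.
document = map_row(
    {"id": 7, "tag_list": "db,search,cdc", "created": 86400},
    {
        "tag_list": {"rename": "tags", "type": "array"},
        "created": {"rename": "created_at", "type": "date"},
    },
)
```

Null values pass through untouched, which keeps the mapping safe for nullable source columns.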
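NeoSync is a database synchronization tool and data pipeline orchestrator designed to move and transform datasets across different environments. It serves as a PII data security platform and synthetic data generator, allowing production data to be synchronized while maintaining privacy compliance. The system uses an event-sourcing coordinator to manage asynchronous data movement, providing automatic retries and failure handling. It stands out by combining rule-based PII anonymization and detection with schema-based synthetic data generation to create synthetic datasets that mimic production properties without exposing private information. The project covers a broad range of functionality, including database subsetting to reduce test data volume, template-driven field transformations for reshaping information, and data pipeline orchestration to maintain referential integrity during synchronization.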
Modifies specific data columns during synchronization using predefined scripts or models to reshape information.
Baserow is a self-hosted, no-code relational database platform built on PostgreSQL. It provides a spreadsheet-like interface for structuring and managing data without writing code, while exposing all database resources via a REST API to support headless architectures. The platform distinguishes itself by integrating large language models and embedding servers to power AI assistants and automated data generation. It further extends its utility as a no-code application builder, allowing users to create custom internal portals, dashboards, and business tools using visual logic and managed data.
Creates new fields by evaluating formulas that reference and depend on other existing fields in the record.
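dcat-admin is a Laravel admin panel framework for rapidly building data-driven administration interfaces. It serves as a CRUD generator and backend scaffolding tool that automatically generates create, read, update, and delete interfaces from database table schemas. The system stands out through its plugin-based extension architecture and its ability to run multiple independent admin instances within a single installation. It provides dedicated tools for mapping external APIs to forms and tables, as well as an event-driven form lifecycle for running custom logic during parsing and submission. The framework covers a broad range of functionality, including role-based access control for managing hierarchical permissions, a comprehensive data management grid with inline editing, and multi-step form workflows. It also includes data visualization tools for operational dashboards and various content handling utilities for chunked large-file uploads and rich text editing. Command-line utilities are provided to automate the generation of admin components and action classes.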
Transforms raw database values into visual elements like badges, hyperlinks, and images to improve data readability.
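This is a reactive state management library designed to handle complex form data and validation logic. It uses an observable-based pattern to synchronize user interface components with the underlying data model, keeping form state consistent across the application. The library provides a structured approach to managing form initialization, field tracking, and lifecycle events. The library is distinguished by its support for deeply nested data structures and hierarchical composition, allowing recursive validation and dynamic updates within complex object trees. It has a schema-driven validation engine that supports synchronous and asynchronous rules, as well as middleware-style interception that lets custom logic monitor or transform data during field updates. Developers can access and manipulate specific fields dynamically using path-based addressing, which provides flexibility when working with large or evolving form models. Beyond core state management, the library includes utilities for data transformation, such as formatting input values and computing field values from other form data. It provides multi-form orchestration to coordinate validation and submission across multiple instances, and it is decoupled from any specific presentation layer, allowing integration with any user interface component library. The framework also provides built-in tools for monitoring field lifecycle events and debugging internal state transitions.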
Cleans or transforms input values automatically, such as trimming whitespace or parsing numeric strings, before they are processed or stored.