Why is pentaho/pentaho-kettle a recommended Data Format Transformations repository?

Converts information between different file formats to ensure compatibility when moving data across disparate systems.

Why is alasql/alasql a recommended Data Format Transformations repository?

Transforms data between formats, such as reading CSV or XLSX and writing the results as JSON.

Why is bookshelf/bookshelf a recommended Data Format Transformations repository?

Parses and formats attribute values when reading from or writing to the database for data normalization.

Why is apache/pinot a recommended Data Format Transformations repository?

Applies mathematical, string, and date transformations to incoming data streams for normalization.

Why is cube2222/octosql a recommended Data Format Transformations repository?

Treats CSV, JSONLines, and Parquet files as virtual tables for analysis and transformation via SQL.

Why is turboway/bigdata_analyse a recommended Data Format Transformations repository?

Transforms raw JSON formatted source data into cleaned CSV files for downstream analytical processing.

Why is kiln-ai/kiln a recommended Data Format Transformations repository?

Converts raw input data into structured formats using templates for cleaning and reshaping.

Why is chriskacerguis/codeigniter-restserver a recommended Data Format Transformations repository?

Transforms server output into specific formats to meet the requirements of different third-party API consumers.

Why is hashicorp/consul-template a recommended Data Format Transformations repository?

Converts data structures into JSON, YAML, TOML, or base64 strings with pretty-printing.

Why is stleary/json-java a recommended Data Format Transformations repository?

Transforms data between JSON and web-specific formats such as browser cookies and comma-delimited lists.

20 repositories

Awesome GitHub Repositories: Data Format Transformations

Tools for converting data from one structured format to another, such as CSV to JSON, using a processing engine.

Distinct from Data Formats and Parsers: that category covers animation formats and generic parsers, whereas this one is about the act of transformation itself.
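
As a minimal sketch of the CSV-to-JSON case named in the category description, the following uses only Python's standard library. It is not taken from any of the listed repositories; the function name `csv_to_json` and the sample data are made up for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects.

    Hypothetical helper for illustration only. Every value stays a string,
    because CSV carries no type information.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # one dict per data row, keyed by the header names
    return json.dumps(rows, indent=2)


# Made-up sample input: a header line followed by two records.
sample = "name,language\nalasql,JavaScript\noctosql,Go\n"
result = csv_to_json(sample)
print(result)
```

The tools listed here add what this sketch leaves out, such as type inference, streaming of large inputs, and support for many more formats.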

Explore 20 awesome GitHub repositories matching data & databases · Data Format Transformations. Refine with filters or upvote what's useful.


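pentaho/pentaho-kettle
8,353 stars on GitHub
Pentaho Kettle is an enterprise-grade ETL data integration platform designed to extract, transform, and load data between different source and target databases. It acts as a metadata-driven orchestrator, using a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system is characterized by its distributed data processing engine, which executes workloads across a cluster of server nodes to increase throughput. It uses a plugin-based architecture that allows the platform to be extended through external JAR files, providing connectivity to a variety of databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure conversion. It provides tools for data quality validation, pipeline automation, and job lifecycle management, along with monitoring utilities for tracking server health and real-time execution status.
Converts information between different file formats to ensure compatibility when moving data across disparate systems.
Java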
AlaSQL/alasql
7,278 stars on GitHub
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Transforms data between formats, such as reading CSV or XLSX and writing the results as JSON.
JavaScript
bookshelf/bookshelf
6,352 stars on GitHub
Bookshelf is a JavaScript ORM for Node.js that provides a structured way to define and interact with database models. It centers on a model-driven approach where developers register models, define their relations, and manage data persistence through a consistent interface. The library distinguishes itself through its comprehensive handling of model relationships and data transformations. It supports defining one-to-one, one-to-many, many-to-many, and polymorphic associations, with the ability to eager load related models in a single query to avoid performance pitfalls. Bookshelf also automate
Parses and formats attribute values when reading from or writing to the database for data normalization.
JavaScript
apache/pinot
6,098 stars on GitHub
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Applies mathematical, string, and date transformations to incoming data streams for normalization.
Java
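cube2222/octosql
5,258 stars on GitHub
Octosql is a federated SQL query engine, data converter, and streaming SQL processor. It lets users run a single SQL statement across multiple heterogeneous data sources, including different types of databases and file formats, merging and transforming the result sets. The system is distinctive in treating CSV, JSONLines, and Parquet files as virtual tables and in using a plugin-based architecture to extend connectivity to external storage engines. As a stream processor for unbounded data streams, it uses watermarks, retractions, and tumbling windows to maintain consistency for out-of-order events. It can also serve as a SQL data generator, producing synthetic datasets and record streams through table-valued functions. The engine supports cross-source joins and multi-source analysis, optimized with predicate push-down to the source to reduce data transfer. It manages complex data through a static type system that includes union types, and it offers query execution plan visualization for better observability.
Treats CSV, JSONLines, and Parquet files as virtual tables for analysis and transformation via SQL.
Go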
TurboWay/bigdata_analyse
5,238 stars on GitHub
This project is a collection of big data frameworks and pipelines, including an Apache Hive analysis framework, a behavioral data analytics platform, a predictive analytics engine, and real-time data pipelines. It provides the infrastructure for building Extract, Transform, Load (ETL) workflows to process large datasets for distributed storage and SQL-based analysis. The system supports diverse analytical implementations, such as a predictive engine using linear regression for value forecasting and a real-time architecture that moves data through message brokers for immediate reporting. It in
Transforms raw JSON formatted source data into cleaned CSV files for downstream analytical processing.
Python (tags: hql, python, sql)
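kiln-ai/kiln
4,910 stars on GitHub
Kiln is an LLM development workbench and evaluation framework built for designing, testing, and optimizing prompts and AI agents. As a multi-agent orchestrator and RAG optimization tool, it provides a visual interface for the iterative development of AI systems. The project stands out for its comprehensive fine-tuning pipeline, which supports zero-code model training and reasoning distillation. It supports building hierarchical multi-agent systems in which specialized workers collaborate through tool calls, and it implements a Model Context Protocol (MCP) server that exposes these agents and retrieval capabilities to external clients as standardized tools. The platform covers a wide range of capabilities, including automated AI-judge scoring for quality assurance, synthetic data generation for training and evaluation, and hybrid vector-keyword retrieval for improving model responses. It also provides tools for prompt evolution, trace auditing, and managing collaborative datasets through Git integration. The workbench can run programmatic workflows through a self-hostable REST API and a dedicated Python library.
Converts raw input data into structured formats using templates for cleaning and reshaping.
Python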
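chriskacerguis/codeigniter-restserver
4,876 stars on GitHub
codeigniter-restserver is a REST API framework and controller library for building RESTful servers in a CodeIgniter PHP environment. As a backend implementation, it handles the standard HTTP methods and exposes data and functionality through structured endpoints. The project includes a customizable response engine that allows output data to be converted into various specific formats through custom formatting methods. The library provides tools for mapping incoming HTTP requests to controller methods, managing resource responses, and implementing configuration-based access control.
Transforms server output into specific formats to meet the requirements of different third-party API consumers.
PHP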
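hashicorp/consul-template
4,830 stars on GitHub
Consul Template is a configuration renderer and dynamic configuration manager that generates files by populating templates with data from Consul and Vault. As a service discovery templating engine and secrets management integrator, it turns cluster catalog and health data into formatted configuration files. The tool stands out by acting as a process supervisor and notifier, able to run shell commands or restart applications automatically after a template updates. It has a long-polling watcher for monitoring remote key-value stores and uses a shared locking mechanism to coordinate updates across multiple instances, preventing simultaneous service restarts. The system covers a broad set of features, including automated secret rotation for PKI certificates and Vault credentials, data format conversion for JSON and YAML, and execution of external binary plugins for custom data processing. It also provides infrastructure bootstrapping and distributed rendering synchronization, reducing API load through leader-based query deduplication.
Converts data structures into JSON, YAML, TOML, or base64 strings with pretty-printing.
Go (tags: consul, golang, vault)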
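stleary/JSON-java
4,717 stars on GitHub
JSON-java is a Java library for parsing and generating JSON text and mapping it to Java objects and collections. As a serialization framework, it converts class instances and data structures into standardized JSON strings. The project includes a JSON Pointer implementation for retrieving specific values from a document using string or URI fragment representations. It also provides a converter for translating data structures between JSON and XML, and a translator for converting between JSON and web formats such as HTTP headers, cookies, and comma-delimited lists. The library covers a wide range of JSON processing functionality, including object serialization and deserialization. It supports flexible parsing of JSON text into objects and generation of standardized JSON documents.
Transforms data between JSON and web-specific formats such as browser cookies and comma-delimited lists.
Java (tags: hackoberfest2023, hacktoberfest, java)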
rudderlabs/rudder-server
4,437 stars on GitHub
Rudder Server is a customer data platform and event routing pipeline designed to collect, transform, and route customer event data from various sources to data warehouses and business tools. It functions as a customer identity resolver, linking identifiers from multiple sources to build a unified identity graph and comprehensive behavioral customer profiles. The system differentiates itself through reverse ETL capabilities, which push processed customer segments and audiences from data warehouses back into operational third-party applications. It also provides a containerized data plane for K
Converts event data into destination-specific formats using a pipeline of enrichment, filtering, and anonymization functions.
Go (tags: bigquery, cdp, customer-data)
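mosaicml/llm-foundry
4,415 stars on GitHub
llm-foundry is a large language model training framework that provides a system for foundation model pretraining and supervised fine-tuning. It includes a distributed trainer for scaling workloads across multiple nodes and GPUs, a dataset streaming pipeline for loading data from cloud storage, and parameter-efficient fine-tuning implementations. The framework stands out by using parameter sharding and high-throughput data streaming to stay stable during large-scale training. It incorporates low-rank adaptation (LoRA) to reduce compute cost and uses 8-bit floating-point precision to speed up computation on compatible hardware. The codebase covers a wide range of functionality, including dataset engineering that converts raw data into compressed formats, model performance benchmarking through an evaluation suite, and the ability to export model weights to standardized industry formats. It also supports custom component registration through decorators and provides control over positional embedding methods.
Transforms raw data into compressed, streaming-compatible formats to improve training efficiency and throughput.
Python (tags: deep-learning, llm, neural-networks)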
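assemble/assemble
4,258 stars on GitHub
Assemble is a static site generator and build pipeline system that compiles Markdown, templates, and data into static HTML files. As a Markdown-to-HTML converter and data format converter, it can move content between JSON, YAML, XML, PLIST, and CSV formats. The project has a pipeline-based build process in which users define an ordered sequence of data transformation and file processing steps. It includes project scaffolding tools for bootstrapping directory structures and configuration files from predefined boilerplates. The system manages content through collection-based filtering and hierarchical layout nesting, allowing pages to be organized by tags and categories. It supports pluggable template engines, customizable helper functions, and injection of YAML front matter to control rendering logic. The toolkit also provides utilities for compiling LESS stylesheets, managing site permalinks, and watching for file changes to trigger automated build tasks.
Converts files between JSON, YAML, XML, PLIST, and CSV formats using a transformation engine.
CSS (tags: assemble, blog-engine, build)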
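andersao/l5-repository
4,205 stars on GitHub
This project is a database abstraction layer for Laravel that implements the repository pattern to decouple business logic from Eloquent database queries. It provides a standard interface for data retrieval, pagination, and filtering. The system includes a query criteria mechanism for applying reusable search conditions based on request parameters, and a cache wrapper that automatically clears stored results when records are created, updated, or deleted. It also has a presentation layer for converting raw database model attributes into formatted output for user interfaces. Other features include command-line tools for scaffolding models, repositories, controllers, and service providers, as well as tools for validating repository data and transforming model attributes.
Formats data objects using presenters to decouple internal database structures from the final output.
PHP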
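SylphAI-Inc/AdalFlow
4,167 stars on GitHub
AdalFlow is an autonomous AI agent framework and LLM application library for building modular workflows. As a model-agnostic interface and RAG pipeline orchestrator, it lets users develop ReAct agents that solve complex tasks through iterative reasoning and external tool execution. The project stands out for a prompt optimization system that uses textual gradient descent to automatically optimize prompt templates and few-shot examples. It treats model feedback as a differentiable signal, implementing a form of LLM backpropagation that iteratively improves output quality against evaluation metrics. The framework covers a broad feature surface, including retrieval-augmented generation with semantic vector search and reranking, span-based execution tracing for observability, and schema-driven structured parsing. It provides a unified communication layer for many proprietary and open-source model providers and supports converting Python functions into standardized tool interfaces. The system is implemented in Python and integrates with MLflow for workflow tracking and analysis.
Converts data between dictionaries, JSON, YAML, and dataclass objects to facilitate internal data movement.
Python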
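kashav/fsql
3,986 stars on GitHub
fsql is a command-line tool that provides a SQL-like query language for finding files and directories on the local disk. As a file system query engine, it lets users isolate files by running structured statements against metadata instead of using standard command-line flags. The tool has an interactive read-eval-print loop (REPL) and supports multi-line queries and recursive subqueries, in which the results of a nested search serve as conditions for the outer query. The search scope is configurable through the resolution of absolute paths, relative paths, environment variables, and glob patterns. The system applies algebraic operators, regular expressions, and logical filters to file attributes such as hash, size, and modification time. It includes data transformation utilities for formatting these attributes as human-readable timestamps and normalized size units.
Converts file attribute values into specific display formats, including size unit conversion and timestamp styling.
Go (tags: find, golang)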
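Rdatatable/data.table
3,894 stars on GitHub
This project is a high-performance tabular data processing framework for R, designed to handle massive datasets with memory efficiency and speed. It provides an enhanced data structure that uses reference semantics and in-place modification to perform complex transformations without the overhead of unnecessary object copies. The library stands out for its low-level architectural optimizations, including multithreaded parallel processing, radix sorting, and memory-mapped file parsing. By offloading key data manipulation and aggregation routines to compiled C code, it executes otherwise computationally expensive tasks quickly. Its core engine supports advanced relational operations such as non-equi joins, rolling joins, and overlapping interval joins, along with automatic secondary indexing to speed up repeated data access. Beyond its main processing features, the project provides a comprehensive set of data lifecycle management tools. These include high-speed ingestion and serialization tools with automatic type detection, as well as specialized support for time series analysis and multidimensional aggregation. The framework is designed for scalability, letting users run complex grouping, filtering, and reshaping operations on datasets with billions of rows while maintaining system stability and performance.
Converts tabular data between wide and long formats using optimized casting and melting operations.
R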
multiprocessio/dsq
3,866 stars on GitHub
dsq is a command-line utility that enables SQL-based analysis of local files by treating them as relational database tables. It allows users to execute standard SQL queries against heterogeneous data formats, including JSON, CSV, Excel, and Parquet, without requiring a formal database import process. The tool distinguishes itself by providing a persistent interactive shell for iterative data exploration and schema inspection. It supports complex operations such as joining data across multiple disparate files and converting between structured formats by applying SQL transformations to the inpu
Transforms input files into structured JSON output by applying SQL queries to the input data without requiring manual schema definitions.
Go
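feross/buffer
1,883 stars on GitHub
Buffer is a binary data manipulation library that provides a browser-compatible implementation of the Node.js binary data application programming interface (API). It lets developers create, modify, and process raw binary data structures in web environments using an interface consistent with the server-side standard. The library stands out by offering a unified approach to cross-platform JavaScript development, allowing code to be shared between server and browser environments. It achieves this by polyfilling the standard binary methods and extending the native byte array prototype, so developers can manage memory and data structures without depending on environment-specific implementations. The toolkit includes tools for endianness-aware data access and for zero-copy slicing, which operates on memory segments without copying the payload. It also supports broad data compatibility by facilitating conversion between buffers, typed arrays, and Blobs, so that binary data can be exchanged between different web interfaces and storage formats.
Ensures seamless data exchange between different web interfaces and storage formats by converting between buffers, typed arrays, and blobs.
JavaScript (tags: browser, browserify, buffer)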
reZach/my-budget
956 stars on GitHub
My-budget is a cross-platform desktop application designed for personal finance management. It functions as a local-first budgeting tool that allows users to track income and expenses while maintaining complete control over their financial data without relying on cloud services. The application distinguishes itself by integrating automated transaction ingestion, which retrieves and parses financial records directly from banking websites. To ensure privacy, all stored transaction history and budget records are protected by local encryption using user-defined passphrases, keeping sensitive info
Converts raw scraped financial information into standardized internal formats for consistent tracking and reporting.
JavaScript (tags: budgeting, freesoftware)
