14 repositories
High-performance data processing utilizing the Apache Arrow columnar memory format.
Explore 14 awesome GitHub repositories matching data & databases · Apache Arrow Processing.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Uses the high-performance Apache Arrow columnar memory format to transfer large datasets between servers and clients.
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Implements chart annotations including arrows, brackets, callouts, and text labels to highlight specific data points.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Stores and processes data in Apache Arrow's columnar format for zero-copy sharing and vectorized operations.
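As a rough illustration of what processing data in columnar batches through a streaming pipeline means, here is a minimal standard-library sketch. It is not DataFusion's API: the `scan`, `filter_batches`, and `sum_batches` helpers are hypothetical names, and Python's `array` module stands in for Arrow buffers.

```python
from array import array


def scan(column, batch_size):
    """Yield fixed-size slices of a column, one batch at a time."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filter_batches(batches, predicate):
    """Streaming filter: keep matching values, batch by batch."""
    for batch in batches:
        yield array(batch.typecode, (v for v in batch if predicate(v)))


def sum_batches(batches):
    """Aggregate operator: pulls batches from upstream and sums them."""
    return sum(sum(batch) for batch in batches)


# A single int64 column holding the values 1..10 (illustrative data).
values = array("q", range(1, 11))

# Operators are chained lazily; only one batch is materialized at a time.
even_total = sum_batches(filter_batches(scan(values, 4), lambda v: v % 2 == 0))
print(even_total)
```

Each operator only sees one batch at a time, which is the property that lets a batch-oriented engine bound its memory use regardless of the total column length.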
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Provides a high-performance DataFrame library based on the Apache Arrow columnar memory layout.
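The combination of lazy evaluation and memory-mapped files described above can be sketched with the Python standard library alone. This is not Vaex's API: `LazyColumn` is a hypothetical class, and the file is assumed to hold raw native-endian float64 values.

```python
import mmap
import os
import tempfile
from array import array


class LazyColumn:
    """Hypothetical helper: records operations, touches the file only on sum()."""

    def __init__(self, path, ops=()):
        self.path = path
        self.ops = tuple(ops)

    def map(self, fn):
        # Deferred: returns a new column description, reads nothing.
        return LazyColumn(self.path, self.ops + (fn,))

    def sum(self):
        total = 0.0
        with open(self.path, "rb") as f:
            # Map the file instead of loading it; the OS pages data in on demand.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                with memoryview(mm) as raw, raw.cast("d") as values:
                    for v in values:
                        for fn in self.ops:
                            v = fn(v)
                        total += v
        return total


# Write a small float64 column to disk (illustrative data).
fd, path = tempfile.mkstemp(suffix=".f64")
with os.fdopen(fd, "wb") as f:
    array("d", [1.0, 2.0, 3.0, 4.0]).tofile(f)

col = LazyColumn(path).map(lambda v: v * 2)  # nothing is read yet
result = col.sum()                           # file is mapped and scanned here
os.remove(path)
print(result)
```

Because the column is only a path plus a list of pending operations, chaining transformations costs nothing until a result is requested, and the scan never needs the whole file in RAM at once.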
Fireworks Tech Graph is a tool that generates SVG and PNG technical diagrams from natural language descriptions, supporting both English and Chinese input. It produces publication-quality diagrams for AI architectures, UML types, and other technical domains without requiring manual drawing or diagramming syntax. The tool distinguishes itself through a semantic shape vocabulary and arrow-based flow encoding that conveys component roles and data flow types through consistent geometric shapes, stroke widths, dash patterns, and colors rather than relying on textual labels. It renders the same dia
Encodes flow types with line width, dash pattern, and color for clear communication in diagrams.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Converts retrieval job results into Apache Arrow tables for efficient columnar access.
ScottPlot is a cross-platform, high-performance charting library for .NET that renders interactive plots across desktop and web GUI frameworks including Windows Forms, WPF, MAUI, Avalonia, Blazor, and WinUI. It provides an optimized rendering engine capable of displaying millions of data points with interactive pan, zoom, and live data streaming, while also supporting image export to formats like PNG and SVG for file output, cloud applications, and notebooks. The library distinguishes itself through a comprehensive set of chart types including scatter, line, bar, pie, heatmap, financial, rada
Places an arrow pointing to a specific location in coordinate space, with extensive customization options.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Aggregates multiple tables and sends them in a single gRPC request using Arrow IPC.
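This is a grammar-of-graphics visualization library for building charts by mapping tabular data to visual marks. It serves as an SVG data visualization tool and an exploratory data analysis API, allowing users to render complex visualizations and geographic maps. The library features a GeoJSON map renderer that projects spherical coordinates into two-dimensional pixel space, as well as an Apache Arrow visualization interface for efficient data processing. Its feature surface covers data transformation through binning and grouping, visual encoding through automatic scale inference and color scheme application, and the generation of small multiples. It supports rendering geometric shapes in layered views and exporting static images in server-side environments.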
Processes diverse input structures, including high-efficiency Apache Arrow tables, for optimized data visualization.
GluonTS is a framework for probabilistic time series forecasting, designed to predict future values as probability distributions with confidence intervals. It supports both traditional model training and zero-shot forecasting, where pretrained models generate predictions for new series without additional training. The project distinguishes itself by integrating a wide variety of forecasting approaches into a unified workflow. This includes deep learning architectures such as recurrent neural networks and causal convolutions, as well as the integration of external statistical models, the Proph
Transforms serialized Apache Arrow data back into time series formats with optional column reshaping.
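GluonTS is a probabilistic time series library and deep learning forecasting framework. It provides a toolkit for building, training, and evaluating neural network architectures, quantifying uncertainty by predicting future values as probability distributions. The project is distinguished by its support for zero-shot forecasting and its integration of multiple modeling approaches, including deep probabilistic neural networks and wrappers for external statistical libraries such as Prophet and R forecast. It implements specialized architectural primitives such as causal convolutions and invertible residual networks to prevent information leakage and to map latent representations to valid probability distributions. The framework covers comprehensive data engineering functionality, including time series scaling, bijective transformations, and hierarchical modeling. It uses Apache Arrow and Parquet for high-performance dataset streaming and random-access management. For model evaluation, it includes an evaluation suite that uses metrics such as quantile loss and the continuous ranked probability score (CRPS) to measure forecast accuracy and probabilistic coverage. The library supports model deployment through integration with Amazon SageMaker.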
Utilizes the Apache Arrow columnar memory format for high-performance data processing and streaming.
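This C++ data visualization library is a scientific plotting framework for creating 2D and 3D charts, network graphs, and geographic maps. It operates as a multi-backend graphics library, decoupling high-level plotting logic from low-level rendering engines to support a variety of output backends. The project stands out for its dual-interface API, which offers both a global function interface for rapid prototyping and an object-oriented interface for precise control. It has a component-based layout engine for managing tiled grids and subplots, and a hierarchical plotting state that allows multiple data series to be overlaid without clearing the axes. The library covers a broad range of visualization features, including mathematical function plotting, vector fields, and multidimensional data analysis through heatmaps and parallel coordinates. It includes dedicated tools for geographic data visualization (such as geographic bubble charts and geographic density plots), as well as tools for rendering directed and undirected graph networks. General features include axis management, aesthetic styling with colormaps, and export of high-quality figures. The project uses CMake for build automation and dependency retrieval to ease installation across operating systems.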
Implements visual annotations such as directed arrows and text labels to highlight specific data points.
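Fury is a multi-language binary serialization framework designed to encode domain objects and complex graphs to facilitate cross-language data exchange. It includes an interface definition language (IDL) compiler that converts schema definitions into idiomatic native types and serialization boilerplate in multiple languages. The framework stands out for a zero-copy binary reader, which allows access to specific fields without deserializing the entire object, and an object graph serializer that preserves circular references and referential integrity. It also has a data converter that transforms row-based binary data into the columnar Apache Arrow format for analytical workloads. The framework covers a wide range of functional areas, including metadata-driven schema evolution for forward and backward compatibility, a build-time AOT compilation process that eliminates runtime reflection, and secure deserialization through whitelist-based type validation. It also provides integration for high-performance remote procedure calls over gRPC.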
Converts serialized row-based data into Apache Arrow columnar formats to enable high-performance analytical workloads.
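To make the row-to-columnar conversion concrete, the following is a minimal sketch using only the Python standard library. It does not use Fury's or Arrow's actual APIs: `rows_to_columns` and the schema mapping are hypothetical, and typed `array` objects stand in for Arrow's contiguous column buffers.

```python
from array import array


def rows_to_columns(rows, schema):
    """Transpose row records into one contiguous typed array per field.

    `schema` maps a field name to an array typecode
    ('q' for int64, 'd' for float64).
    """
    columns = {name: array(code) for name, code in schema.items()}
    for row in rows:
        for name in schema:
            columns[name].append(row[name])
    return columns


# Row-oriented records, as a serializer would typically produce them.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

columns = rows_to_columns(rows, {"id": "q", "price": "d"})

# An analytical query now scans a single contiguous buffer
# instead of visiting every field of every row.
total = sum(columns["price"])
print(total)
```

The layout change is what benefits analytical workloads: an aggregate over one field reads only that field's buffer, which is friendlier to CPU caches and vectorized execution than walking row objects.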
Uptrace is an OpenTelemetry-based observability platform designed to collect, store, and analyze distributed traces, metrics, and logs. It functions as a centralized logging backend, a distributed tracing system, and a metrics engine to monitor application performance and system health. The platform is distinguished by AI-powered operational capabilities, allowing users to query telemetry data and manage monitoring dashboards using natural language. It specifically includes specialized monitoring for generative AI pipelines, tracking token usage and response quality for LLM interactions and r
Transports tracing, metrics, and logs using the OTel Arrow columnar format to reduce bandwidth consumption.