2 repositories
Applying user-defined functions to independently grouped data subsets and aggregating the results.
Distinct from User-Defined Data Functions: unlike general UDFs, this topic specifically covers the split-apply-combine pattern on grouped data.
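As a rough illustration of the split-apply-combine pattern described above, here is a minimal sketch using only the Python standard library. The helper `grouped_apply`, the `amount_range` function, and the `sales` sample rows are all hypothetical names invented for this example; they are not taken from either listed repository.

```python
from collections import defaultdict


def grouped_apply(rows, key, func):
    """Split rows by key, apply func to each group, combine into a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each row to the group named by its key.
        groups[key(row)].append(row)
    # Apply + combine: run the user-defined function once per group
    # and collect one result per group key.
    return {group_key: func(group) for group_key, group in groups.items()}


# Hypothetical sample data, made up for illustration only.
sales = [
    {"region": "north", "amount": 10.0},
    {"region": "south", "amount": 4.0},
    {"region": "north", "amount": 30.0},
    {"region": "south", "amount": 6.0},
]


def amount_range(group):
    """User-defined function: spread between largest and smallest amount."""
    amounts = [row["amount"] for row in group]
    return max(amounts) - min(amounts)


result = grouped_apply(sales, key=lambda row: row["region"], func=amount_range)
print(result)
```

The user-defined function sees only the rows of one group at a time, which is what separates this pattern from a general UDF applied row by row or to the whole table. In pandas, roughly the same idea is expressed with `DataFrame.groupby(...)` followed by `.apply(...)`.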
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Implements the split-apply-combine pattern by executing user-defined functions on independently grouped data subsets.
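This project is a high-performance tabular data processing framework for R, designed to handle massive datasets with memory efficiency and speed. It provides an enhanced data structure that uses reference semantics and in-place modification to perform complex transformations without the overhead of unnecessary object copies. The library stands out for its low-level architectural optimizations, including multithreaded parallel processing, radix sorting, and memory-mapped file parsing. By offloading critical data manipulation and aggregation routines to compiled C code, it achieves fast execution of tasks that would otherwise be computationally expensive. Its core engine supports advanced relational operations such as non-equi joins, rolling joins, and overlapping interval joins, as well as automatic secondary indexing to speed up repeated data access. Beyond its main processing capabilities, the project also provides a comprehensive set of tools for data lifecycle management. This includes high-speed ingestion and serialization tools with automatic type detection, and specialized support for time series analysis and multidimensional aggregation. The framework is designed for scalability, allowing users to perform complex grouping, filtering, and reshaping operations on datasets containing billions of rows while maintaining system stability and performance.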
Executes custom calculations on subsets of data within each group for complex analytical workflows.