Dataframe

This library is a data processing framework for the JVM that provides a type-safe environment for manipulating structured tabular data. It functions as a comprehensive toolset for performing complex data transformations, aggregations, and statistical analysis, while leveraging compile-time schema validation to ensure structural integrity across data pipelines.

The project distinguishes itself through its deep integration with interactive notebook environments and its use of compile-time code generation. By automatically deriving and enforcing schemas from raw inputs, it generates type-safe accessors that enable IDE autocompletion and static verification of column names. This architecture allows developers to build functional processing pipelines while keeping column access type-checked, so misspelled column names and mismatched column types are caught at compile time instead of surfacing as runtime errors.
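The idea of generated type-safe accessors can be illustrated with a minimal sketch in plain Kotlin. This is not the library's API: the `Row` class, the `name` and `age` columns, and the `adultNames` helper are all hypothetical, and the extension properties are written by hand here to stand in for what a code generator could emit from an inferred schema.

```kotlin
// Hypothetical untyped row: values are looked up by column name at runtime.
class Row(private val values: Map<String, Any?>) {
    // Throws NoSuchElementException if the column does not exist.
    operator fun get(column: String): Any? = values.getValue(column)
}

// Hand-written stand-ins for accessors a code generator could emit
// for an inferred schema of (name: String, age: Int).
val Row.name: String get() = this["name"] as String
val Row.age: Int get() = this["age"] as Int

// Callers use typed properties, so a typo such as `it.agee` fails to compile
// instead of failing on a string lookup at runtime.
fun adultNames(rows: List<Row>): List<String> =
    rows.filter { it.age >= 18 }.map { it.name }

fun main() {
    val rows = listOf(
        Row(mapOf("name" to "Alice", "age" to 15)),
        Row(mapOf("name" to "Bob", "age" to 20)),
    )
    println(adultNames(rows)) // prints [Bob]
}
```

The string keys appear only inside the accessor definitions, so the compiler and the IDE can check every other reference to a column.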

The library supports a broad range of data workflows, including importing and mapping relational database schemas, performing geospatial analysis, and executing complex data pivoting. It includes extensive utilities for data construction, filtering, sorting, and the calculation of descriptive statistics. It also covers visualization and reporting: users can render interactive HTML tables, compose documents, and generate charts directly from structured datasets.

The library is designed for seamless use within Kotlin and Java development environments, with specialized support for automated dependency management and kernel integration in interactive notebooks.

Features

Data Analysis Frameworks - Provides a comprehensive toolset for complex data transformations, aggregations, and statistical analysis within JVM environments.
Tabular Data Analysis - Enables interactive data processing and visualization directly within notebook environments for rapid exploration.
Data Processing Libraries - Provides a type-safe library for manipulating structured tabular data with compile-time schema validation and IDE autocompletion.
Type-Safe Schema Definitions - Generates and enforces data schemas using code-based interfaces to ensure compile-time safety and IDE autocompletion.
Type-Safe Structured Data Frameworks - Enforces strict property requirements and data integrity for structured data objects using compile-time type checking.
Group-By Aggregations - Partitions rows by key values to compute summary statistics like sums and counts.
SQL Data Loaders - Converts database tables and query results into structured data frames with memory-efficient row limits.
Type-Safe Data Transformations - Provides a type-safe, functional pipeline for filtering, aggregating, and transforming tabular data.
Data Format Importers - Parses structured data from files, databases, and strings into unified, type-safe data structures.
SQL Schema Integrations - Imports and maps relational database schemas into structured objects to simplify querying and aggregation.
Functional Data Pipelines - Transforms data through a series of immutable operations that maintain type safety and structural integrity.
Compile-Time Code Generation - Generates type-safe accessors and extension properties at compile time to enable IDE autocompletion.
Type-Safe Row Scanning - Maps tabular column identifiers to strongly-typed object properties to prevent runtime errors during data manipulation.
Automatic Schema Derivations - Automatically derives and enforces data structures from raw inputs to ensure consistent and reliable column access.
Interactive Notebooks - Provides a data manipulation engine that integrates with notebook kernels for visual exploration and structured analysis.
Data Reporting - Transforms processed datasets into interactive HTML tables and formatted reports for visual data summaries.
Schema-Driven Data Normalizers - Projects untyped input data onto predefined interfaces to enforce structural consistency.
Data Reshaping Operations - Reshapes grouped data into matrix-like structures by rotating column values into new headers.
Database Layout Extraction - Extracts structural metadata from database tables and query results to simplify mapping data fields.
Row Aggregations - Computes mathematical aggregates like sums and standard deviations across row values.
Schema Inference - Maps untyped data to defined interfaces or classes to enforce column names and types throughout the processing pipeline.
Table-to-HTML Converters - Converts tabular data structures into interactive HTML tables with support for hierarchical data and custom formatting.
Tabular Data Manipulations - Creates structured datasets from collections of values for organized storage and manipulation.
Notebook Execution Environments - Executes data processing workflows directly within interactive environments using specialized kernel support and automated dependency management.
Notebook Environment Integrations - Configures data manipulation tools across development environments and notebook kernels to enable structured data analysis.
Notebook Rendering Utilities - Displays tabular data as interactive, formatted tables within notebook cells to facilitate visual inspection.
Data Visualization Libraries - Generates charts and plots directly from data structures using a type-safe plotting language.
Column Summary Calculators - Calculates column types, null counts, and basic descriptive statistics to provide an overview of dataset structure.
Data Grid Row Sorting - Orders datasets based on column values and extracts specific subsets of rows.
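
Several of the operations listed above (group-by aggregation, reshaping by pivoting, and sorting with subset extraction) can be sketched with the Kotlin standard library alone. This is an illustration of the concepts under stated assumptions, not the library's own API: the `Sale` record, its fields, and the three helper functions are hypothetical names chosen for the example.

```kotlin
// Hypothetical record type standing in for one row of a table.
data class Sale(val region: String, val product: String, val amount: Int)

// Group-by aggregation: partition rows by key, then sum each partition.
fun totalByRegion(rows: List<Sale>): Map<String, Int> =
    rows.groupBy { it.region }
        .mapValues { (_, group) -> group.sumOf { it.amount } }

// Pivot: region -> (product -> total), so product values act as column headers.
fun pivotByProduct(rows: List<Sale>): Map<String, Map<String, Int>> =
    rows.groupBy { it.region }.mapValues { (_, group) ->
        group.groupBy { it.product }
            .mapValues { (_, cell) -> cell.sumOf { it.amount } }
    }

// Sorting plus subset extraction: the n largest sales by amount.
fun topSales(rows: List<Sale>, n: Int): List<Sale> =
    rows.sortedByDescending { it.amount }.take(n)

fun main() {
    val sales = listOf(
        Sale("North", "Tea", 10),
        Sale("North", "Coffee", 5),
        Sale("South", "Tea", 7),
        Sale("North", "Tea", 3),
    )
    println(totalByRegion(sales))                 // {North=18, South=7}
    println(pivotByProduct(sales))                // {North={Tea=13, Coffee=5}, South={Tea=7}}
    println(topSales(sales, 2).map { it.amount }) // [10, 7]
}
```

Each function returns a new collection and leaves its input untouched, which mirrors the immutable pipeline style described in the feature list.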

dask/dask

13,746 stars

Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl

man-group/dtale

5,170 stars

dtale is a web-based interactive grid and visualizer for pandas dataframes, designed as an exploratory data analysis tool. It provides a browser-based interface for analyzing tabular data structures, allowing users to calculate statistics, detect outliers, and compute correlations without writing manual code. The project functions as an embedded data viewer that can be integrated into web applications via iframes or custom routes, with specific support for Django, Flask, and Streamlit. It enables the exploration of datasets through a combination of an interactive data grid and a data visualiz

datawhalechina/joyful-pandas

5,164 stars

This project is a comprehensive pandas data analysis tutorial and instructional guide designed for learning data manipulation and analysis. It serves as a tabular data processing guide and a manual for time series analysis, providing a structured approach to cleaning, merging, and transforming datasets. The repository functions as a data feature engineering course, providing tutorials on constructing and selecting dataset features to improve machine learning model performance. It also includes a vectorized data operations guide for performing element-wise mathematical computations and matrix

tidyverse/dplyr

5,034 stars

dplyr is an R data manipulation library that provides a grammar for transforming tabular data frames. It functions as an in-memory data frame processor and a relational data algebra tool, using a consistent set of verbs to filter, select, and summarize data. The project includes a SQL translation engine that converts high-level data manipulation expressions into optimized queries. This allows users to perform transformations directly on remote relational databases and cloud storage without pulling data locally. The library covers a broad range of tabular operations, including column mutation

Kotlindataframe

Features