Polars

Polars is a high-performance columnar data processing library for analytical workloads. It organizes data into typed columns and uses the Apache Arrow memory format, which enables zero-copy data sharing and cache-friendly, vectorized operations. The engine handles large tabular datasets and provides both local and distributed analytical runtimes, scaling from a single machine to multi-node clusters.

The project is distinguished by its lazy query engine, which builds an abstract execution plan instead of running each operation immediately. Because work is deferred until the result is collected, the optimizer can apply predicate pushdown (filtering rows as close to the source as possible) and projection pushdown (selecting only the columns a query needs) to reduce memory overhead and passes over the data. Execution is multi-threaded and parallel, and a streaming batch processor allows analysis of datasets that exceed available system memory by processing them in manageable chunks.

The library provides a comprehensive expression framework for complex data engineering, supporting aggregation, arithmetic, and logical transformations across various data types, including nested structures and categorical data. It integrates with external systems through native connectivity for cloud storage, relational databases, and remote repositories, while offering diagnostic tools to visualize query plans and monitor performance.

Polars is implemented in Rust and is available as a native library with language bindings for Python and R, allowing users to integrate high-performance data manipulation into existing analytical pipelines without complex build steps.

Features

Analytical Data Engines - Processes large-scale tabular datasets with optimized memory usage and fast execution for complex analytical tasks.
Columnar Data Processors - Organizes information into typed columns to enable fast analytical queries and efficient memory utilization.
Distributed Query Engines - Runs data processing queries across a distributed cluster by triggering remote, parallelized computation.
Lazy Evaluation Frameworks - Delays data operations until collection to allow for predicate and projection pushdown optimizations.
Lazy Query Engines - Constructs and optimizes abstract execution plans to enable predicate and projection pushdown.
Lazy Query Pipelines - Builds data processing pipelines using lazy evaluation for modular query construction.
Memory Formats - Implements the Apache Arrow memory format for zero-copy data sharing and high-performance interoperability.
Query Engines - Constructs and optimizes abstract execution plans to minimize data passes and memory overhead during computation.
Query Execution Engines - Generates and executes efficient plans that distribute workloads across all available processor cores.
Columnar Storage Engines - Uses a columnar memory layout to enable cache-friendly processing and efficient vectorized operations.
Compute Contexts - Assigns compute contexts to remote queries to manage execution environments dynamically.
Data Processing Libraries - Organizes information into typed columns to enable efficient memory utilization and fast query execution.
Distributed Analytical Runtimes - Scales data processing workflows from local machines to multi-node clusters for massive datasets.
Distributed Data Processing - Scales data processing workflows from local machines to multi-node clusters for parallelized execution.
Parquet Readers - Loads Parquet files directly into datasets for immediate processing.
Parquet Scanners - Scans Parquet files to create lazy computation holders, enabling predicate and projection pushdown.
Query Optimizers - Optimizes query execution by filtering rows and selecting columns as close to the source as possible.
Query Planning - Constructs and optimizes abstract query plans to minimize data passes and memory overhead.
Remote Query Execution - Runs data processing queries on remote infrastructure using the same interface as local operations.
Compute Cluster Orchestration - Controls the lifecycle of remote compute clusters using context managers and reusable configuration manifests.
Expression Engines - Executes data transformations using a high-performance compiled expression engine.
Grouped Aggregations - Summarizes data by grouping rows based on unique values and applying expressions to each subset.
Out-of-Core Processing - Processes massive files that exceed available system memory by streaming data in smaller chunks.
Parallel Processing - Distributes data processing tasks across available CPU cores to maximize throughput.
Schema Definitions - Maps column names to specific data types to enforce structure during dataset creation.
Streaming Data Pipelines - Handles datasets exceeding system memory through a streaming batch processing pipeline.
Python Tooling - Provides a high-performance interface for Python users to execute complex data workflows and analytical queries.
Language Bindings - Enables R users to perform complex data transformations and analytical operations using a consistent, high-performance syntax.
Data Analysis - Fast DataFrame library implemented in Rust.
Data Manipulation - Fast multi-threaded dataframe library.
Data Manipulation Libraries - Multithreaded, vectorized query engine for DataFrames.
Data Processing Libraries - Multi-threaded, memory-efficient dataframe library.
Numerical Libraries - Blazingly fast DataFrame library for structured data manipulation.
Scientific Computing Libraries - Fast DataFrame library implemented in Rust.
Cloud Data Connectors - Provides high-performance native connectivity for reading and writing data across cloud storage and relational databases.
Data Connectors - Connects to local files, cloud storage, and remote databases for data ingestion and export.
Data Filtering - Removes rows from datasets by applying boolean expressions that satisfy specified conditions.
Data Type Managers - Organizes numeric, temporal, and nested data types while handling null values and type inference.
Lazy Data Scanning - Scans files to create lazy computation holders that defer parsing until execution.
Multi-file Aggregators - Reads and combines multiple files into a single data structure using glob patterns.
Series Constructors - Generates one-dimensional data structures containing elements of a single type.
Single-Node Processing - Runs queries on a single compute node to simplify execution logic and avoid data shuffling overhead.
Categorical Data Optimization - Creates categorical columns that infer categories from data to reduce memory usage and increase speed.
Cloud Data Access - Reads data files directly from cloud storage buckets using URI paths.
Database Connectivity - Retrieves data from relational databases into datasets using connection strings and specialized drivers.
Partitioned Data Scanners - Scans partitioned datasets and automatically parses partition keys from the file structure.
Partitioned Data Writers - Saves datasets to partitioned files by organizing output into directory structures based on columns.
Remote Function Execution - Runs custom functions and external libraries on remote compute instances by including necessary dependencies.
Resource Allocation - Sets hardware requirements for remote query execution by specifying CPU and memory needs.
Structured Data Schemas - Supports complex schemas, nested structures, and categorical types to ensure data integrity during ETL workflows.
Vectorized Mapping - Processes entire series as single batches to enable efficient vectorized execution.
Window Functions - Performs aggregations on specific groups within a selection context, mapping results back to original rows.
Query Performance Monitoring - Tracks query performance using dashboards displaying real-time metrics and resource usage.
Boolean Logic Engines - Applies boolean and bitwise logic to series to filter and transform data based on complex criteria.
Cluster Node Management - Defines cluster node settings including identifiers, license paths, and memory limits for cluster deployments.
Column Transformation - Appends new columns to datasets by applying expressions while preserving original data.
CSV Processing - Reads and writes CSV files to and from datasets using standard file-based operations.
Data Encoding Optimizations - Optimizes memory usage by representing repeated string data as numeric placeholders.
Data Sinking - Saves large-scale query results directly to cloud storage to support automated data pipelines.
Data Type Casting - Converts the data type of a column to a new format with strict error handling.
Database Connectors - Saves dataset contents to relational database tables using connection strings and native drivers.
Lazy JSON Scanners - Scans newline-delimited JSON files to create lazy computation holders.
Numerical Library Integrations - Executes fast element-wise mathematical operations by applying universal functions directly to columnar data.
Query Schedulers - Manages scheduler operations by defining worker counts and access control policies.
Remote Environment Management - Defines remote compute environments by specifying dependency files for consistent execution.
Vectorized Arithmetic - Executes arithmetic operations between series with automatic broadcasting and missing value handling.
Kubernetes Deployments - Launches clusters on container orchestration platforms using configuration files for resource scheduling.
Runtime Environments - Imports data analysis tools directly into runtime environments for native calculation.

modin-project/modin

10,389 stars on GitHub

Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h

vaexio/vaex

8,506 stars on GitHub

Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin

dask/dask

13,746 stars on GitHub

Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl

pandas-dev/pandas

49,039 stars on GitHub

Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized

pola-rs/polars