12 Repos
Support for loading and processing tabular data using dataframe abstractions.
Distinguishing note: Focuses on dataframe-specific loading rather than general file parsing.
12 GitHub repositories matching Data & Databases · Dataframe Engines.
Nushell is a cross-platform shell and programming language designed to treat all input and output as structured data rather than raw text streams. By enforcing data types and command signatures, it provides a consistent environment for building robust, pipeline-oriented workflows. The shell allows users to chain commands that pass structured objects between stages, enabling complex data processing and automation tasks that remain predictable across different operating systems. What distinguishes the project is its focus on interactive data exploration and modular extensibility. Users can quer
Supports importing data as eager or lazy dataframes for optimized query execution.
This project serves as a comprehensive textbook and educational resource for data analysis using the Python ecosystem. It provides a structured guide to manipulating, cleaning, and processing datasets, focusing on the core tools required for numerical computing and statistical analysis. The repository distinguishes itself by offering a collection of practical code examples and workflows that demonstrate how to perform complex data tasks. It covers the application of vectorized numerical computations, the management of time-indexed data, and the creation of statistical visualizations to commun
Provides dataframe-based relational modeling for filtering, joining, and aggregating structured datasets.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Provides engines for cleaning and transforming tabular data using dataframe abstractions.
Pygwalker is a library that transforms tabular data into interactive, drag-and-drop interfaces for exploratory analysis and visualization. It functions as a grammar-based framework that translates user interactions into declarative chart definitions, allowing for the creation of dynamic data exploration environments directly within notebooks or embedded web applications. The system distinguishes itself by offloading heavy analytical computations to backend kernels, which maintains responsiveness when visualizing large datasets. It supports the serialization of visual states into portable conf
Provides interactive drag-and-drop visualization capabilities specifically for dataframe-based tabular data.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Provides a distributed dataframe engine for loading and processing tabular data that exceeds system memory.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Ingests pandas DataFrames directly into tables to bridge vector storage and data analysis workflows.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Provides a lazy DataFrame API for building and executing analytic queries programmatically.
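To make "lazy" concrete, here is a minimal sketch in plain Python. It does not use DataFusion and is not its API: `LazyFrame`, `filter`, `select`, and `collect` are hypothetical names chosen for illustration. Each call only records a step in a plan, and nothing runs until `collect()` is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical lazy frame: records steps, runs nothing until collect()."""
    rows: tuple          # source rows, each a dict of column -> value
    steps: tuple = ()    # recorded plan: ("filter", fn) or ("select", cols)

    def filter(self, predicate):
        # Return a new frame with one more recorded step; no rows are touched.
        return LazyFrame(self.rows, self.steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self.rows, self.steps + (("select", columns),))

    def collect(self):
        # Execute the recorded plan in order and materialize the result.
        out = list(self.rows)
        for kind, arg in self.steps:
            if kind == "filter":
                out = [row for row in out if arg(row)]
            else:
                out = [{col: row[col] for col in arg} for row in out]
        return out


# Illustrative data only.
data = (
    {"city": "Oslo", "temp": 4},
    {"city": "Cairo", "temp": 31},
    {"city": "Lima", "temp": 19},
)
plan = LazyFrame(data).filter(lambda row: row["temp"] > 10).select("city")
result = plan.collect()
```

Because the plan is available as data before execution, a real engine can inspect and reorder it (for example, pushing filters closer to the data source) before running anything. This sketch simply runs the steps in the order they were recorded.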
VisiData is a terminal-based interactive data analysis tool and browser designed for exploring, filtering, and sorting large tabular datasets. It functions as a structured data inspector that loads and flattens complex formats like JSON, XML, and PCAP into interactive sheets, as well as a terminal file manager for navigating directories and performing staged filesystem operations. The project distinguishes itself by rendering data visualizations, such as scatter plots and histograms, directly in the terminal using Unicode Braille characters. It provides a Python-based data wrangling environme
Integrates with pandas dataframe abstractions to load and process complex tabular data.
VectorBT is a vectorized trading strategy backtesting framework that simulates thousands of strategy configurations in a single pass over historical price data. It operates as a parameter optimization engine, a portfolio performance analyzer, a technical indicator calculator, and a financial data fetcher, all built around a DataFrame-centric data model that uses NumPy broadcasting for signal alignment and compiled code acceleration for performance. The framework distinguishes itself through its ability to run large-scale parameter sweeps by constructing every combination of strategy parameter
Represents all financial time series, signals, and portfolio states as pandas DataFrames.
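The idea of a parameter sweep can be sketched with the standard library alone. This is not VectorBT code: the helper names `sma` and `crossover_signals`, the price series, and the window values are assumptions made for illustration. It also loops over combinations in plain Python instead of using NumPy broadcasting, so it shows how the grid is built, not how fast it runs.

```python
from itertools import product


def sma(prices, window):
    """Simple moving average; None until the window is full."""
    return [
        None if i + 1 < window else sum(prices[i + 1 - window:i + 1]) / window
        for i in range(len(prices))
    ]


def crossover_signals(prices, fast, slow):
    """True wherever the fast average is above the slow one (hypothetical rule)."""
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    return [
        a is not None and b is not None and a > b
        for a, b in zip(fast_ma, slow_ma)
    ]


# Illustrative inputs only.
prices = [10, 11, 12, 11, 13, 14, 13, 15]
fast_windows = [2, 3]
slow_windows = [4, 5]

# One signal series per (fast, slow) combination, keyed by its parameters.
grid = {
    (fast, slow): crossover_signals(prices, fast, slow)
    for fast, slow in product(fast_windows, slow_windows)
}
```

Keying each result by its parameter tuple is analogous to labelling DataFrame columns by parameter combination, which keeps every configuration aligned on the same time index.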
This is a pandas-based technical analysis library and financial feature engineering tool. It serves as a vectorized indicator calculator that transforms raw price and volume data into derived metrics for time series analysis. The library uses a NumPy-based engine to perform mathematical operations across entire arrays, avoiding iterative loops to maintain high performance. It organizes technical indicators into a modular class hierarchy with a consistent interface, allowing for bulk feature generation and the direct appending of results as new columns to a pandas DataFrame. The system covers
Integrates computed indicators directly into pandas DataFrame structures while preserving time series alignment.
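A minimal sketch of "append the indicator as a new column while preserving alignment", using a column-oriented dict in place of a pandas DataFrame. The `momentum` helper, the column name `mom_2`, and the sample values are assumptions for illustration, not part of the library described above.

```python
def momentum(close, period):
    """Difference between each close and the close `period` rows earlier.

    The first `period` rows have no earlier value, so they are padded with
    None. Padding keeps the output the same length as the input, so row i
    of the indicator still refers to row i of the price data.
    """
    return [
        None if i < period else close[i] - close[i - period]
        for i in range(len(close))
    ]


# Column-oriented table standing in for a DataFrame (illustrative data).
table = {
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "close": [10.0, 11.0, 12.0, 11.0, 13.0],
}

# Append the derived metric as a new column next to the raw data.
table["mom_2"] = momentum(table["close"], period=2)
```

With pandas, the same padding role is normally played by NaN values at the start of the column.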
This is a library of cryptocurrency trading algorithms and technical analysis strategies designed for use with the Freqtrade trading bot. The project provides a collection of pre-defined rules and mathematical indicators used to automate the buying and selling of digital assets. The repository focuses on algorithmic trading strategies and bot-driven asset management to remove manual execution from cryptocurrency trades. It enables quantitative trading analysis by allowing the development and testing of rule-based logic against historical market data. The system utilizes class-based strategy
Uses pandas dataframes to perform vectorized calculations on historical candle data for fast technical indicator analysis.
XlsxWriter is a library for generating spreadsheets in the XLSX format, functioning as an Excel workbook writer and file generator. It provides the capability to write data, apply cell formatting, and build complex layouts across multiple worksheets. The project distinguishes itself with a memory-optimized writing mode that flushes large datasets to disk row-by-row, enabling the creation of files exceeding 4 GB while minimizing RAM consumption. It also includes a specialized mechanism for embedding binary project files and digital signatures to enable VBA macros and signed scripts within work
Supports writing external dataframes to specific worksheets and exact cell coordinates.