# Distributed Dataframe Engines

> Search results for `distributed dataframe engine for big datasets` on awesome-repositories.com. 115 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/distributed-dataframe-engine-for-big-datasets

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/distributed-dataframe-engine-for-big-datasets).**

## Results

- [hosseinmoein/dataframe](https://awesome-repositories.com/repository/hosseinmoein-dataframe.md) (2,917 ⭐) — DataFrame is a C++ tabular data library and manipulation engine designed for managing heterogeneous data in contiguous memory. It functions as a statistical analysis framework and time series analysis toolkit, providing the means to store, index, and transform multidimensional datasets.

The project distinguishes itself through a high-performance execution model that utilizes column-major storage, SIMD-aligned memory allocation, and a thread-pool for parallel computations. It employs a visitor-based algorithm dispatch system and policy-driven transformations to decouple data processing logic f
- [distribution/distribution](https://awesome-repositories.com/repository/distribution-distribution.md) (10,479 ⭐) — Distribution is an open-source container image registry that implements the OCI Distribution Specification, enabling any OCI-compatible client to push, pull, and manage container images over standard protocols. It serves as a content distribution toolkit for packaging, shipping, storing, and delivering container content across networked environments, storing and retrieving content by its cryptographic hash for integrity and deduplication.

The registry separates image metadata from bulk data to enable efficient validation and partial pulls, and supports resumable blob uploads with chunked tran
- [dask/dask](https://awesome-repositories.com/repository/dask-dask.md) (13,746 ⭐) — Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements.

The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
- [huggingface/datasets](https://awesome-repositories.com/repository/huggingface-datasets.md) (21,643 ⭐) — Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams.

The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
- [apache/datafusion](https://awesome-repositories.com/repository/apache-datafusion.md) (8,908 ⭐) — Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules.

The engine distinguishes itself through its modular extension framework, which enables building custom query e
- [rocketlaunchr/dataframe-go](https://awesome-repositories.com/repository/rocketlaunchr-dataframe-go.md) (1,287 ⭐) — DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration
- [apache/seatunnel](https://awesome-repositories.com/repository/apache-seatunnel.md) (9,427 ⭐) — SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance.

The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
- [rare-technologies/gensim](https://awesome-repositories.com/repository/rare-technologies-gensim.md) (16,442 ⭐) — Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data.

The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters.

The library covers a broad range of analysis capabilities
- [eczarny/spectacle](https://awesome-repositories.com/repository/eczarny-spectacle.md) (13,631 ⭐) — Spectacle is a keyboard-driven window manager and organizer that uses system accessibility frameworks to manipulate window coordinates and dimensions. It allows for the arrangement, resizing, and movement of application windows across multiple displays using global keyboard shortcuts.

The tool focuses on multi-monitor layout management, enabling users to shift active windows between connected displays and snap windows into predefined screen regions such as halves, thirds, or corners. It also provides the ability to center and maximize windows to optimize screen real estate without using a mou
- [featuretools/featuretools](https://awesome-repositories.com/repository/featuretools-featuretools.md) (7,655 ⭐) — Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis.

The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
- [oxnr/awesome-bigdata](https://awesome-repositories.com/repository/oxnr-awesome-bigdata.md) (14,454 ⭐) — This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems.

The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
- [f/prompts.chat](https://awesome-repositories.com/repository/f-prompts-chat.md) (163,814 ⭐) — This platform serves as a centralized management system for organizing, refining, and versioning AI instructions and agent skills. It functions as a repository that enables users to store, categorize, and retrieve structured prompts, ensuring consistent performance across various artificial intelligence models. By integrating with the Model Context Protocol, the system allows external AI assistants and development environments to discover and access these instruction libraries directly.

The platform distinguishes itself through its focus on prompt engineering and automated refinement, utilizi
- [sciruby/distribution](https://awesome-repositories.com/repository/sciruby-distribution.md) (51 ⭐) — Probability distributions for Ruby.
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [federatedai/fate](https://awesome-repositories.com/repository/federatedai-fate.md) (6,048 ⭐) — FATE is an open-source federated learning platform that enables multiple organizations to collaboratively train machine learning models without exposing raw data to any party. It provides a complete framework for private data collaboration, allowing participants to jointly compute on sensitive information while maintaining data privacy and security guarantees through secure multi-party computation protocols.

The platform distinguishes itself through its comprehensive infrastructure management capabilities, supporting automated deployment of multi-party clusters using Ansible-driven provisioni
- [dask/distributed](https://awesome-repositories.com/repository/dask-distributed.md) (1,671 ⭐) — A distributed task scheduler for Dask
- [ydataai/ydata-profiling](https://awesome-repositories.com/repository/ydataai-ydata-profiling.md) (13,388 ⭐) — Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments.

The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
- [chainlit/chainlit](https://awesome-repositories.com/repository/chainlit-chainlit.md) (12,213 ⭐) — Chainlit is a Python framework designed for building and deploying interactive, stateful conversational AI interfaces. It provides a backend-driven platform that connects language models and agent frameworks to a web-based chat frontend, managing the complexities of session state, message history, and real-time communication.

The framework distinguishes itself by offering a component-based UI builder that allows developers to inject interactive widgets, rich media, and data visualizations directly into the chat stream. It supports the visualization of complex agent workflows, enabling users t
- [jordipolo/dataframe](https://awesome-repositories.com/repository/jordipolo-dataframe.md) (63 ⭐) — Package providing functionality similar to Python's Pandas or R's data.frame()
- [dair-ai/prompt-engineering-guide](https://awesome-repositories.com/repository/dair-ai-prompt-engineering-guide.md) (75,678 ⭐) — This project is a comprehensive educational resource and technical guide focused on the development, optimization, and application of large language models. It provides a structured curriculum for mastering prompt engineering, ranging from foundational principles of instruction design to advanced techniques for improving model reasoning, accuracy, and reliability.

The guide distinguishes itself by offering deep technical insights into agentic workflows and autonomous system design. It covers the implementation of multi-step reasoning chains, tool integration through function calling, and stat
- [modin-project/modin](https://awesome-repositories.com/repository/modin-project-modin.md) (10,389 ⭐) — Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors.

The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
- [goabstract/marketing-for-engineers](https://awesome-repositories.com/repository/goabstract-marketing-for-engineers.md) (13,153 ⭐) — Marketing-for-Engineers is a product marketing resource library and bootstrapping guide designed for software engineers. It serves as an operational manual for independent creators to start, fund, and manage a sustainable internet business.

The project provides a customer acquisition playbook and a growth hacking toolkit, focusing on validating product-market fit and automating marketing workflows. It includes a content marketing framework that covers SEO, audience research, and distribution channels to convert readers into users.

The library covers a broad range of capability areas, includi
- [src-d/datasets](https://awesome-repositories.com/repository/src-d-datasets.md) (347 ⭐) — source{d} datasets ("big code") for source code analysis and machine learning on source code
- [kananinirav/aws-certified-cloud-practitioner-notes](https://awesome-repositories.com/repository/kananinirav-aws-certified-cloud-practitioner-notes.md) (3,829 ⭐) — This project is a collection of structured study notes and conceptual breakdowns designed for the AWS Certified Cloud Practitioner exam. It serves as a technical reference and study guide, organizing cloud service details and architectural principles to assist in certification preparation.

The knowledge base is built using markdown files and includes curated cheat sheets and interactive mind-map visualizations. These tools map complex certification topics into visual hierarchies to enable drill-down study paths and rapid revision.

The materials cover a wide range of cloud capabilities, inclu
- [leonardomso/33-js-concepts](https://awesome-repositories.com/repository/leonardomso-33-js-concepts.md) (66,467 ⭐) — This project is a comprehensive educational repository designed to help developers master the core mechanics, runtime behaviors, and browser-native capabilities of the JavaScript language. It provides a structured knowledge base that covers fundamental language features, such as prototype-based inheritance and event-loop-based concurrency, alongside advanced topics like JIT-compiled execution and memory management.

The repository distinguishes itself by offering deep-dive technical guides that bridge the gap between abstract language concepts and practical browser implementation. It features
- [apache/hadoop](https://awesome-repositories.com/repository/apache-hadoop.md) (15,567 ⭐) — Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster.

The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
- [conardli/easy-dataset](https://awesome-repositories.com/repository/conardli-easy-dataset.md) (13,394 ⭐) — Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points.

The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
- [lisadziuba/marketing-for-engineers](https://awesome-repositories.com/repository/lisadziuba-marketing-for-engineers.md) (13,153 ⭐) — Marketing-for-Engineers is a curated knowledge base and set of conceptual guides designed to help developers implement growth strategies, product marketing, and user acquisition methods. It serves as a structured resource for learning how to acquire initial users and scale digital products.

The project provides specific frameworks for content marketing, user acquisition strategies, and marketing automation. It includes guides for creating search engine optimized articles, executing cold outreach, and utilizing influencer partnerships to gain traction.

The repository covers a broad range of g
- [gocolly/colly](https://awesome-repositories.com/repository/gocolly-colly.md) (25,101 ⭐) — Colly is a high-performance web scraping framework designed for the automated extraction of structured data from websites. It provides a programmable toolkit that manages the complexities of large-scale data collection, including concurrent request orchestration, automatic cookie handling, and robots.txt compliance. By utilizing an asynchronous execution model, the engine maintains high throughput while preventing resource exhaustion during recursive or distributed crawling tasks.

The framework is distinguished by its modular, event-driven architecture, which allows developers to hook into sp
- [nodesource/distributions](https://awesome-repositories.com/repository/nodesource-distributions.md) (13,834 ⭐) — This project is a Node.js binary distribution repository and Linux package repository. It provides a hosted set of pre-compiled JavaScript runtime binaries for various Linux distributions to simplify installation and version management through native package managers.

The project includes a Node.js observability toolset and security policy manager. These components enable the gathering of runtime telemetry to monitor application health and performance via diagnostic dashboards, while providing a resource restriction layer that intercepts system calls to prevent unauthorized modules from acces
- [apache/spark](https://awesome-repositories.com/repository/apache-spark.md) (43,467 ⭐) — Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine.

The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets.

The engine incorporates relational query e
- [docker/distribution](https://awesome-repositories.com/repository/docker-distribution.md) (10,474 ⭐) — This project is a container image registry and server-side storage system designed to house container images, layers, and manifests. It functions as an OCI compliant registry server that adheres to the Open Container Initiative Distribution Specification to store and deliver content over HTTP.

The system provides a self-hosted solution for managing private libraries of container images within professional-grade infrastructure. It is designed to enable the development of custom registries by extending a base toolkit with specialized libraries and business logic.

The registry covers image dist
- [databricks/learning-spark](https://awesome-repositories.com/repository/databricks-learning-spark.md) (3,899 ⭐) — This project is a learning curriculum and programming guide for Apache Spark, providing a structured set of educational resources and practical code examples for mastering distributed data processing. It serves as a course for building scalable data workflows and big data engineering pipelines.

The repository provides practical source code and project layouts that demonstrate how to connect external data stores, process streaming data, and organize code for distributed environments. It includes implementation examples for scaling machine learning algorithms across clusters to handle large tra
- [sindresorhus/awesome](https://awesome-repositories.com/repository/sindresorhus-awesome.md) (476,211 ⭐) — This project is a community-maintained directory that serves as a comprehensive index of software tools, frameworks, and educational materials. It functions as an open-source knowledge base, organizing diverse engineering domains and technical resources into a structured taxonomy to assist developers in discovering high-quality content.

The directory distinguishes itself through a decentralized peer-review model, where independent contributors curate, verify, and update entries to ensure accuracy and relevance. All information is stored in a version-controlled, flat-file markdown format, whic
- [cockroachdb/cockroach](https://awesome-repositories.com/repository/cockroachdb-cockroach.md) (32,207 ⭐) — Cockroach is a distributed SQL database designed to scale horizontally across multiple nodes while maintaining strict ACID compliance and global data consistency. It functions as a relational database engine that automatically partitions data into ranges, rebalancing them across a cluster to accommodate growing storage and throughput requirements. By utilizing a distributed consensus protocol, the system ensures that all nodes agree on the order of operations, providing fault tolerance and continuous availability even in the event of hardware failures.

The system distinguishes itself through
- [benstew/blockchain-for-software-engineers](https://awesome-repositories.com/repository/benstew-blockchain-for-software-engineers.md) (800 ⭐) — Inspired by Google Interview University, Machine Learning for Software Engineers, The Authoritative Guide to Blockchain Development
- [ray-project/ray](https://awesome-repositories.com/repository/ray-project-ray.md) (42,895 ⭐) — Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls.

The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
- [camel-ai/camel](https://awesome-repositories.com/repository/camel-ai-camel.md) (17,253 ⭐) — This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer.

The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
- [juliastats/distributions.jl](https://awesome-repositories.com/repository/juliastats-distributions-jl.md) (1,193 ⭐) — A Julia package for probability distributions and associated functions.
- [apache/beam](https://awesome-repositories.com/repository/apache-beam.md) (8,612 ⭐) — Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model.

The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
- [tangxiangmin/cocos-big-watermelon](https://awesome-repositories.com/repository/tangxiangmin-cocos-big-watermelon.md) (201 ⭐) — big watermelon by cocos
- [langfuse/langfuse](https://awesome-repositories.com/repository/langfuse-langfuse.md) (29,190 ⭐) — Langfuse is an open-source observability and evaluation platform designed for language model applications. It provides a centralized system for tracking execution traces, monitoring performance metrics, and managing prompt templates. By capturing hierarchical units of work and telemetry data, the platform enables developers to debug complex application lifecycles and analyze token usage, latency, and model interactions in production environments.

The platform distinguishes itself through an integrated evaluation framework that allows for systematic benchmarking and automated scoring of model
- [eventual-inc/daft](https://awesome-repositories.com/repository/eventual-inc-daft.md) (5,225 ⭐) — Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types.

The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
- [eto-ai/lance](https://awesome-repositories.com/repository/eto-ai-lance.md) (6,671 ⭐) — Lance is a versioned columnar data format and storage engine designed as a multimodal AI lakehouse. It serves as a vector database storage engine and a cloud object store dataset manager, organizing images, video, audio, and embeddings into a unified format optimized for machine learning workflows.

The project distinguishes itself by combining a columnar layout for structured data with a specialized blob store for large multimodal tensors. It implements a hybrid search engine that integrates vector similarity search, full-text search, and SQL analytics on a single dataset, supported by a stor
- [tensorflow/datasets](https://awesome-repositories.com/repository/tensorflow-datasets.md) (4,575 ⭐) — TensorFlow Datasets provides many public datasets as tf.data.Datasets.
- [b4rtaz/distributed-llama](https://awesome-repositories.com/repository/b4rtaz-distributed-llama.md) (2,837 ⭐) — Distributed-llama is a distributed inference engine and command line tool for running large language models across multiple networked machines. It functions as a compute cluster manager that coordinates worker nodes to share the computational load of a single model.

The system utilizes tensor parallelism to shard model weights across different hosts, allowing the execution of models that exceed the memory capacity of a single piece of hardware. It includes a dedicated format converter to transform standard model files into a compatible binary layout optimized for distributed loading.

The eng
- [juliastats/dataframes.jl](https://awesome-repositories.com/repository/juliastats-dataframes-jl.md) (1,830 ⭐) — In-memory tabular data in Julia
- [donnemartin/system-design-primer](https://awesome-repositories.com/repository/donnemartin-system-design-primer.md) (353,387 ⭐) — This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It provides a structured curriculum for mastering the principles of scalability, reliability, and performance required to design complex software systems.

The repository distinguishes itself by offering a methodical approach to technical interview preparation, incorporating design patterns, architectural trade-offs, and spaced repetition tools to help users retain complex concepts. It emphasizes constraint-driven analysis, teaching users how to ev
- [databricks/spark-the-definitive-guide](https://awesome-repositories.com/repository/databricks-spark-the-definitive-guide.md) (3,099 ⭐) — This project is an educational resource and technical manual for Apache Spark, focused on the architecture and practical application of large-scale data processing. It serves as a guide for big data engineering and distributed computing, covering the principles of parallel processing and fault-tolerant data distribution.

The material provides instructional content on designing distributed ETL pipelines and implementing data analysis workflows. It includes tutorials for polyglot data processing, offering patterns and examples for using Python, Scala, and Java within a unified environment.

The
- [juliadata/dataframes.jl](https://awesome-repositories.com/repository/juliadata-dataframes-jl.md) (1,830 ⭐) — In-memory tabular data in Julia
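
Several entries above, the dask/dask one in particular, describe representing tasks and their dependencies as a directed acyclic graph and deferring work until a result is explicitly requested. The idea can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not Dask's API: the graph layout, `evaluate`, `partition_sum`, and `add` are all hypothetical names invented for this example.

```python
from typing import Any, Dict, Optional

# Hypothetical graph layout: each key maps either to a literal value or to a
# tuple of (function, dependency_key, dependency_key, ...).
Graph = Dict[str, Any]

executed = []  # records which tasks actually ran, to make laziness visible


def partition_sum(values):
    executed.append("partition_sum")
    return sum(values)


def add(a, b):
    executed.append("add")
    return a + b


def evaluate(graph: Graph, key: str, cache: Optional[Dict[str, Any]] = None) -> Any:
    """Recursively compute `key`, running only the tasks it depends on."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *deps = task
        args = [evaluate(graph, dep, cache) for dep in deps]
        result = func(*args)
    else:
        result = task  # a literal value, e.g. one partition of the data
    cache[key] = result
    return result


# Building the graph runs nothing; it only describes the work.
graph: Graph = {
    "part-0": [1, 2, 3],
    "part-1": [4, 5, 6],
    "sum-0": (partition_sum, "part-0"),
    "sum-1": (partition_sum, "part-1"),
    "total": (add, "sum-0", "sum-1"),
}
assert executed == []

# Work happens only when a result is requested.
total = evaluate(graph, "total")
print(total)  # 21
```

A real scheduler would additionally decide where each task runs and execute independent tasks (here `sum-0` and `sum-1`) in parallel; this sketch evaluates them sequentially in one process.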