30 open-source projects similar to oxnr/awesome-bigdata, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Awesome Bigdata alternative.
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
This project is a community-maintained directory that serves as a comprehensive index of software tools, frameworks, and educational materials. It functions as an open-source knowledge base, organizing diverse engineering domains and technical resources into a structured taxonomy to assist developers in discovering high-quality content. The directory distinguishes itself through a decentralized peer-review model, where independent contributors curate, verify, and update entries to ensure accuracy and relevance. All information is stored in a version-controlled, flat-file markdown format, whic
This project is a curated directory and reference library of open-source Python applications. It serves as a comprehensive index designed to help developers study real-world software architecture, design patterns, and practical implementation strategies through a diverse collection of community-driven projects. The repository distinguishes itself by focusing on the analysis of production-ready software patterns rather than providing a single tool. It offers a structured way to explore how complex features, such as modular plugin systems, configuration management, and various deployment strate
This project serves as a comprehensive, community-driven directory for the .NET ecosystem. It functions as a curated index of high-quality libraries, frameworks, and tools designed to assist developers in identifying recommended solutions for a wide range of project requirements and software engineering tasks. The repository distinguishes itself by providing a categorized catalogue that simplifies the discovery of resources across the entire .NET development lifecycle. By maintaining a standardized collection of community-contributed utilities, it helps developers navigate the ecosystem to fi
This project is a curated resource repository and learning platform dedicated to creative coding, generative art, and graphics programming. It serves as a comprehensive directory for developers and artists seeking to build interactive media, procedural visuals, and real-time digital experiences. The collection distinguishes itself by aggregating a wide range of technical tools, frameworks, and educational materials necessary for mastering creative technology. It provides access to specialized graphics libraries, shader editors, and hardware interfaces, while also offering guidance on the math
This project serves as a comprehensive resource hub and curated directory for the FastAPI web framework ecosystem. It provides developers with a centralized collection of community-vetted libraries, tools, and best practices designed to support the development, testing, and deployment of scalable web services using modern Python. The repository distinguishes itself by aggregating resources that address the full lifecycle of high-performance API development. It covers essential capabilities including project scaffolding, database integration, and the implementation of real-time communication p
This project is a curated directory of resources, libraries, and frameworks designed to support the development, training, and deployment of neural network models. It serves as a comprehensive guide for navigating the machine learning ecosystem, providing structured access to software utilities and research materials. The directory distinguishes itself by aggregating tools across the entire machine learning lifecycle, ranging from data management and experiment tracking to production-ready model deployment. It functions as a central hub for discovering both foundational academic research and
Databend is a cloud-native data warehouse and OLAP database designed for large-scale analytics. It functions as a SQL-compliant engine and serverless analytics platform that separates compute from storage to allow for independent scaling. The system integrates vector database capabilities, indexing high-dimensional embeddings to enable semantic, hybrid, and full-text searches across massive datasets. It further distinguishes itself through serverless compute management that automatically scales resources based on demand and shuts them down during idle periods. The platform covers a broad set
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
TDengine is a distributed time-series database designed for the high-speed ingestion, compression, and retrieval of timestamped metrics and sensor data. It functions as a SQL-compatible analytics engine, allowing users to perform complex operations on massive volumes of time-ordered information using standard relational syntax. The platform is built to serve as a backend foundation for industrial IoT environments, managing real-time data streams and device metadata through a cluster-based architecture. The system distinguishes itself through a distributed sharding architecture that uses consi
This project is a comprehensive learning resource and reference guide for software architecture and distributed systems design. It serves as a structured curriculum for engineers to study fundamental architectural patterns, scalability strategies, and distributed computing theory, specifically tailored to prepare for technical interviews and professional engineering roles. The repository distinguishes itself by providing a curated collection of industry-standard infrastructure tools and methodologies. It covers the selection and implementation of technologies for data storage, message brokeri
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
This project is a community-maintained, open-access directory of high-quality public datasets. It serves as a centralized reference point for researchers, developers, and data scientists to locate reliable information sources across a wide spectrum of industries and scientific fields. By providing a structured index, the repository facilitates the discovery of data necessary for exploratory analysis, machine learning model training, and the development of data-intensive applications. The directory distinguishes itself through a lightweight, platform-agnostic approach to resource indexing that
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical domain to simplify discovery across the artificial intelligence ecosystem. The collection distinguishes itself by providing a cross-language development index that spans diverse programming environments, including C, C++, Rust, Clojure, and Python. It covers a wide range of specialized capabilities, fr
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
OpenSearch is a distributed search and analytics engine designed for indexing, searching, and analyzing massive volumes of structured and unstructured data in real time. It functions as a comprehensive platform that integrates enterprise-grade search capabilities, a vector database for high-dimensional similarity lookups, and a unified observability suite for monitoring logs, metrics, and traces across complex distributed environments. The platform distinguishes itself through its support for agentic workflow automation, allowing users to orchestrate multi-agent tasks and integrate foundation
This project is a centralized directory and resource manager for artificial intelligence-driven software engineering tools. It functions as a curated registry that aggregates extensions, automated workflows, and development resources, providing a structured database for developers to discover and implement AI-assisted coding solutions. The system distinguishes itself through a suite of automated maintenance utilities that ensure the integrity and currency of the curated data. It employs background processes to validate external links, synchronize remote repository information, and manage reso
This project is a community-driven knowledge repository and software resource directory focused on artificial intelligence and professional productivity tools. It functions as a markdown-based knowledge base that organizes information into a hierarchical taxonomy, allowing users to discover, compare, and evaluate software solutions based on specific business and technical requirements. The platform distinguishes itself through a decentralized peer-review model, where the directory is maintained and updated by the community via a pull-request workflow. This collaborative approach ensures that
HelloGitHub is a centralized discovery platform and technical knowledge repository designed to help developers identify high-quality open-source projects, libraries, and infrastructure. It functions as a structured directory that aggregates specialized development tools and educational materials, organizing them by technical domain to facilitate efficient resource discovery and professional development. The platform distinguishes itself through a community-driven curation workflow, where manual editorial oversight filters the broader software ecosystem into thematic collections. This content
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
This project functions as a curated software directory and developer resource index, providing a centralized platform for discovering and evaluating high-quality open-source repositories. It serves as an aggregator that monitors trending software and educational resources, organizing them by technical domain and programming language to assist developers in identifying tools for their specific technical challenges. The directory distinguishes itself through a community-driven curation workflow, where repository lists are validated and updated based on collective developer consensus. This infor
This project serves as a curated directory and resource hub for developers working with generative artificial intelligence. It provides a comprehensive index of open-source software solutions, frameworks, and project examples designed to help users discover and implement advanced AI systems. The repository focuses on practical implementations of agentic, multimodal, and retrieval-augmented generation architectures. It highlights tools for building conversational assistants, voice-enabled agents, and automated workflows that leverage large language models. By showcasing diverse technical domai
Elasticsearch is a distributed search engine and document store designed for the high-performance indexing and retrieval of massive volumes of unstructured data. It functions as a centralized analytics platform, providing a schema-flexible architecture that organizes information into searchable indices while maintaining global cluster state through a distributed consensus mechanism. The platform distinguishes itself through its integrated approach to observability, security, and advanced analytics. It combines full-text, vector, and hybrid search capabilities with machine learning-driven insi
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
ParadeDB is a database extension that integrates full-text search, vector database capabilities, and real-time analytics directly into a relational engine. It functions as a plugin that adds new storage and query execution capabilities to an existing database architecture. The project distinguishes itself by supporting hybrid search workflows that combine lexical keyword matching with dense and sparse vector similarity in a single query. It utilizes reciprocal rank fusion to merge these ranked result sets and employs logical replication to synchronize data from external instances, removing th