61 repositories
Frameworks and methodologies for analyzing and processing massive volumes of data across distributed systems.
Explore 61 awesome GitHub repositories matching data & databases · Big Data Processing. Refine with filters or upvote what's useful.
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Provides educational resources for performing large-scale distributed computing and file storage operations.
simdjson is a high-performance JSON parser that utilizes SIMD instructions to process gigabytes of data per second. It functions as a SIMD JSON parser, a multithreaded NDJSON processing library, a UTF-8 validation engine, and a tool for JSON minification and string building. The project focuses on high-throughput data processing, enabling the ingestion of massive JSON volumes and the verification of UTF-8 encoding standards. It includes dedicated capabilities for constructing JSON strings with optimized memory usage and removing unnecessary whitespace from documents to reduce file size. The
Accelerates the ingestion of massive NDJSON volumes into analytics engines via multithreaded processing.
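As a rough illustration of the multithreaded NDJSON processing and minification ideas described above, the sketch below splits newline-delimited JSON into batches, parses each batch on a worker thread, and strips whitespace from a single document. It uses only Python's standard `json` module, not simdjson's own API; the helper names (`parse_batch`, `parse_ndjson`, `minify`), the batch size, and the sample records are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_batch(lines):
    """Parse one batch of NDJSON lines into Python objects."""
    return [json.loads(line) for line in lines]


def parse_ndjson(text, batch_size=2, workers=4):
    """Parse NDJSON text batch by batch on a thread pool (hypothetical helper)."""
    # NDJSON holds one JSON document per non-empty line.
    lines = [line for line in text.splitlines() if line.strip()]
    batches = [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves batch order, so records come back in input order.
        parsed = pool.map(parse_batch, batches)
        return [record for batch in parsed for record in batch]


def minify(document):
    """Remove insignificant whitespace from a single JSON document."""
    return json.dumps(json.loads(document), separators=(",", ":"))


sample = '{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n\n{"id": 3, "ok": true}\n'
records = parse_ndjson(sample)
compact = minify('{ "id": 1,  "tags": [ "a", "b" ] }')
```

In CPython the global interpreter lock limits how much threads can speed up parsing, so this shows only the batching structure, not the throughput a SIMD parser would reach.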
BigData-Notes is a big data learning resource and data engineering knowledge base. It provides a collection of guides, technical references, and documentation focused on the installation and configuration of distributed data processing technologies. The project covers a learning path for distributed systems, including the setup of large-scale data storage and computing clusters. It specifically addresses both batch and stream processing workflows and the implementation of data APIs for interacting with distributed messaging and storage systems. The materials are organized using markdown-base
Provides detailed instructions for setting up big data software stacks on servers for large-scale processing.
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Provides the primary infrastructure for managing, storing, and processing massive volumes of data across distributed systems.
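The parallel programming model mentioned above is MapReduce. A minimal single-process sketch of its three stages (map, shuffle, reduce) is shown below as a word count. It is not Hadoop's Java API; the function names and the input splits are hypothetical, and a real cluster would run each split on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits standing in for blocks of a distributed file.
splits = ["big data big clusters", "data across clusters", "big files"]

mapped = [pair for chunk in splits for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on its own key, both stages can be distributed without coordination between workers.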
Scala is a statically typed programming language and compiler that combines object-oriented and functional programming paradigms. It serves as a cross-platform runtime language capable of targeting the Java Virtual Machine and JavaScript to share logic between backend servers and web frontends. The project provides a functional programming framework with immutable data structures and higher-order functions to build reliable concurrent and distributed applications. It distinguishes itself through deep interoperability with Java and JavaScript ecosystems and the ability to transform code into n
Provides frameworks for analyzing and processing massive volumes of data across distributed systems.
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Ships a web-based explorer for composing schemas, importing data, and visually exploring graph relationships.
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Documents Azure's big data services like HDInsight and Synapse Analytics for processing massive datasets.
FiftyOne is a visual tool for curating, analyzing, and managing image and video datasets for training machine learning models. It serves as a platform for identifying annotation errors, refining ground-truth labels, and evaluating vision model performance by comparing predictions against ground truth to identify failure modes. The system operates as a containerized data platform that supports team collaboration on large-scale visual datasets in a cloud environment. It includes specialized capabilities for exploring high-dimensional embeddings to discover data clusters and retrieve the corresponding visual samples. The platform covers a broad range of capabilities, including 2D and 3D data annotation, dataset quality validation, and visual data exploration. It integrates with deep learning frameworks to move data from curation to model training, and uses a document-based metadata store to manage dataset structures.
Provides an interactive visual interface for browsing and analyzing large-scale image and video datasets.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Handles the lifecycle of Iceberg tables, including catalog management and automated compaction.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Creates and manages tables that simultaneously store vector embeddings and scalar metadata.
Iceberg is an open table format and big data table manager designed for huge analytic datasets in cloud storage. It provides a specification for tracking large-scale datasets to maintain transactional consistency and structural integrity. The project utilizes a standardized REST catalog interface to manage table metadata, ensuring interoperability between different compute engines. This allows diverse query engines to connect to a single table interface and maintain consistency across different processing frameworks. Its core capabilities include managing large-scale analytic tables, coordin
Provides a comprehensive system for managing massive analytic datasets and coordinating concurrent read/write operations across multiple engines.
This project is a cloud data analysis sandbox and a collection of courseware designed for learning data analysis techniques on Google Cloud Platform. It serves as a training lab containing technical demonstrations and practical exercises for skill development and cloud certification. The repository provides guided labs and demonstrations focused on Google Cloud data analysis, encompassing technical training for the platform's specific data services. It enables the practice of cloud data engineering and the use of big data tooling to perform queries and data transformations. The environment s
Enables practice with cloud-native big data tools for performing complex queries and data transformations.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Provides a system for analyzing and visualizing billions of rows of tabular data within interactive notebooks.
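The lazy-evaluation and memory-mapping approach described above can be sketched with the Python standard library alone. The example below is not Vaex's API: `LazyColumn` is a hypothetical class, and the five-value file stands in for a dataset too large to load into RAM. Transformations are recorded when declared and applied only when a result is requested, while values are read from the mapped file one at a time.

```python
import mmap
import os
import struct
import tempfile


class LazyColumn:
    """A column of float64 values backed by a memory-mapped file (hypothetical)."""

    def __init__(self, buffer, transforms=()):
        self._buffer = buffer
        self._transforms = transforms  # deferred element-wise functions

    def map(self, func):
        # Record the transformation; nothing is computed yet.
        return LazyColumn(self._buffer, self._transforms + (func,))

    def __iter__(self):
        # Values are decoded from the mapped file only when iterated.
        for offset in range(0, len(self._buffer), 8):
            (value,) = struct.unpack_from("<d", self._buffer, offset)
            for func in self._transforms:
                value = func(value)
            yield value

    def sum(self):
        # Evaluation is triggered here, streaming one value at a time.
        return sum(self)


# Write a small binary file of doubles to stand in for a large on-disk dataset.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<5d", 1.0, 2.0, 3.0, 4.0, 5.0))

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    column = LazyColumn(mapped)
    doubled = column.map(lambda x: x * 2)   # deferred
    shifted = doubled.map(lambda x: x + 1)  # still deferred
    total = shifted.sum()                   # computed now
```

Only the pages of the file that are actually touched need to be resident in memory, which is what lets this pattern scale past available RAM.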
Moto is a cloud service mocking framework and API mock server that simulates AWS infrastructure locally. It allows developers to test cloud-dependent code and verify infrastructure-as-code templates without deploying real resources or incurring costs. The project functions as an SDK interceptor that can patch existing service clients to redirect requests to a local mock environment. It can also be run as a standalone HTTP server, enabling any programming language to interact with the simulated endpoints. The framework covers a vast array of simulated capabilities, including data storage, com
Simulates the organization and coordination of massive datasets via table and namespace management.
This project is a collection of pre-configured Docker images that provide ready-to-run environments for interactive computing and data science. It functions as a scientific computing stack and a polyglot notebook server, bundling language interpreters and libraries for Python, R, and Julia within a containerized system to ensure reproducible research environments. The collection uses a layered image hierarchy to provide versioned software dependencies and support for hardware acceleration across different CPU architectures. It allows for the creation of custom images based on a foundation of
Provides distributed binaries and language support for large-scale data processing with Spark.
A/B Street is an open-source traffic simulation and urban planning tool that models how cars, bikes, and pedestrians move through real-world street networks. It imports data from OpenStreetMap to build detailed, lane-level road models, then runs discrete-event simulations to analyze travel times, delays, and congestion patterns across different infrastructure scenarios. The project provides an interactive map editor for modifying road geometry, lane configurations, traffic signals, and access restrictions, with full undo/redo support. Users can design low-traffic neighborhoods by placing moda
Displays per-agent routes, scatter plots of intersection delays, and sortable trip tables for aggregate analysis of simulation results.
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
Covers frameworks and methodologies for splitting and processing massive datasets concurrently across distributed systems.
DevOps-Bash-tools is a collection of shell scripts and aliases designed to automate cloud infrastructure, container orchestration, and CI/CD pipelines. It provides a comprehensive toolset for managing operational workflows through the command line. The project specializes in automating tasks across multiple platforms, including managing namespaces and secrets in Kubernetes, auditing resources in AWS and GCP, and triggering builds or managing environment variables in GitHub Actions, GitLab CI, and CircleCI. It also includes a toolkit for interacting with container registries to query manifests
Simplifies connectivity and metadata extraction for big data components.
vis is a JavaScript data visualization library used to render interactive networks, timelines, and graphs directly in the web browser. It functions as a relational data mapper and browser-based charting tool, turning complex structured data into dynamic visual patterns to expose entity relationships. The library provides specialized tools for force-directed network graphs, where relational data is represented as interactive nodes and edges. It also includes an interactive timeline component for plotting chronological events and time intervals on a scalable temporal axis. The project covers b
Enables graphical analysis and exploration of complex relational datasets through interactive network visualizations.
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
Connects with distributed processing systems to handle large datasets and incorporate machine learning workflows.