13 repositories
Platforms designed for large-scale data storage and high-performance analytical query execution.
Explore 13 GitHub repositories matching Data & Databases · Data Warehousing.
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through ad
Enables storage and analysis of large-scale datasets with high-performance query execution and optimized infrastructure costs.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Provides platforms designed for large-scale data storage and high-performance analytical query execution.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Handles thousands of simultaneous analytical queries per second for enterprise-scale workloads.
Databend is a cloud-native data warehouse and OLAP database designed for large-scale analytics. It functions as a SQL-compliant engine and serverless analytics platform that separates compute from storage to allow for independent scaling. The system integrates vector database capabilities, indexing high-dimensional embeddings to enable semantic, hybrid, and full-text searches across massive datasets. It further distinguishes itself through serverless compute management that automatically scales resources based on demand and shuts them down during idle periods. The platform covers a broad set
Implements a serverless data warehouse architecture that scales compute automatically and separates it from storage.
Connect is a Kafka data integration platform and stream processing engine used to build declarative pipelines that move and transform messages between Kafka topics and external sources. It functions as a Kafka Connect framework and a change data capture tool, streaming real-time database modifications to synchronize data across distributed environments. The project differentiates itself through a dedicated mapping language for mutating and reshaping message payloads and the ability to execute custom processing logic within a sandboxed WebAssembly runtime. It also provides an observability pip
Syncs streaming data to large-scale analytics warehouses and table catalogs for high-performance analytical queries.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Unifies real-time streaming and historical batch datasets into a single queryable interface for consistent business intelligence.
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Provides a SQL-on-Hadoop data warehouse for querying petabytes of distributed data.
JanusGraph is a distributed, elastically scalable graph database designed to store and query highly connected data across a cluster of machines. It supports the property graph data model with ACID consistency and integrates multi-model search capabilities including geo, numeric range, and full-text queries. The database also includes a Graph OLAP engine for running batch analytics and global graph computations on large datasets using the Hadoop framework. The project distinguishes itself through a masterless cluster architecture that eliminates single points of failure, allowing every node to
Runs full-graph processing jobs as MapReduce or Spark tasks on a Hadoop cluster for offline computation.
HBase is a distributed, wide-column NoSQL store and big data storage engine designed for sparse datasets. It functions as a scalable column-oriented database built on top of the Hadoop Distributed File System to provide real-time read and write access to massive volumes of structured and unstructured data. The system acts as a cross-language database gateway, offering connectivity through native remote procedure calls, REST, and Thrift interfaces. It is distinguished by a master-worker coordination model that enables horizontal scaling and fault tolerance across the cluster. The project covers a broad set of capabilities, including fine-grained access control via cell-level visibility labels, pluggable data compression, and server-side data aggregation. It also supports big data analytics workflows through map-reduce integration and allows custom logic to run on the server side. Operational monitoring is provided through system metrics tracking and plugin-based metrics export.
Implements a distributed NoSQL wide-column store built on top of the Hadoop ecosystem for sparse datasets.
Side-Menu.Android is a reusable UI component for Android applications that provides a sliding navigation drawer. It is designed to help developers organize application sections and user options in a structured hidden panel that keeps the primary content area uncluttered. The component is distinguished by its visual presentation, which follows Material Design guidelines to ensure a consistent and intuitive user experience. It features a data-driven menu hierarchy that allows logical grouping of navigation items, and it incorporates fluid circular animations to provide polished visual transitions when the menu opens or closes. By encapsulating complex layout and interaction logic in a single modular class, the library simplifies implementing navigation across multiple screens. It supports event-driven transitions, allowing developers to decouple menu interactions from content updates and maintain a clean, responsive application architecture.
Builds data warehousing and analytics pipelines to process large datasets using scalable storage.
tech-vault is a command-line technical interview bank and knowledge base designed for practicing engineering questions across various technical domains. It functions as a terminal-based application that stores structured study materials and interview questions as markdown files, which are then rendered directly within the system console. The project distinguishes itself through a delivery model that uses command-line argument parsing to filter content by topic or difficulty. It also includes a random selection algorithm to pick individual questions from the collection for spontaneous study se
Offers practice materials covering data modeling, schema design, and data warehousing concepts.
OpenAddresses is an open-source geospatial data aggregator and directory that collects public domain and open-license address, parcel, and building datasets from governments and organizations worldwide. It functions as a global index and data warehouse for locating and distributing free geospatial records. The project operates a normalization pipeline that cleans and standardizes diverse source formats into a consistent global coordinate and attribute schema. This process includes a crowdsourced curation pipeline and programmatic quality validation to verify the spatial accuracy and formattin
Utilizes large-scale data storage to handle the global distribution of massive geospatial records.
cve-search is a vulnerability search engine and database manager designed to index, synchronize, and query CVE and CPE security vulnerability data. It functions as a security data warehouse that imports vulnerability feeds into a local database to enable fast, keyword-based discovery of security flaws. The project provides a web-based vulnerability browser and a programmatic JSON API for retrieving records and risk scores. It utilizes full-text indexing for vulnerability descriptions and implements an identity-verified security portal using the OpenID Connect standard for user authentication.
Functions as a security data warehouse by importing and indexing large sets of vulnerability information.