7 repositories
Utilities for transforming, aggregating, and analyzing raw data streams.
Distinguishing note: Focuses on server-side computation, distinct from client-side event collection.
Explore 7 awesome GitHub repositories matching data & databases · Data Processing.
Umami is a self-hosted, privacy-focused web analytics platform designed to provide full control over infrastructure and user data. It captures website traffic and visitor behavior through anonymous tracking methods that avoid cookies, browser fingerprinting, and the storage of personally identifiable information. The platform distinguishes itself through a comprehensive suite of behavioral analysis tools, including session replays, heatmaps, and cohort-based retention reporting. It features a multi-tenant architecture that allows teams to manage multiple websites within a single, collaborativ
Aggregates raw event logs into meaningful insights on the server to minimize client-side overhead.
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Performs local data aggregation to reduce network traffic and compute load before forwarding to global nodes.
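As a rough illustration of aggregating locally before forwarding, the sketch below collapses raw events into one summary per time window and metric name, so fewer records need to cross the network. It is plain Python and does not use Vector's configuration or API; the function name `aggregate_locally`, the event fields (`ts`, `name`, `value`), and the 10-second window are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate_locally(events, window_seconds=10):
    """Collapse raw events into one summary per (window, name) pair.

    Hypothetical helper: the event shape and window size are assumptions,
    not taken from any specific pipeline tool.
    """
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for event in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = event["ts"] - (event["ts"] % window_seconds)
        bucket = buckets[(window_start, event["name"])]
        bucket["count"] += 1
        bucket["sum"] += event["value"]
    # Emit one compact record per bucket, ordered by window and name.
    return [
        {"window_start": w, "name": n, "count": b["count"], "sum": b["sum"]}
        for (w, n), b in sorted(buckets.items())
    ]


raw_events = [
    {"ts": 100, "name": "requests", "value": 1.0},
    {"ts": 103, "name": "requests", "value": 1.0},
    {"ts": 109, "name": "requests", "value": 1.0},
    {"ts": 112, "name": "requests", "value": 1.0},
]
summaries = aggregate_locally(raw_events)
```

Here four raw events become two summary records, one per 10-second window, and only those summaries would be forwarded downstream.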
This project provides a comprehensive guide to architectural patterns and best practices for building scalable, maintainable, and performant web applications using FastAPI. It focuses on standardizing development approaches for Python web services, emphasizing robust request validation, dependency injection, and automated documentation standards to ensure consistent API design. The guide distinguishes itself by promoting domain-driven modular packaging, which organizes application logic into isolated, feature-based directories to support long-term codebase scalability. It also details strateg
Performs complex data joins and aggregations directly within the database engine for native performance.
This project is a high-performance MQTT broker and IoT data platform designed to manage millions of concurrent device connections. It provides a scalable infrastructure for ingesting, processing, and routing telemetry data across distributed systems, utilizing an actor-based concurrency model to maintain high availability and state synchronization across cluster nodes. The platform distinguishes itself through integrated stream processing and edge computing capabilities. It allows users to execute declarative SQL-based rules directly against incoming message streams for real-time filtering, t
Filters, aggregates, and transforms data streams locally to reduce bandwidth consumption and enable low-latency responses.
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Runs custom serverless code during object requests to filter or modify data in real-time.
lakeFS is a versioning system for data lakes that provides Git-like branching and commits for large datasets stored in object storage. It acts as a version control layer, enabling immutable snapshots, atomic commits, and zero-copy branching to create isolated environments for data experiments without duplicating physical files. The system functions as an S3-compatible storage gateway and an Iceberg REST catalog, allowing standard cloud storage protocols and compatible clients to manage versioned tables. It serves as a data quality gate, using an event-driven hooks system to validate datasets against governance policies before changes are merged into production. The platform covers a broad range of data governance capabilities, including collaboration through pull requests, role-based access control, and data lineage tracking. It provides integrations for workflow orchestration, machine learning pipelines, and various big data compute engines, and supports multi-cloud storage connectivity and identity synchronization via SSO and SCIM. The software can be installed using binaries, containers, or Helm charts for deployment on Kubernetes.
Updates embeddings by processing only the added, removed, or modified data between two commits.
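The idea of recomputing only what changed between two commits can be sketched as follows. This is a minimal, library-free illustration, not lakeFS's API: each commit is modeled as a plain dictionary mapping an object path to its content, and `diff_snapshots`, `update_index`, and the placeholder `fake_embed` are hypothetical names introduced for this example.

```python
def diff_snapshots(old, new):
    """Return the paths added, removed, and modified between two snapshots."""
    added = {p for p in new if p not in old}
    removed = {p for p in old if p not in new}
    modified = {p for p in new if p in old and old[p] != new[p]}
    return added, removed, modified


def fake_embed(text):
    # Placeholder embedding (assumption): a real system would call a model here.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def update_index(index, old, new):
    """Apply only the changes between two snapshots to an embedding index."""
    added, removed, modified = diff_snapshots(old, new)
    for path in removed:
        index.pop(path, None)  # drop embeddings for deleted objects
    for path in added | modified:
        index[path] = fake_embed(new[path])  # recompute only changed objects
    return len(added | modified)


old_commit = {"a.txt": "alpha", "b.txt": "beta", "c.txt": "gamma"}
new_commit = {"a.txt": "alpha", "b.txt": "beta v2", "d.txt": "delta"}

index = {path: fake_embed(text) for path, text in old_commit.items()}
recomputed = update_index(index, old_commit, new_commit)
```

In this example only `b.txt` (modified) and `d.txt` (added) are re-embedded, `c.txt` is dropped from the index, and the unchanged `a.txt` is left untouched.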
This project is a C++ learning resource and study guide consisting of structured notes and programming examples. It provides practical implementations and exercise solutions covering core language syntax, data types, and control flow. The repository features specialized samples for object-oriented design, including class inheritance, polymorphism, and abstract classes. It includes demonstrations of memory management techniques such as dynamic allocation, move semantics, and placement new, as well as template programming examples for creating generic functions and data structures. The codebas
Implements logic to aggregate and calculate totals from multidimensional grid-based data structures.