38 Repos
Tools for synchronizing analytics metrics into centralized data warehouses.
Distinguishing note: Focuses on long-term storage and warehouse synchronization rather than real-time reporting.
Explore 38 awesome GitHub repositories matching Data & Databases · Data Warehouse Integrations.
This project is an open-source, privacy-focused web analytics platform designed for high-throughput data ingestion and multi-tenant data management. It provides a cookie-less tracking engine that captures visitor interactions using ephemeral request metadata, ensuring comprehensive traffic visibility while maintaining strict privacy standards. The architecture utilizes an event-driven ingestion pipeline and aggregated metric storage to decouple data collection from processing, enabling efficient long-term retrieval and responsive dashboard performance. What distinguishes this platform is its
Pipes analytics information into external data storage systems to support long-term data warehousing and complex analytical pipelines.
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Establishes secure connections to managed cloud database services and data warehouses using enterprise authentication.
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Provides secure connectivity for executing SQL queries and managing datasets in cloud data warehouses.
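The state-machine model described for this entry can be sketched in a few lines. This is a minimal illustration, not Prefect's API: the state names, the `ALLOWED` table, and the `TaskRun` class are all hypothetical and chosen only to show discrete transitions recorded in an append-only log.

```python
# Hypothetical states and transitions, chosen for illustration only.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"COMPLETED", "FAILED"},
    "FAILED": {"RUNNING"},      # a failed run may be retried
    "COMPLETED": set(),         # terminal state
}


class TaskRun:
    """Tracks one task run as a sequence of validated state transitions."""

    def __init__(self):
        self.state = "PENDING"
        self.history = ["PENDING"]  # append-only log of every state entered

    def transition(self, new_state):
        # Reject any move that the transition table does not allow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)


run = TaskRun()
for state in ("RUNNING", "FAILED", "RUNNING", "COMPLETED"):
    run.transition(state)
print(run.history)
```

Because every change passes through `transition`, the log alone is enough to reconstruct how a run reached its current state.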
Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools. The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
Provides secure connectivity to enterprise data warehouses for consistent analytics.
Plotly.py is a comprehensive framework for building production-ready data applications and interactive dashboards directly from Python code. It functions as both a high-performance visualization library for browser-based charts and a full-stack tool for transforming analytical scripts into responsive, web-based interfaces. By abstracting away the need for manual HTML or JavaScript, it allows developers to define complex layouts and functional logic using modular, reusable components. The framework distinguishes itself through a robust architecture that handles event orchestration and state sy
Links analytical applications to external data warehouses to enable automated processing and reporting across infrastructure.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Provides secure, real-time interactive connectivity to cloud-hosted data warehouse clusters for SQL analysis.
Quarkus is a Kubernetes-native Java framework designed for building high-performance, memory-efficient applications. It utilizes ahead-of-time native compilation to transform Java code into standalone, optimized binaries that eliminate the need for a virtual machine, enabling rapid startup and reduced memory consumption. By performing code augmentation during the build phase, it shifts heavy processing tasks away from runtime, ensuring that applications are optimized for cloud-native environments. The framework distinguishes itself through a unified approach to reactive and imperative program
Integrates managed relational database services like MySQL and PostgreSQL for persistent data storage.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Transfers processed document data into specified databases, schemas, and tables within cloud-based data warehouses.
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Establishes secure connectivity between the transformation engine and various cloud-hosted data warehouse platforms.
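Ordering transformations through a directed acyclic graph, as this entry describes, amounts to a topological sort. The sketch below uses Python's standard `graphlib` module rather than anything from dbt-core itself, and the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key depends on the models in its set.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}


def execution_order(deps):
    """Return a run order in which every model follows its upstream models."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order the sorter returns places both staging models before `orders_enriched`, and `orders_enriched` before `daily_revenue`. A cycle in the graph would raise `graphlib.CycleError` instead of producing an order.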
Nakama is a distributed server framework designed for real-time multiplayer games and social applications. It provides an authoritative runtime environment for executing game logic, ensuring consistent state and cheat-resistant gameplay across diverse client platforms. The system acts as a centralized backend, managing persistent player identities, social graphs, and real-time communication channels to support complex multiplayer interactions. The platform distinguishes itself through an integrated suite of LiveOps tools that allow developers to manage game economies, schedule time-bound even
Streams raw player and system event data to external data warehouses for analytics.
Debezium is a distributed change data capture platform that streams row-level database modifications as real-time events. By parsing database transaction logs, the system broadcasts structural and data changes to message brokers, enabling reactive processing and data integration across distributed architectures. The platform utilizes log-based capture to extract modifications directly from transaction logs, ensuring minimal impact on source system performance while maintaining the original commit order of operations. It employs database-specific connector adapters to translate proprietary bin
Maintains analytical stores by streaming live database updates into data warehouses for real-time intelligence.
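To see why preserving commit order matters when streaming row-level changes into an analytical store, consider a minimal consumer. The event shape here (`op`, `before`, `after`, and an `id` primary key) is a simplified assumption for illustration and not a complete Debezium event envelope.

```python
def apply_change(replica, event):
    """Apply one row-level change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):      # create or update: store the new row image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":           # delete: drop the row named by the old image
        replica.pop(event["before"]["id"], None)
    return replica


# Hypothetical events, listed in the order they were committed at the source.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:  # replaying out of order could resurrect deleted rows
    apply_change(replica, event)
print(replica)
```

After the replay the replica holds only row 1 with status `paid`, which matches the source table's final state.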
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Executes generated queries across multiple warehouse types using standard drivers while maintaining a unified interface.
go-cloud is a toolkit of cloud-agnostic libraries that provide portable Go interfaces for interacting with common cloud services. It enables multi-cloud application development by decoupling business logic from specific provider API implementations. The project utilizes a driver-based system to map generic interface calls to vendor-specific requests. This allows applications to switch between different cloud backends for blob storage, relational databases, and asynchronous publish-subscribe messaging without changing the core application code. Beyond storage and messaging, the toolkit includ
Integrates relational database services using portable connectors that prevent vendor lock-in.
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
Provides secure connectivity to cloud-hosted data warehouses and managed database services for importing experimentation data.
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
Writes predictions and computation outputs to data warehouses or caches to power downstream systems.
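Resuming a failed run from the last successful step can be sketched as follows. This is not Metaflow's API: `run_pipeline`, the step functions, and the `store` dictionary (standing in for remote artifact storage) are hypothetical names used only to illustrate the checkpointing idea.

```python
def run_pipeline(steps, checkpoints):
    """Run steps in order, skipping any step whose artifact is already checkpointed."""
    executed = []
    data = None
    for name, func in steps:
        if name in checkpoints:
            data = checkpoints[name]   # reuse the persisted artifact
            continue
        data = func(data)
        checkpoints[name] = data       # persist only after the step succeeds
        executed.append(name)
    return data, executed


def load(_):
    return [1, 2, 3]


def double(xs):
    return [x * 2 for x in xs]


def flaky_total(xs):
    raise RuntimeError("simulated failure")


def total(xs):
    return sum(xs)


store = {}  # stands in for remote storage
try:
    run_pipeline([("load", load), ("double", double), ("total", flaky_total)], store)
except RuntimeError:
    pass  # first run fails at the last step; earlier artifacts remain in store

result, executed = run_pipeline([("load", load), ("double", double), ("total", total)], store)
print(result, executed)
```

On the second run only the `total` step executes, because the artifacts for `load` and `double` were persisted before the failure.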
JimuReport is an open-source reporting and dashboard engine designed to be embedded directly into Spring Boot applications. Its core identity centers on generating data reports and full-screen dashboards from natural language descriptions, eliminating the need for manual design. The platform also provides a conversational query interface that translates plain-language questions into database queries, returning results as tables and charts without requiring SQL knowledge. What distinguishes JimuReport is its integration of AI skills that can be installed with a single command, enabling report
Connects to Apache Doris data warehouse as a data source for reports and dashboards.
This project is a plugin framework and agentic workflow library designed to connect large language models to professional toolstacks. It provides a system for integrating language models with external data warehouses, CRMs, and other enterprise software to retrieve and manipulate real-time business data. The framework enables the automation of specialized professional tasks through a file-based plugin definition system. It allows for the customization of domain expertise and plugin behavior to align with internal company processes, supported by an enterprise data connector that links models t
Provides secure connectivity modules for linking language models to cloud-hosted data warehouses and BI tools.
Realtime is a real-time data distribution and synchronization engine that enables applications to stream database changes and coordinate state between clients. It functions as a synchronization layer that monitors database write-ahead logs to provide change data capture and pushes updates to authorized clients via WebSockets. The project features a real-time presence server for tracking the online status of active users and a broadcast service for sending ephemeral messages without database persistence. It organizes communication through channel-based message routing and uses a structured JSO
Streams database changes to external data warehouses in real time without manual pipelines.
GrowthBook is a feature flagging and experimentation platform that utilizes a warehouse-native approach to data analysis. It serves as a system for managing feature rollouts and conducting A/B tests by executing SQL queries directly against existing data warehouses to calculate experiment results. The platform is distinguished by its integration of a Model Context Protocol server, which allows AI coding assistants and IDEs to manage flags and query analytics using natural language. It also provides specialized capabilities for AI model optimization, enabling the testing of prompts and models
Provides secure connectivity to external data warehouses to enable warehouse-native analysis and experimentation.
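A warehouse-native analysis computes experiment results with SQL where the data already lives. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse. The table names, columns, and query are assumptions for illustration and are not GrowthBook's actual schema or generated SQL.

```python
import sqlite3

# In-memory SQLite stands in for a real data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exposures (user_id TEXT, variation TEXT);
    CREATE TABLE conversions (user_id TEXT);
    INSERT INTO exposures VALUES
        ('a', 'control'), ('b', 'control'),
        ('c', 'treatment'), ('d', 'treatment');
    INSERT INTO conversions VALUES ('a'), ('c'), ('d');
""")

# Count exposed users and converted users per variation.
query = """
    SELECT e.variation,
           COUNT(DISTINCT e.user_id) AS users,
           COUNT(DISTINCT c.user_id) AS converted
    FROM exposures e
    LEFT JOIN conversions c ON c.user_id = e.user_id
    GROUP BY e.variation
    ORDER BY e.variation
"""
results = {variation: (users, converted) for variation, users, converted in conn.execute(query)}
print(results)
```

Only the aggregated counts leave the database, so the raw event rows never need to be copied into the experimentation tool.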
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Exports retrieved feature data as dataframes, Arrow tables, or SQL, or writes it to data lakes or data warehouses for downstream use.
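A point-in-time correct join attaches to each training row the most recent feature value known at that row's timestamp, never a later one. The following is a minimal standard-library sketch, not Feast's retrieval API. The function name, the tuple layouts, and the sample data are hypothetical.

```python
from bisect import bisect_right


def point_in_time_join(entity_rows, feature_rows):
    """For each (entity, ts) row, attach the latest feature value with feature_ts <= ts."""
    history = {}
    # Group feature values per entity, sorted by timestamp.
    for entity, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = history.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity, ts in entity_rows:
        times, values = history.get(entity, ([], []))
        i = bisect_right(times, ts)  # number of feature rows at or before ts
        joined.append((entity, ts, values[i - 1] if i else None))
    return joined


# Hypothetical data: (entity, feature_timestamp, value) and (entity, event_timestamp).
feature_rows = [("u1", 10, 0.2), ("u1", 20, 0.5), ("u2", 15, 0.9)]
entity_rows = [("u1", 12), ("u1", 25), ("u2", 5)]

training_rows = point_in_time_join(entity_rows, feature_rows)
print(training_rows)
```

The row for `u2` at timestamp 5 gets `None` because its only feature value was recorded later, which is exactly the leakage a point-in-time join is meant to prevent.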