30 Repos
Tools for monitoring and enforcing data accuracy, completeness, and consistency standards.
Distinguishing note: Focuses on validation and reliability of data inputs rather than data processing or storage.
30 GitHub repositories in Data & Databases · Data Quality Frameworks.
Codegraph is a local codebase indexer and static analysis graph database that serves as a context provider for AI agents. It parses multiple programming languages into a searchable knowledge graph of symbols and dependencies, exposing these relationships to AI tools through the Model Context Protocol. The project distinguishes itself by aggregating relevant code snippets and symbol flows to reduce token usage for large language models. It automates the configuration of server settings and steering instructions across various AI agent platforms and command line editors to enable automatic code
Formats complex graph data into markdown or JSON to ensure AI agents can efficiently consume codebase relationships.
Quivr is a framework for building retrieval-augmented generation pipelines that connect large language models to custom knowledge bases. It serves as a generative AI integration layer that abstracts the process of transforming diverse document sources into searchable context for AI responses. The project orchestrates the end-to-end flow between document ingestion, vector storage management, and model provider interfaces. It features a vector-store-agnostic retrieval system and a modular API layer that allows for flexible switching between different generative model providers. The system cove
Provides pipelines for parsing and converting raw files into searchable embeddings for AI knowledge bases.
Quiver is a framework for integrating retrieval-augmented generation into applications. It provides a generative AI integration layer that connects large language models with vector stores to produce context-aware responses based on custom data. The project features a knowledge base pipeline that parses diverse file types into searchable embeddings and a vector database orchestrator to manage data across different storage implementations. It uses a provider-agnostic model interface, allowing users to switch between various external AI providers or local models through a single unified sys
Prepares and imports custom data from various file types to create high-quality knowledge bases for AI consumption.
OpenHuman is an AI application framework for building private intelligence systems and personal AI layers. It provides a system for deploying private AI assistants that execute technical tasks and manage personal knowledge bases. The project features a model-agnostic request proxy that routes AI workloads to different large language models based on requirements for reasoning, speed, or vision. It integrates an OAuth-driven data integrator to synchronize personal information from external services into a local knowledge base composed of hierarchical Markdown summaries. The framework also inclu
Implements a local database that converts third-party data into hierarchical Markdown summaries to serve as memory for AI models.
This project is a comprehensive, curated knowledge base designed to support the development and maintenance of production-grade machine learning systems. It serves as a centralized repository of industry-standard technical literature, engineering case studies, and research papers, providing a structured reference for practitioners navigating the complexities of modern data science and machine learning engineering. The resource distinguishes itself through a cross-domain approach that bridges the gap between academic research and practical implementation. By synthesizing proven industry archit
Monitors and enforces standards for data accuracy, completeness, and consistency to ensure reliable inputs for downstream analytical and machine learning processes.
DevOps-Roadmap is a comprehensive educational repository and knowledge base designed to guide technical professionals through the complexities of modern software engineering. It functions as a structured curriculum and reference library, covering the full spectrum of skills required to master system architecture, infrastructure management, and cloud operations. The project distinguishes itself by bridging the gap between high-level architectural design and the practical realities of engineering leadership. It provides curated insights into distributed systems, data consistency, and scalable d
Provides guidance on managing and preparing data ecosystems for AI-driven applications.
DocsGPT is a retrieval-augmented generation platform and private knowledge base used to build AI agents that perform grounded search and analysis. It functions as a multi-model AI orchestrator and enterprise agent builder, allowing for the integration of various local and cloud language models to customize reasoning and text generation. The project provides a visual environment for developing automated assistants using conditional logic and third-party API connectivity. It enables the creation of private AI agents capable of performing enterprise search and detailed document analysis using pr
Creates a searchable repository by ingesting documents, web pages, and audio files for AI consumption.
This project is a comprehensive framework for engineering financial data pipelines, designed to automate the collection, cleaning, and synchronization of large-scale market datasets. It functions as a quantitative trading data engine, providing the infrastructure necessary to manage historical and real-time asset pricing information for research and machine learning workflows. The system distinguishes itself through a configuration-driven approach to orchestration, allowing users to manage complex data acquisition tasks across multiple financial providers. It features resilient middleware tha
Enforces accuracy and consistency standards on market data inputs using logical invariants and schema requirements.
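As a rough illustration of what "logical invariants and schema requirements" on market data can mean, here is a minimal sketch for a single price bar. The field names, the `BAR_SCHEMA` table, and the `validate_bar` helper are hypothetical and are not taken from the project; they only show the general shape of such checks.

```python
from typing import Any

# Hypothetical schema for one OHLCV bar: field name -> accepted types.
BAR_SCHEMA = {
    "symbol": (str,),
    "open": (int, float),
    "high": (int, float),
    "low": (int, float),
    "close": (int, float),
    "volume": (int,),
}


def validate_bar(bar: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the bar passed."""
    errors = []

    # Schema requirements: every field is present and has an accepted type.
    # bool is rejected explicitly because it is a subclass of int.
    for field, types in BAR_SCHEMA.items():
        if field not in bar:
            errors.append(f"missing field: {field}")
        elif isinstance(bar[field], bool) or not isinstance(bar[field], types):
            errors.append(f"bad type for {field}: {type(bar[field]).__name__}")
    if errors:
        return errors  # invariants are only meaningful on well-typed bars

    # Logical invariants: relationships that must hold between fields.
    if bar["high"] < max(bar["open"], bar["close"], bar["low"]):
        errors.append("high is below open, close, or low")
    if bar["low"] > min(bar["open"], bar["close"]):
        errors.append("low is above open or close")
    if bar["volume"] < 0:
        errors.append("negative volume")
    return errors
```

Returning a list of violations rather than raising on the first one lets a pipeline log every problem with a record before deciding whether to drop or quarantine it.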
Memori is an AI agent memory middleware platform designed to provide persistent, context-aware recall for language models. It functions as a non-intrusive layer that intercepts outbound model requests to automatically capture interaction history and execution traces, ensuring that agents maintain continuity across sessions without requiring modifications to existing application logic. The platform distinguishes itself through a dual-model storage architecture that maintains information as both structured relational primitives for precise fact retrieval and rolling narrative summaries for situ
Processes raw dialogue into structured semantic triples and narrative summaries to create a searchable knowledge base.
LangBot is an orchestration platform designed for building, managing, and deploying AI agents. It functions as a comprehensive framework for integrating large language models with custom workflows, enabling developers to connect intelligent agents to various messaging platforms and external tools. The platform distinguishes itself through a modular, plugin-based architecture that allows for the extension of agent capabilities via custom tools and file parsers. It features a secure, sandbox-isolated runtime environment that executes untrusted code and plugin logic within resource-constrained c
Organizes and maintains collections of data that serve as the information source for AI bot responses.
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Integrates validation checks and automated policies directly into pipelines to ensure data integrity throughout the asset lifecycle.
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Monitors data quality by identifying missing values, duplicates, and outliers.
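The three checks named above (missing values, duplicates, outliers) can be sketched with the standard library alone. This is not ydata-profiling's API; the `profile` function, the row layout, and the choice of the 1.5 × IQR rule for outliers are assumptions made for illustration.

```python
from collections import Counter
from statistics import quantiles


def profile(rows, numeric_field):
    """Summarise missing values, duplicate rows, and outliers in one field."""
    # Missing values: count None per column.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1

    # Duplicates: rows that repeat an earlier row exactly.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Outliers: values outside 1.5 * IQR of the quartiles (an assumed rule).
    values = [row[numeric_field] for row in rows
              if row.get(numeric_field) is not None]
    q1, _, q3 = quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [v for v in values if v < q1 - spread or v > q3 + spread]

    return {"missing": dict(missing), "duplicates": duplicates,
            "outliers": outliers}
```

A real profiler would compute these per column and per type; the sketch only shows that each check reduces to a simple pass over the data.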
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Executes automated checks against database tables to verify data quality and business logic requirements.
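The description above says dependencies are managed through a directed acyclic graph so that transformations run in the correct order. A minimal sketch of that idea uses Python's `graphlib`; the model names are hypothetical, and this is not how dbt-core itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_report": {"orders_enriched"},
}


def execution_order(deps):
    """Return model names so that every model follows its dependencies."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Any order returned this way places the two staging models before `orders_enriched`, and `orders_enriched` before `revenue_report`; the relative order of independent models is not guaranteed.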
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Establishes formal agreements between data producers and consumers to enforce schema expectations and quality standards.
Great Expectations is a data quality testing framework and observability platform designed to monitor the reliability of data pipelines. It provides a structured environment for defining, documenting, and automating data quality assertions, allowing teams to validate datasets against expected structure and content before they move through downstream processes. The project distinguishes itself through a declarative domain-specific language that stores quality rules as version-controlled configuration files. It utilizes an execution engine abstraction to translate these high-level assertions in
Provides human-readable methods for defining declarative rules to validate data structure and content.
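To make the idea of declarative rules stored as configuration more concrete, the sketch below keeps the rules as JSON data and evaluates them with a small interpreter. The rule names (`not_null`, `between`, `in_set`), the JSON layout, and the `validate` function are invented for illustration and are not Great Expectations' actual API or file format.

```python
import json

# Hypothetical rule file contents; in practice this would live in version control.
RULES_JSON = """
[
  {"rule": "not_null", "column": "user_id"},
  {"rule": "between", "column": "age", "min": 0, "max": 120},
  {"rule": "in_set", "column": "country", "values": ["DE", "FR", "US"]}
]
"""

# Each rule name maps to a predicate over (cell value, rule definition).
CHECKS = {
    "not_null": lambda value, rule: value is not None,
    "between": lambda value, rule: value is not None
    and rule["min"] <= value <= rule["max"],
    "in_set": lambda value, rule: value in rule["values"],
}


def validate(rows, rules):
    """Evaluate every rule against every row and report failing row indexes."""
    results = []
    for rule in rules:
        check = CHECKS[rule["rule"]]
        failed = [i for i, row in enumerate(rows)
                  if not check(row.get(rule["column"]), rule)]
        results.append({"rule": rule["rule"], "column": rule["column"],
                        "success": not failed, "failed_rows": failed})
    return results
```

Because the rules are plain data, they can be reviewed and diffed like any other configuration, while the interpreter that executes them can change independently.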
Quarkdown is a programmable document compiler and markdown static site generator. It transforms markdown source files into structured outputs, serving as a tool for generating professional books, academic papers, and digital presentations. The system distinguishes itself through a programmable layout engine that allows for the use of functions, variables, loops, and conditional logic within markdown files. It includes an interactive read-eval-print loop for testing these document functions and syntax in real time before final compilation. Additionally, it provides a specialized format for sup
Provides a specialized documentation format to supply offline wiki pages and API references to AI agents.
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Configures whether AI systems use exclusively internal enterprise data or a combination of model and internal knowledge.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Stores external content as embeddings and metadata to serve as a high-performance knowledge base for AI agents.
Omniparse is a multimodal content parser and generative AI ingestion engine designed to convert documents, images, and multimedia into a uniform format. It functions as a data preprocessing pipeline that transforms diverse raw data sources into structured markdown to improve the performance of large language model workflows. The system extracts text and structural data from PDFs, images, audio, and video files. It includes a web crawler that converts dynamic website content into clean markdown and a multimodal transformation process that maps disparate input formats into a unified data schema
Builds standardized data flows to convert raw files into structured formats for AI knowledge bases.
OpenGPTs is a platform for building, deploying, and managing customizable AI assistants. It serves as an orchestrator that allows for the configuration of large language models with specific personas, cognitive architectures, and tool integrations. The system provides a complete lifecycle manager for AI agents, enabling the drafting of configurations, testing within sandboxes, and publishing assistants for public or internal distribution. It integrates a knowledge base interface using retrieval-augmented generation to attach documents to bots for context-aware responses. The platform covers
Provides an interface for uploading and querying documents within an AI-powered knowledge base.