32 repositories
Utilities that standardize heterogeneous data inputs into consistent schemas or unified formats for downstream analysis.
Explore 32 GitHub repositories matching Data & Databases · Data Normalization and Schema Enforcement.
OpenBB is a financial data platform and investment research terminal designed to aggregate, normalize, and distribute market data across analytical workflows. It functions as a comprehensive ecosystem that bridges disparate financial data providers with custom applications, spreadsheets, and internal modeling infrastructure. The platform distinguishes itself through a provider-based data abstraction layer that normalizes heterogeneous financial APIs into a consistent, schema-driven format. This architecture supports quantitative research automation and the construction of interactive, widget-
Enforces standardized data structures to ensure information from heterogeneous financial APIs remains consistent throughout the research pipeline.
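As a rough illustration of the provider-based abstraction described above, the following sketch maps two differently shaped payloads onto one shared record type. It is not OpenBB's API: the `Quote` schema, the provider names `"a"` and `"b"`, and the raw field names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


@dataclass(frozen=True)
class Quote:
    """Hypothetical standard schema that every provider adapter must produce."""
    symbol: str
    day: date
    close: float


def from_provider_a(raw: dict[str, Any]) -> Quote:
    # Imaginary provider "a" uses short keys and string-typed prices.
    return Quote(raw["ticker"].upper(), date.fromisoformat(raw["date"]), float(raw["c"]))


def from_provider_b(raw: dict[str, Any]) -> Quote:
    # Imaginary provider "b" uses verbose keys and numeric prices.
    return Quote(
        raw["Symbol"].upper(),
        date.fromisoformat(raw["TradeDate"]),
        float(raw["ClosePrice"]),
    )


# Registry of adapters: one normalizing function per provider name.
ADAPTERS: dict[str, Callable[[dict[str, Any]], Quote]] = {
    "a": from_provider_a,
    "b": from_provider_b,
}


def normalize(provider: str, raw: dict[str, Any]) -> Quote:
    """Convert a provider-specific payload into the shared Quote schema."""
    return ADAPTERS[provider](raw)
```

Downstream code then depends only on `Quote`, so adding a provider means adding one adapter function rather than touching every consumer.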
This project is a command-line storage manager that provides a unified interface for performing file operations across local filesystems and diverse cloud storage providers. It functions as a cross-platform storage abstraction, utilizing a modular backend architecture to map heterogeneous cloud storage APIs into a standard set of file system operations. This allows for consistent data management and movement regardless of the underlying storage service. The tool serves as a network data transfer engine designed for automated data migration and cloud storage synchronization. It distinguishes i
Intercepts and modifies file attributes during transfer to match the requirements of the destination storage backend.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Standardizes heterogeneous data sources into consistent structures to ensure schema uniformity across indexing and reasoning components.
FastMCP is a Python framework designed for building servers that expose functions, resources, and prompts to AI models using the Model Context Protocol. It simplifies the development process by automatically deriving tool metadata, input schemas, and documentation directly from Python function signatures and type hints. The framework provides a unified container for managing these components, allowing developers to build modular applications that integrate seamlessly with AI assistants. The project distinguishes itself through its support for interactive, server-defined user interface compone
Inlines local references and resolves root-level definitions to ensure schema compatibility.
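Inlining local references can be sketched as follows. This is an assumption-laden illustration, not FastMCP's implementation: the function name `inline_local_refs` is hypothetical, only `#/$defs/...` pointers are handled, and recursive definitions are rejected rather than supported.

```python
import copy
from typing import Any


def inline_local_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with local "#/$defs/..." references inlined.

    Hypothetical helper: handles only the "#/$defs/<name>" pointer form.
    """
    defs = schema.get("$defs", {})
    prefix = "#/$defs/"

    def resolve(node: Any, seen: frozenset) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith(prefix):
                name = ref[len(prefix):]
                if name in seen:
                    raise ValueError(f"recursive reference to {name!r}")
                # Expand the definition itself, tracking names to detect cycles.
                target = resolve(copy.deepcopy(defs[name]), seen | {name})
                # Keywords written next to $ref are kept and take precedence.
                siblings = {k: resolve(v, seen) for k, v in node.items() if k != "$ref"}
                return {**target, **siblings}
            return {k: resolve(v, seen) for k, v in node.items()}
        if isinstance(node, list):
            return [resolve(v, seen) for v in node]
        return node

    # Drop the root-level definitions once every reference has been expanded.
    root = {k: v for k, v in schema.items() if k != "$defs"}
    return resolve(root, frozenset())
```

The result is a self-contained schema, which suits consumers that cannot follow `$ref` pointers.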
Airbyte is a data integration platform designed to synchronize information between diverse applications, databases, and data warehouses. It functions as an extract, transform, and load orchestrator that manages automated data movement workflows across cloud, on-premise, and hybrid environments. The platform provides a standardized interface for connectors, enabling the movement of structured and unstructured data while maintaining stateful checkpoints for reliable incremental syncing. The platform distinguishes itself through a containerized architecture that isolates connectors to prevent de
Automatically maps raw incoming data to structured, typed schemas for downstream compatibility.
Figma-Context-MCP is a design-to-code automation tool that functions as a server for the Model Context Protocol. It acts as a bridge between visual design platforms and development environments, enabling large language models to access design file metadata and component properties directly. The project distinguishes itself by providing a standard-compliant interface that translates design specifications into structured data. By extracting layout and styling information, it facilitates the programmatic conversion of design tokens and component requirements into actionable code structures. Thi
Normalizes raw design properties into consistent interface definitions for automated environments.
Jackett is a self-hosted background service that functions as a BitTorrent tracker aggregator and proxy. It enables automated media management applications to query multiple torrent indexers simultaneously by translating standardized search requests into site-specific formats and consolidating the resulting data into a single, unified feed. The service distinguishes itself through an adapter-based architecture that handles the complexities of disparate tracker interfaces and security protocols. It integrates with external proxy services to bypass anti-bot challenges and maintain persistent ac
Intercepts and modifies metadata during data transfer processes to ensure accurate content matching.
YoutubeDownloader is a desktop application designed to retrieve and archive video and audio content from online platforms. It enables users to download media files directly to local storage, providing options to select specific quality levels and file formats to suit local playback requirements. The application distinguishes itself through its ability to access restricted or private content by utilizing personal account credentials. By managing session authentication, it allows for the retrieval of media that is not accessible to the general public. Furthermore, it incorporates automated work
Injects descriptive metadata and stream information into media containers after the download completes.
Amass is an attack surface management tool designed to identify, map, and inventory an organization's internet-facing digital assets. It functions as a security asset discovery engine that systematically expands an organization's known infrastructure footprint through recursive domain name resolution and the collection of intelligence from diverse public data sources. The platform distinguishes itself by utilizing a graph-based modeling approach to organize discovered resources. By maintaining a persistent graph database, it tracks the relationships between infrastructure components and norma
Standardizes disparate intelligence data into a unified schema for consistent analysis and reporting.
Nightingale is a Prometheus-compatible monitoring and alerting platform designed to centralize telemetry management across multiple time-series databases. It functions as a multi-source alerting engine and metric data pipeline that ingests telemetry via remote write protocols and triggers alarms based on data from sources such as Prometheus, Elasticsearch, Loki, and ClickHouse. The system is distinguished by its automated alert healing system, which executes predefined scripts and RPC-based corrective actions when monitoring thresholds are breached. It supports distributed alert processing, a
Transforms alert data through a sequence of relabeling, filtering, and metadata enrichment steps before notification.
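A sequence of relabeling, filtering, and enrichment steps can be modeled as a list of small functions applied in order. The sketch below is generic and does not reflect Nightingale's configuration or code: the `Alert` shape, the step factories, and `run_pipeline` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

Alert = dict[str, str]
# A step returns a transformed alert, or None to drop it from the pipeline.
Step = Callable[[Alert], Optional[Alert]]


def relabel(mapping: dict[str, str]) -> Step:
    """Rename label keys according to `mapping`; other keys pass through."""
    def step(alert: Alert) -> Optional[Alert]:
        return {mapping.get(key, key): value for key, value in alert.items()}
    return step


def drop_if(predicate: Callable[[Alert], bool]) -> Step:
    """Filter out alerts for which `predicate` is true."""
    def step(alert: Alert) -> Optional[Alert]:
        return None if predicate(alert) else alert
    return step


def enrich(extra: dict[str, str]) -> Step:
    """Add default metadata without overwriting labels already present."""
    def step(alert: Alert) -> Optional[Alert]:
        return {**extra, **alert}
    return step


def run_pipeline(alert: Alert, steps: list[Step]) -> Optional[Alert]:
    """Apply each step in order; stop early if a step drops the alert."""
    current: Optional[Alert] = alert
    for step in steps:
        current = step(current)
        if current is None:
            return None
    return current
```

Because each step has the same signature, steps can be reordered or extended without changing `run_pipeline`.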
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
Standardizes inputs, retrieved contexts, and generated responses into a consistent format for cross-platform performance analysis.
This project is a command-line forensic toolkit designed for the investigation and security auditing of mobile devices. It provides a framework for collecting system logs, application data, and forensic artifacts to identify potential security breaches, unauthorized access, or evidence of malicious activity. The utility employs a modular extraction architecture that parses diverse file formats and system logs into a standardized, normalized data structure. By utilizing this unified format, the tool performs both heuristic analysis of system metadata and pattern matching against structured thr
Normalizes raw device logs and backups into a standardized format for consistent cross-platform analysis.
Joyagent-jdgenie is an automated data orchestrator designed to centralize the retrieval and processing of information from disparate remote sources. It functions as a framework for building repeatable data pipelines that fetch, clean, and normalize raw input into consistent, structured formats. The system utilizes a schema-driven engine to apply validation rules and structural templates to incoming data, ensuring compatibility across enterprise systems. By employing configuration-based workflow definitions, it allows for the orchestration of modular tasks into automated execution flows, separ
Applies structural templates and validation rules to raw incoming information to ensure enterprise-wide consistency.
This project is a command-line utility designed to monitor and analyze token consumption and financial expenditure for AI coding assistants. By parsing local session logs directly on the user's machine, it provides a privacy-focused way to track development activity without transmitting sensitive data to external servers. The tool distinguishes itself through its ability to aggregate disparate log formats from multiple coding assistants into a unified, schema-agnostic representation. It features a decoupled pricing engine that allows users to apply custom model-specific cost multipliers, over
Normalizes disparate log formats from multiple coding assistants into a unified internal representation for consistent analysis.
Sigma is a generic SIEM signature format and log event pattern standard used to describe malicious activity. It provides a vendor-neutral system for defining security event patterns in YAML, ensuring that detection logic remains portable across different monitoring platforms. The project maintains a curated library of peer-reviewed detection rules that identify threats and compliance violations. This standardized approach allows for the exchange of threat hunting logic and the translation of generic signatures into specific queries for various security information and event management systems
Enforces a strict data model for log event patterns to ensure consistency across shared detection rules.
Sub-Store is a proxy subscription management server that aggregates multiple subscription links into a single unified stream for distribution to various clients. It functions as a transformation pipeline that filters, modifies, and reformats proxy node metadata. The system acts as a cross-platform format converter to ensure compatibility across diverse client applications. It includes an encryption/decryption gateway that uses private keys to handle age-standard encrypted subscription content and a cache-layered aggregator to reduce external requests. The server provides capabilities for dyn
Provides a pipeline to intercept and modify proxy node metadata during the data transfer process.
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large-scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Validates that incoming data matches the defined table structure to prevent corruption.
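The idea of rejecting writes that do not match the table structure can be shown with a small in-memory sketch. It does not use the Delta API: the `SchemaMismatch` exception, the `validate_row` and `append` helpers, and the list-backed table are hypothetical stand-ins.

```python
from typing import Any


class SchemaMismatch(ValueError):
    """Raised when a row does not match the declared table schema."""


def validate_row(row: dict[str, Any], schema: dict[str, type]) -> None:
    """Check column names and value types of one row against `schema`."""
    extra = row.keys() - schema.keys()
    missing = schema.keys() - row.keys()
    if extra or missing:
        raise SchemaMismatch(f"extra={sorted(extra)} missing={sorted(missing)}")
    for column, expected in schema.items():
        value = row[column]
        # None is treated as a nullable value in this sketch.
        if value is not None and not isinstance(value, expected):
            raise SchemaMismatch(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )


def append(table: list[dict[str, Any]], rows: list[dict[str, Any]],
           schema: dict[str, type]) -> None:
    """Validate the whole batch first, then append it, so a bad row writes nothing."""
    for row in rows:
        validate_row(row, schema)
    table.extend(rows)
```

Validating the full batch before mutating the table keeps the write all-or-nothing, a simplified analogue of an atomic commit.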
InfoSpider is a personal data aggregator and digital footprint analyzer. It extracts user activity and history from social platforms and local browser database files to consolidate information into a unified format. The system functions as a social media archiving tool that converts feed data and albums from external links into downloadable PDF documents for offline preservation. It also serves as a browser history extractor that reads local SQLite database files to retrieve and analyze web navigation history. The project covers capabilities for data aggregation, digital footprint analysis,
Standardizes disparate activity logs from multiple platforms into a single unified format.
OmniAuth is a Rack authentication framework that allows applications to verify user identities through third-party service providers using a single standardized interface. It functions as middleware to separate identity verification from core application logic by intercepting incoming requests. The project employs a strategy-pattern provider model to encapsulate provider-specific logic into interchangeable classes. It provides a custom authentication strategy framework and base classes for building new providers based on industry standards. The framework handles the multi-step authentication
Normalizes diverse user data from different third-party sources into a consistent hash structure.
VidBee is a self-hosted media download manager that wraps the yt-dlp engine to download videos and audio from over 1000 websites. It functions as both a desktop client and a Fastify-based web service, managing downloads through a persistent queue with pause, resume, retry, and real-time progress tracking. The application uses cookie-based authentication to access login-gated, age-restricted, or subscriber-only content by importing browser cookies or Netscape-format cookie files. The application distinguishes itself through automated download workflows, including RSS and Atom feed monitoring t
Allows users to select output container formats like MP4, MKV, or WebM for downloads.