64 repositories
End-to-end workflows that automate the movement and sequential processing of data from source to destination.
Explore 64 GitHub repositories matching Data & Databases · Processing Pipelines.
Graphify is a knowledge retrieval system that transforms directories of source code and documentation into structured, queryable project maps. It utilizes a code-to-graph parser to extract technical metadata and system connectivity, converting a mix of code, SQL schemas, and documentation into a unified graph structure. The project distinguishes itself by integrating these knowledge graphs with AI coding assistants through a Model Context Protocol server and dedicated tool hooks. This allows AI agents to perform lookups and impact analysis on node neighbors and shortest paths to understand ho
Uses graph data to perform lookups on node neighbors and shortest paths to analyze how code changes affect the system.
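As a rough illustration of neighbor lookups and shortest-path queries used for impact analysis, here is a minimal Python sketch. The graph contents and the function names (`neighbors`, `shortest_path`, `impacted`) are hypothetical and are not taken from Graphify itself; an edge from A to B is assumed to mean "B depends on A".

```python
from collections import deque

# Hypothetical dependency graph: an edge A -> B means "B depends on A",
# so a change to A may affect B.
GRAPH = {
    "schema.sql": ["models.py"],
    "models.py": ["api.py", "reports.py"],
    "api.py": ["cli.py"],
    "reports.py": [],
    "cli.py": [],
}


def neighbors(graph, node):
    """Return the direct dependents of a node."""
    return list(graph.get(node, []))


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def impacted(graph, changed):
    """Return every node reachable from the changed node, excluding the node itself."""
    seen, queue = set(), deque([changed])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(changed)
    return seen
```

Calling `impacted(GRAPH, "models.py")` collects everything downstream of that file, while `shortest_path` shows one chain through which a change propagates.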
Understand-Anything is a codebase architecture visualization tool that transforms source code and documentation into interactive knowledge graphs. It maps files, functions, and classes into a node-edge model to visualize architectural dependencies and project structures. The project provides specialized workflows for impact analysis, tracing connectivity paths from code modifications to identify affected downstream components. It also enables technical onboarding through automated architecture tours and the conversion of technical documentation into navigable networks of interconnected ideas.
Traces connectivity paths from modified files to identify affected downstream architectural components.
Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management. The project distinguishes itself as a multi-backend machine learning
Streams large datasets into training loops by handling batching, shuffling, and preprocessing tasks automatically.
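The batching, shuffling, and preprocessing steps mentioned above can be sketched in plain Python. This is not Keras code; the `batches` generator and its parameters are hypothetical, and a real data loader would typically also handle prefetching and data that does not fit in memory.

```python
import random


def batches(samples, batch_size, preprocess=lambda x: x, seed=None):
    """Yield shuffled, preprocessed batches from an indexable collection."""
    order = list(range(len(samples)))
    # Shuffle indices rather than the data, so the input is left untouched.
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        # Preprocessing is applied lazily, one batch at a time.
        yield [preprocess(samples[i]) for i in chunk]
```

A training loop would then iterate with `for batch in batches(data, 32, preprocess=normalize)`, where `normalize` stands in for whatever per-sample transformation is needed.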
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
Processes individual data items through a sequential chain of validation, cleaning, and storage handlers before persistence.
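A sequential chain of validation, cleaning, and storage handlers might look like the following plain-Python sketch. It borrows the `process_item` naming convention from Scrapy's item pipelines, but every class here (`DropItem`, `ValidatePrice`, `CleanTitle`, `StoreInMemory`) and the `run_pipeline` helper are hypothetical stand-ins, not Scrapy's own classes.

```python
class DropItem(Exception):
    """Stand-in signal meaning 'discard this item and skip later stages'."""


class ValidatePrice:
    def process_item(self, item):
        # Validation stage: reject items that lack a required field.
        if "price" not in item:
            raise DropItem("missing price")
        return item


class CleanTitle:
    def process_item(self, item):
        # Cleaning stage: return a copy with surrounding whitespace removed.
        return {**item, "title": item["title"].strip()}


class StoreInMemory:
    def __init__(self):
        self.rows = []

    def process_item(self, item):
        # Storage stage: a list stands in for a database or file.
        self.rows.append(item)
        return item


def run_pipeline(items, stages):
    """Pass each item through every stage in order; dropped items are skipped."""
    kept = []
    for item in items:
        try:
            for stage in stages:
                item = stage.process_item(item)
        except DropItem:
            continue
        kept.append(item)
    return kept
```

Because each stage returns the item it received (or a modified copy), stages can be reordered or swapped without the others knowing.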
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Automates the ingestion, parsing, and structuring of unstructured files through a modular pipeline for downstream data analysis.
This project is a neural text-to-speech engine and voice cloning toolkit designed to generate synthetic speech that mimics the vocal characteristics of a target speaker. It functions as a real-time audio synthesizer, utilizing a deep learning pipeline to convert written text into high-fidelity speech output with minimal latency. The system employs a transfer learning framework that leverages pre-trained speaker verification models to adapt synthesis to new, unseen vocal identities. By using an encoder-based speaker embedding process, the toolkit maps variable-length audio samples into a laten
Structures speech synthesis into distinct, swappable encoder and decoder stages for modular performance optimization.
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification. By utilizing a modular architecture, the platform allows users to swap model components to balance inference speed and accuracy requirements for diverse applications. The framework distinguishes itself through its support for real-time processing and flexible deployment. It in
Adapts various dataset structures and annotation formats on-the-fly to feed training pipelines without requiring manual pre-conversion.
Zustand is a state management library that provides a centralized store for managing shared application data. It functions as a reactive container that connects application state to components, allowing them to subscribe to specific slices of data and trigger updates automatically. By utilizing selector-based data access and immutable state updates, the library ensures that components only re-render when their observed data changes, maintaining a predictable and efficient data flow. The library distinguishes itself through a pluggable, middleware-based architecture that allows for the extensi
Manages sequential data processing and API workflows to update the user interface once background tasks complete.
This project is a command-line storage manager that provides a unified interface for performing file operations across local filesystems and diverse cloud storage providers. It functions as a cross-platform storage abstraction, utilizing a modular backend architecture to map heterogeneous cloud storage APIs into a standard set of file system operations. This allows for consistent data management and movement regardless of the underlying storage service. The tool serves as a network data transfer engine designed for automated data migration and cloud storage synchronization. It distinguishes i
Orchestrates complex file operations across multiple storage platforms through a unified command interface.
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to provide context-aware responses for chat and completion requests. The system distinguishes itself through a database-agnostic abstraction layer that supports various storage backends, ranging from local disk storage to enterprise-grade vector databases. It offers flexible deployment
Standardizes the ingestion, parsing, and vectorization of files to facilitate semantic search across internal knowledge bases.
Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users to map facial identities between source and destination datasets while maintaining structural alignment and lighting consistency across video frames. The project distinguishes itself through a highly extensible plugin-based architecture that handles hardware-accelerated process
Supervises the runtime execution and monitoring of data extraction tasks across the processing pipeline.
Daily stock analysis is an automated research platform that utilizes large language models to process financial market data. The system functions as an investment analyst, transforming raw market feeds into structured reports to generate actionable trading insights. The platform distinguishes itself through a modular orchestration pipeline that allows users to integrate various artificial intelligence backends. By utilizing a provider-agnostic interface, the system enables the selection of preferred language models to interpret complex financial information according to user-defined parameter
Decomposes financial data processing into modular, swappable stages orchestrated by language models to generate insights.
RocksDB is a high-performance, embeddable persistent key-value library and storage engine based on Log-Structured Merge-trees. It is designed to provide durable storage for large-scale datasets, integrating directly into applications to manage data on flash and RAM-based hardware. The engine is distinguished by its focus on minimizing read and write amplification through multi-threaded compaction and custom memory allocators. It features specialized optimizations for flash storage, including support for zoned block devices, and provides the ability to extend store behavior via external plugin
Automatically removes old data based on a configurable time-to-live (TTL) threshold.
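To make the TTL idea concrete, here is a small in-memory sketch in Python. It does not describe RocksDB's internals or API; the `TTLStore` class, its `compact` method, and the injectable `clock` parameter are all hypothetical, chosen only to show expiry against a configurable threshold.

```python
import time


class TTLStore:
    """Toy key-value store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}     # key -> (value, written_at)

    def put(self, key, value):
        self._data[key] = (value, self.clock())

    def get(self, key, default=None):
        entry = self._data.get(key)
        # Expired entries are hidden from reads even before they are removed.
        if entry is None or self.clock() - entry[1] >= self.ttl:
            return default
        return entry[0]

    def compact(self):
        """Physically delete expired entries; returns how many were removed."""
        now = self.clock()
        expired = [k for k, (_, ts) in self._data.items() if now - ts >= self.ttl]
        for k in expired:
            del self._data[k]
        return len(expired)
```

Separating the read-time check in `get` from the physical cleanup in `compact` keeps reads cheap and lets removal happen in a background pass.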
Fyne is a cross-platform graphical user interface toolkit for the Go programming language. It provides a comprehensive framework for building native applications that run on desktop, mobile, and web environments from a single codebase. The toolkit centers on a canvas-based rendering engine and a device-independent layout engine, ensuring that visual elements maintain consistent dimensions and behavior across diverse operating systems and screen densities. The project distinguishes itself through a reactive data-binding system that automatically synchronizes application state with interface co
Registers callback functions to automatically react to state changes in bound data items.
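The callback-registration pattern behind this kind of data binding can be sketched in a few lines. Fyne itself is a Go toolkit; the Python class below, with its hypothetical `BoundValue`, `add_listener`, `get`, and `set` names, only illustrates the general observer idea and is not Fyne's API.

```python
class BoundValue:
    """Holds a value and notifies registered callbacks when it changes."""

    def __init__(self, value):
        self._value = value
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def get(self):
        return self._value

    def set(self, value):
        # Skip notification when nothing actually changed.
        if value == self._value:
            return
        self._value = value
        # Iterate over a copy so callbacks may register further listeners.
        for callback in list(self._listeners):
            callback(value)
```

A UI widget would register a callback that redraws itself, so that calling `set` on the bound value is enough to keep the display in sync.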
InvokeAI is a self-hosted, professional-grade platform designed for managing generative models and performing complex image synthesis. It provides a local application environment that allows users to execute diffusion models directly on their own hardware, ensuring data privacy and complete ownership of all generated assets. The platform distinguishes itself through a node-based workflow system that enables the construction of reproducible and automated image generation pipelines. By chaining modular functional units into directed acyclic graphs, users can automate intricate production tasks
Enables construction of custom generation pipelines by connecting modular processing nodes.
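Chaining modular nodes into a directed acyclic graph and running them in dependency order can be sketched with Python's standard `graphlib` module (Python 3.9+). The node names, the `STEPS` and `DEPS` tables, and the `run_dag` helper are hypothetical; a real workflow engine would add typed inputs, caching, and error handling.

```python
from graphlib import TopologicalSorter

# Each node is a function that receives a dict of its upstream results.
STEPS = {
    "load": lambda inputs: "  Hello World ",
    "clean": lambda inputs: inputs["load"].strip().lower(),
    "tokens": lambda inputs: inputs["clean"].split(),
    "count": lambda inputs: len(inputs["tokens"]),
}

# Maps each node to the set of nodes it depends on.
DEPS = {
    "load": set(),
    "clean": {"load"},
    "tokens": {"clean"},
    "count": {"tokens"},
}


def run_dag(steps, deps):
    """Run every node after its dependencies; returns all results by node name."""
    results = {}
    # static_order() yields nodes so that dependencies always come first.
    for name in TopologicalSorter(deps).static_order():
        inputs = {dep: results[dep] for dep in deps.get(name, ())}
        results[name] = steps[name](inputs)
    return results
```

Since the wiring lives in `DEPS` rather than in the node functions, a node can be replaced or rerouted without editing the others, which is what makes such pipelines reproducible and easy to rearrange.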
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
Orchestrates modular processing steps into automated sequences for LLM-based agentic tasks.
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Provides reactive streams for monitoring and responding to local document modifications in real time.
Gum is a toolkit for building interactive, visually styled command-line interfaces and prompts directly within shell scripts. It functions as a library of modular components that allow developers to enhance terminal workflows by adding structured layouts, formatted text, and user-input widgets to standard command-line operations. The project distinguishes itself by providing a suite of specialized utilities for common shell tasks, such as fuzzy-matched selection menus, interactive file system navigation, and confirmation dialogs. It translates high-level styling and layout instructions into t
Processes and displays text using templates to inject dynamic values into command-line output.
Forem is an open-source platform designed for building and managing technical communities. It functions as a social publishing engine that enables members to share long-form content, participate in threaded discussions, and engage through social interactions. The platform provides tools for organizations to maintain branded profiles, host community hackathons, and facilitate collaborative learning through structured educational tracks. Beyond its social features, Forem integrates advanced capabilities for AI agent workflow orchestration and codebase knowledge graphing. It allows developers to
Identifies which architectural components are affected by code modifications to assist in impact assessment.
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Adjusts data processing configurations in real time without requiring a service restart to apply changes to the active pipeline.