Data Processing Frameworks

Software libraries and platforms providing structured environments for parsing, transforming, and managing data flows.

This list covers 65 GitHub repositories in Data & Databases · Data Processing Frameworks.

mendableai/firecrawl
139,399
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Transforms unstructured web pages and documents into standardized, machine-readable formats using natural language prompts.
TypeScript
opendatalab/MinerU
67,734
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recogn
Transforms unstructured document content into standardized, machine-readable formats for automated information retrieval.
Python · ai4science · document-analysis · extract-data
BurntSushi/ripgrep
65,112
ripgrep is a command-line utility designed for searching through large file trees and source code repositories. It functions as a recursive text processor that traverses directories to locate and display matching patterns, serving as a high-performance alternative to traditional search tools. The tool distinguishes itself through a focus on execution speed and intelligent file handling. It utilizes a finite automata-based regular expression engine to ensure linear time complexity and employs hardware-level acceleration for literal byte sequence scanning. By integrating with version control sy
Buffers sequential data in large chunks to maintain high performance during extensive file system reads.
Rust · cli · command-line · command-line-tool
pathwaycom/pathway
62,959
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Executes high-performance data transformations using a unified engine capable of managing both batch and streaming sources.
Python · batch-processing · data-analytics · data-pipelines
docling-project/docling
61,674
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Normalizes diverse input formats into a consistent internal data model to enable uniform processing across different sources.
Python · ai · convert · document-parser
FFmpeg/FFmpeg
61,176
FFmpeg is a cross-platform multimedia framework designed for the recording, conversion, and streaming of audio and video content. It functions as a comprehensive toolkit that provides both a command-line utility for direct media manipulation and a collection of low-level libraries for integration into custom applications. At its core, the project utilizes a packet-based stream engine and a format-agnostic abstraction layer to handle diverse media standards, containers, and network protocols. The framework distinguishes itself through a modular, graph-based filter execution model that allows f
Constructs non-linear processing pipelines that support multiple inputs and outputs to perform advanced tasks like video overlaying or audio mixing.
C · audio · c · ffmpeg
pathwaycom/llm-app
59,341
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Delivers a high-performance environment designed for large-scale data ingestion and complex transformation tasks.
Jupyter Notebook · chatbot · hugging-face · llm
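The idea of propagating only changes, rather than recomputing an entire dataset, can be illustrated with a toy per-key sum. This is a minimal sketch in plain Python and not Pathway's API or engine; the `IncrementalSum` class and the `(key, value, diff)` update format are assumptions made purely for illustration.

```python
from collections import defaultdict


class IncrementalSum:
    """Toy per-key sum that is updated from changes, not recomputed from scratch."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, updates):
        """Apply (key, value, diff) updates; diff is +1 for an insert, -1 for a retraction.

        Returns the keys touched by this batch, mapped to their new totals.
        """
        touched = {}
        for key, value, diff in updates:
            self.totals[key] += value * diff
            touched[key] = self.totals[key]
        return touched


agg = IncrementalSum()
first = agg.apply([("a", 5, +1), ("b", 3, +1)])
# Replace a's row: retract the old value 5, insert the new value 7.
second = agg.apply([("a", 5, -1), ("a", 7, +1)])
```

Only key `a` is touched by the second batch, so downstream consumers in this sketch would receive one update instead of a full recomputation.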
WerWolv/ImHex
53,892
ImHex is a professional-grade hex editor and binary data analysis platform designed for inspecting, modifying, and reverse engineering raw file contents. It functions as a schema-driven engine that interprets complex binary structures by applying custom definitions to map and visualize byte-level data. The platform distinguishes itself through a dedicated domain-specific language that allows users to define structural schemas for automated file parsing. This capability is supported by a dynamic plugin architecture and an event-driven registry, which enable the integration of external modules
Maps complex binary structures to human-readable fields by applying custom schema definitions to raw file contents.
C++ · analyzer · binary-analysis · c-plus-plus
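Mapping raw bytes onto named fields through a schema can be sketched with Python's standard `struct` module. This does not use ImHex's own pattern language; the 8-byte header layout, the `HEADER_FORMAT` string, and the `Header` tuple are hypothetical and exist only to show the general technique.

```python
import struct
from collections import namedtuple

# Hypothetical 8-byte header: 4-byte magic, little-endian uint16 version,
# little-endian uint16 entry count.
HEADER_FORMAT = "<4sHH"
Header = namedtuple("Header", ["magic", "version", "entry_count"])


def parse_header(raw: bytes) -> Header:
    """Map the leading bytes of a buffer onto named fields."""
    size = struct.calcsize(HEADER_FORMAT)
    return Header(*struct.unpack(HEADER_FORMAT, raw[:size]))


sample = b"DEMO" + (2).to_bytes(2, "little") + (3).to_bytes(2, "little") + b"payload"
header = parse_header(sample)
```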
resin-io/etcher
33,874
Etcher is a disk image writer and operating system flashing tool used to create bootable USB drives and SD cards. It transfers binary system images to physical external media, enabling computers or microcontrollers to boot from the prepared storage. The application includes system drive protection to prevent the accidental erasure of internal hard drives by filtering available storage devices based on metadata. It also performs data verification by comparing written bytes against the source image to ensure no corruption occurred during the flashing process.
Implements buffered stream processing to handle large system images without exhausting system memory.
TypeScript
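Buffered copying followed by verification can be sketched as below. This is an illustrative Python sketch, not Etcher's implementation: it writes to an ordinary temporary file instead of a block device, it verifies by comparing SHA-256 digests where the description mentions comparing written bytes against the source, and the 1 MiB chunk size is an arbitrary assumption.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src to dst one chunk at a time and return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def file_digest(path, chunk_size=CHUNK_SIZE):
    """Re-read a file in chunks and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "image.bin")
    target = os.path.join(tmp, "target.bin")
    with open(image, "wb") as f:
        f.write(os.urandom(3 * 1024 * 1024 + 17))  # size is not a multiple of the chunk
    source_hash = copy_in_chunks(image, target)
    verified = file_digest(target) == source_hash
```

At most one chunk is held in memory at a time, so peak memory use stays near `CHUNK_SIZE` regardless of the image size.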
apache/flink
26,086
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Provides a unified runtime that executes both unbounded streaming and bounded batch workloads with consistent semantics.
Java
terrastruct/d2
23,083
This project is a diagram-as-code tool that transforms declarative text scripts into professional visual representations. It functions as a technical documentation generator, allowing users to define nodes, connections, and hierarchical relationships through a domain-specific modeling language that integrates directly into version-controlled developer workflows. The tool distinguishes itself through a highly modular architecture that decouples diagram definitions from spatial positioning. It features a pluggable layout engine that supports multiple arrangement algorithms, alongside a styling
Normalizes input scripts into a unified intermediate graph representation to facilitate consistent cross-format rendering.
Go · developer-tools · diagramming · diagrams
Vonng/ddia
22,648
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Orchestrates data movement using unified engines for both batch and stream processing models.
Python · book · database · ddia
vectordotdev/vector
22,071
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Provides exactly-once processing semantics to ensure data integrity during retries and system failures.
Rust · events · forwarder · hacktoberfest
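Backpressure-aware data movement can be illustrated with a bounded in-memory queue: when the buffer is full, the producer waits instead of dropping events. This is a generic Python sketch, not Vector's implementation; it uses an in-memory buffer, not a disk-backed one, and the buffer size of 4 is an arbitrary assumption.

```python
import queue
import threading

buffer = queue.Queue(maxsize=4)  # small bound so the producer is forced to wait
received = []
SENTINEL = object()  # marks the end of the stream


def producer(n_events):
    for i in range(n_events):
        buffer.put(i)  # blocks while the buffer is full instead of dropping events
    buffer.put(SENTINEL)


def consumer():
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        received.append(event)


threads = [
    threading.Thread(target=producer, args=(100,)),
    threading.Thread(target=consumer),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With one producer and one consumer, every event arrives in order even though the buffer never holds more than four items.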
huggingface/datasets
21,643
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Acts as a core framework for applying parallel transformations and filtering to massive data collections.
Python · ai · artificial-intelligence · computer-vision
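Lazy iterative streaming can be sketched with plain Python generators, where rows are produced only when a downstream step asks for them. This sketch does not use the `datasets` library; `read_records`, `lazy_pipeline`, and the `produced` list are hypothetical names introduced for illustration.

```python
from itertools import islice

produced = []  # records which rows the source actually generated


def read_records(limit):
    """Hypothetical source that yields one record at a time."""
    for i in range(limit):
        produced.append(i)
        yield {"id": i, "text": f"example {i}"}


def lazy_pipeline(records):
    """Filter, then transform, without materializing the full collection."""
    even = (r for r in records if r["id"] % 2 == 0)
    return ({**r, "length": len(r["text"])} for r in even)


# Ask for three results from a nominally million-row source.
first_three = list(islice(lazy_pipeline(read_records(1_000_000)), 3))
```

Only the first five source rows are generated here, because the pipeline stops pulling from the source as soon as three results have been collected.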
apache/incubator-mxnet
20,812
Apache MXNet is a deep learning framework and distributed machine learning library designed for training and deploying neural networks across distributed systems, mobile devices, and hardware accelerators. It functions as a cross-platform runtime and a dynamic dataflow scheduler that optimizes neural network execution. The framework provides a multi-language API, enabling the development of machine learning models using Python, R, Julia, Scala, Go, and JavaScript. It supports high-performance model training and the scaling of workloads across multiple GPUs and machines. The system covers cap
Tracks changes to data buffers within the execution graph to incrementally propagate updates and optimize memory efficiency.
C++
cube-js/cube
20,251
Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools. The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
Merges historical warehouse data with real-time streams using pre-aggregations for unified analytical views.
Rust · agentic-analytics · agents · ai
nats-io/nats-server
20,076
NATS Server is a high-performance, lightweight messaging system designed for cloud-native applications, edge computing, and distributed microservices. It functions as a distributed publish-subscribe broker that routes messages using hierarchical, dot-separated subject strings, enabling decoupled communication between services without requiring centralized broker lookups. The system supports core messaging patterns including asynchronous publish-subscribe, request-reply, and load-balanced queue processing. The platform distinguishes itself through a decentralized architecture that eliminates t
Combines message deduplication with synchronous acknowledgment verification to ensure messages are processed exactly once without loss or duplication.
Go · cloud · cloud-computing · cloud-native
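Routing on hierarchical, dot-separated subjects can be sketched as a token-by-token match. The listing above does not mention wildcards; NATS subscriptions support `*` for exactly one token and `>` for one or more trailing tokens, and the sketch below follows those rules. The function `subject_matches` is a hypothetical helper written for illustration, not the server's matching code.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a dot-separated subject matches a subscription pattern.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last pattern token and needs at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

For example, under these rules `orders.*.created` matches `orders.eu.created`, while `orders.*` does not, because `*` covers a single token only.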
alibaba/DataX
17,241
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Employs internal data models that normalize diverse input formats into a consistent structure for uniform processing across different storage engines.
Java
heibaiying/BigData-Notes
16,912
BigData-Notes is a big data learning resource and data engineering knowledge base. It provides a collection of guides, technical references, and documentation focused on the installation and configuration of distributed data processing technologies. The project covers a learning path for distributed systems, including the setup of large-scale data storage and computing clusters. It specifically addresses both batch and stream processing workflows and the implementation of data APIs for interacting with distributed messaging and storage systems. The materials are organized using markdown-base
Documents the use of unified engines for processing both historical batch data and live data streams.
Java · azkaban · big-data · bigdata
emqx/emqx
16,422
This project is a high-performance MQTT broker and IoT data platform designed to manage millions of concurrent device connections. It provides a scalable infrastructure for ingesting, processing, and routing telemetry data across distributed systems, utilizing an actor-based concurrency model to maintain high availability and state synchronization across cluster nodes. The platform distinguishes itself through integrated stream processing and edge computing capabilities. It allows users to execute declarative SQL-based rules directly against incoming message streams for real-time filtering, t
Filters, aggregates, and transforms data streams using SQL-based rules before forwarding them to external systems.
Erlang · aiot · broker · coap