3 repositories
Systems designed to move massive volumes of structured and unstructured data between diverse databases and cloud storage.
Distinct from Large-Scale Data Computation: Focuses on the integration and movement of diverse data at scale, rather than just computation or storage management.
Explore 3 GitHub repositories matching Data & Databases · Large Scale Data Integration Frameworks.
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Functions as a distributed framework for synchronizing massive volumes of data between heterogeneous sources and destinations.
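The reader/writer decoupling described for DataX can be illustrated with a minimal sketch. This is not DataX's actual API (DataX itself is written in Java); the names `Column`, `ListReader`, `DictWriter`, and `run_job` are hypothetical and only show how a shared internal record format lets any reader be paired with any writer.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Protocol


@dataclass
class Column:
    """One value in the engine-neutral internal representation (hypothetical)."""
    type: str   # simplified type tag, e.g. "long" or "string"
    value: Any


Record = List[Column]


class Reader(Protocol):
    def read(self) -> Iterator[Record]: ...


class Writer(Protocol):
    def write(self, records: Iterable[Record]) -> int: ...


class ListReader:
    """Hypothetical source plugin: converts Python rows into internal records."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        for row in self.rows:
            yield [Column("long" if isinstance(v, int) else "string", v) for v in row]


class DictWriter:
    """Hypothetical sink plugin: maps internal records onto keyed documents."""

    def __init__(self, field_names):
        self.field_names = field_names
        self.stored = []

    def write(self, records):
        count = 0
        for record in records:
            self.stored.append({n: c.value for n, c in zip(self.field_names, record)})
            count += 1
        return count


def run_job(reader: Reader, writer: Writer) -> int:
    """The core only handles internal records, never source or sink specifics."""
    return writer.write(reader.read())
```

Because `run_job` depends only on the two protocols, swapping `ListReader` for a database-backed reader would not require any change to the writer side.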
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Moves massive volumes of structured and unstructured data between diverse databases, cloud storage, and messaging systems.
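As a rough illustration of what change data capture means in the SeaTunnel description, the sketch below replays a stream of incremental changes against a target table held in memory. It does not use SeaTunnel's connectors or configuration format; `ChangeEvent` and `apply_changes` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """A single row-level change captured from a source database (hypothetical shape)."""
    op: str                              # "insert", "update", or "delete"
    key: Any                             # primary key of the affected row
    row: Optional[Dict[str, Any]] = None # new row contents; unused for deletes


def apply_changes(target: Dict[Any, Dict[str, Any]], events: Iterable[ChangeEvent]) -> None:
    """Apply events in order so the target converges on the source's state."""
    for event in events:
        if event.op in ("insert", "update"):
            target[event.key] = dict(event.row or {})
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
```

Only the changed rows travel through the pipeline, which is what distinguishes this approach from re-copying the full table on every run.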
DotnetSpider is a .NET web crawling framework and C# data extraction tool designed for automated discovery of web pages and retrieval of structured data from the internet at scale. It works as a high-level web scraping library for collecting information from a variety of websites. The framework provides capabilities for automated web crawling and large-scale web scraping. It supports web content extraction to help build local databases or analyze online information through programmatic web automation in the .NET ecosystem. The system uses a pipeline-based data processing model with asynchronous request handling and concurrent worker execution. It features a task-queue-based scheduler, modular storage providers, and an interface-based implementation for custom scraping logic.
Simplifies the collection of large datasets by extracting specific data points from web pages through a structured process.
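The queue-based scheduler with asynchronous requests and concurrent workers described for DotnetSpider can be sketched as follows. This is a Python illustration of the general pattern, not DotnetSpider's C# API; `FAKE_SITE`, `fake_fetch`, and `crawl` are hypothetical, and the in-memory site stands in for real HTTP requests so the sketch runs offline.

```python
import asyncio
import re
from typing import Dict, List

# Hypothetical in-memory site used in place of the network.
FAKE_SITE: Dict[str, str] = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <h1>Page A</h1>',
    "/b": "<h1>Page B</h1>",
}


async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, "")


async def crawl(start: str, fetch, workers: int = 3) -> Dict[str, List[str]]:
    queue: asyncio.Queue = asyncio.Queue()  # scheduler: pending URLs
    seen = {start}                          # avoids scheduling a URL twice
    results: Dict[str, List[str]] = {}      # storage stage: headings per page
    queue.put_nowait(start)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                html = await fetch(url)
                # Extraction stage: collect headings as the structured data.
                results[url] = re.findall(r"<h1>(.*?)</h1>", html)
                # Discovery stage: enqueue links not seen before.
                for link in re.findall(r'href="(.*?)"', html):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()   # wait until every scheduled URL has been processed
    for task in tasks:
        task.cancel()    # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

A call such as `asyncio.run(crawl("/", fake_fetch))` visits each page once, regardless of how many pages link to it.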