3 repositories
Systems designed to move massive volumes of structured and unstructured data between diverse databases and cloud storage.
Distinct from Large-Scale Data Computation: Focuses on the integration and movement of diverse data at scale, rather than just computation or storage management.
3 GitHub repositories in Data & Databases · Large Scale Data Integration Frameworks.
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Functions as a distributed framework for synchronizing massive volumes of data between heterogeneous sources and destinations.
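The reader/writer decoupling described for DataX can be illustrated with a minimal sketch. It is written in Python for brevity and is not DataX's actual API: the names `Column`, `Reader`, `Writer`, `ListReader`, `DictWriter`, and `run_job` are all hypothetical. The point it shows is that a source plugin and a destination plugin only need to agree on a neutral record format, so any reader can be paired with any writer.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class Column:
    """Hypothetical neutral column: a value tagged with a portable type name."""
    type: str
    value: Any


Record = List[Column]


class Reader(Protocol):
    def read(self) -> Iterator[Record]: ...


class Writer(Protocol):
    def write(self, records: Iterable[Record]) -> int: ...


class ListReader:
    """Toy source plugin: converts native Python rows into neutral records."""

    def __init__(self, rows: List[tuple]):
        self.rows = rows

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            yield [
                Column("long" if isinstance(v, int) else "string", v)
                for v in row
            ]


class DictWriter:
    """Toy destination plugin: stores neutral records as dictionaries."""

    def __init__(self, field_names: List[str]):
        self.field_names = field_names
        self.rows: List[Dict[str, Any]] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.rows.append(
                {name: col.value for name, col in zip(self.field_names, record)}
            )
            count += 1
        return count


def run_job(reader: Reader, writer: Writer) -> int:
    """The engine sees only the two interfaces, never a concrete store."""
    return writer.write(reader.read())
```

With this shape, supporting a new storage engine means writing one reader or one writer against the neutral record type, rather than one converter per source/destination pair.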
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Moves massive volumes of structured and unstructured data between diverse databases, cloud storage, and messaging systems.
DotnetSpider is a .NET web crawling framework and C# data extraction tool designed for automated web page discovery and structured data retrieval from the Internet at scale. It works as a high-level web scraping library for collecting information from a variety of websites. Extracted web content can be used to build local databases or to analyze online information through programmatic web automation within the .NET ecosystem. The system uses a pipeline-based data processing model with asynchronous request handling and concurrent worker execution. It provides a scheduler built on a task queue, modular storage providers, and an interface-driven implementation for custom scraping logic.
Simplifies the collection of large datasets by extracting specific data points from web pages through a structured process.
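The crawl model described for DotnetSpider (a task-queue scheduler, asynchronous requests, concurrent workers, and a storage step) can be sketched in a few lines. This is an illustrative Python sketch, not DotnetSpider's C# API: `PAGES`, `fetch`, `worker`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests and a real storage provider.

```python
import asyncio

# Hypothetical in-memory "site": url -> (title, outgoing links).
PAGES = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", []),
}


async def fetch(url):
    """Stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(0)
    return PAGES[url]


async def worker(queue, seen, storage):
    """Pull URLs from the queue, store the result, enqueue unseen links."""
    while True:
        url = await queue.get()
        try:
            title, links = await fetch(url)
            storage[url] = title  # storage step of the pipeline
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            queue.task_done()


async def crawl(start, workers=3):
    queue = asyncio.Queue()
    seen = {start}
    storage = {}
    queue.put_nowait(start)
    tasks = [
        asyncio.create_task(worker(queue, seen, storage))
        for _ in range(workers)
    ]
    await queue.join()  # returns once every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return storage
```

Calling `asyncio.run(crawl("/"))` visits each page once, because the `seen` set deduplicates links before they reach the queue.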