14 repositories
High-performance utilities designed for importing massive datasets into database clusters.
Distinguishing note: Focuses on high-volume, large-scale ingestion performance, distinct from general-purpose data import.
Explore 14 GitHub repositories in Data & Databases · Bulk Data Loading.
TiDB is a horizontally scalable, distributed SQL database designed to provide consistent transactional storage and high-performance analytical processing within a single unified architecture. It utilizes a decoupled compute-storage design and a distributed key-value storage layer to ensure horizontal scalability and efficient range-based queries. By employing a consensus-based replication algorithm, the system maintains high availability and automatic failover across multiple nodes and geographical regions. The platform distinguishes itself through its hybrid transactional and analytical proc
TiDB loads high volumes of data into database clusters from various file formats to support rapid data ingestion and large-scale migration projects.
This project is a feature-rich Go client library designed for interacting with Redis. It serves as a comprehensive interface for managing remote data stores, enabling developers to execute standard database commands, handle complex data structures, and perform asynchronous operations within Go applications. The library distinguishes itself through its support for advanced Redis capabilities, including connection pooling, pipelining, and transactional integrity. It provides specialized primitives for managing distributed clusters, including automated topology updates and request routing to sha
Executes bulk command sequences to efficiently populate or update database entries.
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Imports data from local files or cloud storage into database tables with schema validation.
Cayley is a graph database engine designed for storing and querying interconnected data using a quad-based data model. It functions as an RDF quad store, managing information through subjects, predicates, objects, and labels. The system features a modular graph store architecture with pluggable backends, allowing it to swap between in-memory storage and various external persistent databases. It includes a GraphQL-inspired API and a dedicated data visualizer for the interactive exploration of nodes and edges. Query capabilities cover bidirectional path traversal and multi-syntax execution usi
Provides high-performance utilities for batch importing massive datasets into the graph store.
YugabyteDB is a distributed SQL database and relational data store designed for horizontal scalability and high availability across multiple nodes or regions. It functions as a cloud-native system that ensures continuous availability and supports PostgreSQL compatible query languages and drivers. The system includes specialized capabilities as a vector database for AI, utilizing high-dimensional indexing to perform similarity searches. It is engineered as a multi-region cloud database that synchronizes data across different geographic locations to maintain global availability. The project co
Includes high-performance utilities for bulk loading massive datasets into the database cluster.
Redis is a high-performance in-memory key-value store that functions as a distributed cache, message broker, and NoSQL database. It provides sub-millisecond read and write access to data stored in RAM and can operate as a vector database for indexing high-dimensional embeddings. The system supports a wide range of data storage and synchronization primitives, including the management of strings, hashes, lists, sets, and JSON documents. It enables real-time data operations through atomic transactions, hybrid persistence using snapshots and append-only logs, and high-availability configurations
Uses specialized serialization protocols to stream massive datasets into the store with minimal latency.
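As an illustration of what streaming commands in a serialization protocol can look like, the sketch below encodes commands in RESP, the Redis wire format, as arrays of bulk strings. It assumes the "specialized serialization protocols" mentioned above refer to RESP; the function names `encode_command` and `encode_stream` are hypothetical and not part of Redis or any client library.

```python
from typing import Iterable, Sequence, Union

Arg = Union[str, bytes, int]


def encode_command(args: Sequence[Arg]) -> bytes:
    """Encode one command as a RESP array of bulk strings (hypothetical helper)."""
    parts = [b"*%d\r\n" % len(args)]  # array header: number of arguments
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        # Bulk string header carries the length in bytes, not characters.
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def encode_stream(commands: Iterable[Sequence[Arg]]) -> bytes:
    """Concatenate many encoded commands into one payload for a single write."""
    return b"".join(encode_command(command) for command in commands)
```

A payload built this way could be written to a socket or a file in one pass, which avoids a network round trip per command.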
pq is a PostgreSQL driver for Go that implements the standard database/sql interface. It serves as a connection library and protocol implementation that translates application data types into the binary and text formats required by PostgreSQL. The project provides specialized utilities for high-performance data ingestion using bulk data loading and a dedicated bulk data importer. It also features an implementation for listening to asynchronous server notifications and provides tools for connection load balancing across multiple hosts and ports. The driver covers a broad surface of database i
Ships high-performance bulk loading capabilities to stream multiple rows into tables with minimal overhead.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Provides high-performance utilities for importing massive historical datasets and static files from cloud storage.
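Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between different source and target databases. It acts as a metadata-driven orchestrator, using a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system is characterized by its distributed data processing engine, which executes workloads across a cluster of server nodes to increase throughput. It uses a plugin-based architecture that allows the platform to be extended through external JAR files to provide connectivity to various databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure transformation. It provides tools for data quality validation, pipeline automation, and job lifecycle management, along with monitoring utilities for tracking server health and real-time execution status.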
Provides high-performance utilities for efficiently transferring large volumes of records into target databases.
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Migrates data from various database and file formats into PostgreSQL using the COPY command.
GraphQL-Ruby is a Ruby library for building GraphQL APIs with a strongly typed schema and a dedicated query execution engine. It provides a comprehensive framework for mapping application objects to a formal type system, enabling structured data fetching through defined resolvers. The project distinguishes itself with advanced performance and delivery mechanisms, including a data loader for batching and caching to prevent N+1 query patterns. It supports high-performance data delivery through incremental response streaming, deferred query responses, and parallel data fetching using fibers. Add
Collects multiple data requirements across the execution tree to fetch them in bulk and eliminate redundant requests.
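The batching idea described above can be sketched independently of any library: collect keys first, then fetch them in one call and cache the results. This is a minimal Python illustration, not GraphQL-Ruby's API (which is Ruby and fiber-based); `BatchLoader`, `request`, and `resolve` are hypothetical names.

```python
from typing import Callable, Dict, Hashable, List


class BatchLoader:
    """Hypothetical loader: queue keys, then fetch them all in one batch."""

    def __init__(self, batch_fn: Callable[[List[Hashable]], Dict[Hashable, object]]):
        self._batch_fn = batch_fn  # receives a list of keys, returns {key: value}
        self._pending: List[Hashable] = []
        self._cache: Dict[Hashable, object] = {}

    def request(self, key: Hashable) -> None:
        """Register interest in a key without fetching it yet."""
        if key not in self._cache and key not in self._pending:
            self._pending.append(key)

    def resolve(self, key: Hashable) -> object:
        """Return the value for a key, flushing all queued keys in one batch."""
        if key not in self._cache:
            self.request(key)
            self._flush()
        return self._cache[key]

    def _flush(self) -> None:
        if not self._pending:
            return
        keys, self._pending = self._pending, []
        results = self._batch_fn(keys)
        for key in keys:
            # Keys missing from the batch result are cached as None.
            self._cache[key] = results.get(key)
```

With this shape, resolving N sibling fields that each need one record costs a single call to `batch_fn` instead of N separate lookups.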
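Ignite is a distributed in-memory data grid and compute platform. It serves as a distributed SQL database and storage engine designed to store and process large datasets in RAM to minimize latency and increase computation speed. The system is distinguished by its multi-tier storage engine, which manages data placement across memory and disk to balance high-speed access with high-capacity storage. It features a distributed compute grid that executes custom logic directly on the nodes where the data resides, reducing network traffic. The platform offers a broad set of features, including ACID transaction management, standard SQL queries, and key-value operations. It supports high-volume data ingestion through reactive streams and offers integration through multiple programming languages, standard database drivers, and a REST API. The system can be deployed as a distributed cluster in containers or orchestrated through Kubernetes. The project is written in Java and can be installed from binary archives.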
Implements high-performance utilities for importing massive datasets using reactive streams and backpressure.
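Backpressure means a fast producer is slowed to the pace of the consumer instead of filling memory. The sketch below approximates that with a bounded queue in Python; it is not Ignite's API and not a Reactive Streams implementation (which signals demand explicitly). The name `stream_with_backpressure` and its parameters are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


def stream_with_backpressure(
    records: Iterable[object],
    sink: Callable[[object], None],
    buffer_size: int = 4,
) -> int:
    """Feed records to sink through a bounded buffer; return how many were sent."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)
    errors = []

    def consume() -> None:
        while True:
            item = buf.get()
            if item is _DONE:
                return
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                sink(item)
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consume)
    worker.start()
    sent = 0
    try:
        for record in records:
            buf.put(record)  # blocks while the buffer is full: this is the backpressure
            sent += 1
    finally:
        buf.put(_DONE)
        worker.join()
    if errors:
        raise errors[0]
    return sent
```

The buffer size bounds memory use: at most `buffer_size` records are in flight regardless of how large the input is.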
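This project is a SQL data access layer and schema generator that lets you read and write records in relational databases by treating tables as simple data structures. It acts as an automatic schema generator, creating database tables and columns on the fly based on the structure of incoming data. The tool provides a high-performance bulk loader that imports large datasets using grouped atomic transactions to ensure data consistency. It also includes a record upsert mechanism that decides whether to update an existing row or insert a new one based on unique identifiers. The system covers dynamic schema management, including implicit column resolution and table configuration. It also provides a collection-based query interface for retrieving records or extracting distinct values without writing manual queries.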
Efficiently imports large sets of records into a database using bulk loading and transaction support.
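Chunked transactions combined with an upsert can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions, not the project's own API: the `items` table, its `id` and `name` columns, and the function names are hypothetical, and the table is created by the caller rather than generated on the fly.

```python
import sqlite3
from typing import Dict, Iterable, List


def _write_chunk(conn: sqlite3.Connection, chunk: List[Dict[str, object]]) -> int:
    # "with conn" commits on success and rolls back the whole chunk on error.
    with conn:
        for row in chunk:
            cur = conn.execute(
                "UPDATE items SET name = ? WHERE id = ?", (row["name"], row["id"])
            )
            if cur.rowcount == 0:  # no existing row matched the id, so insert
                conn.execute(
                    "INSERT INTO items (id, name) VALUES (?, ?)",
                    (row["id"], row["name"]),
                )
    return len(chunk)


def upsert_many(
    conn: sqlite3.Connection,
    rows: Iterable[Dict[str, object]],
    chunk_size: int = 1000,
) -> int:
    """Upsert rows in groups, one atomic transaction per group."""
    written = 0
    chunk: List[Dict[str, object]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            written += _write_chunk(conn, chunk)
            chunk = []
    if chunk:  # flush the final partial group
        written += _write_chunk(conn, chunk)
    return written
```

Grouping rows per transaction trades a little atomicity granularity for far fewer commits, which is usually where bulk-load time goes.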
linq2db is a type-safe object-relational mapper that translates LINQ expressions into optimized SQL queries for multiple database providers. It functions as a database mapper that links classes to tables and includes a SQL query builder and a command-line schema tool for generating data classes from existing databases. The project provides high-performance bulk data processing for inserting and loading large volumes of records via batch or binary copy methods. It also supports advanced SQL operations, including window functions, common table expressions for recursive hierarchical querying, an
Provides high-performance utilities for importing massive datasets from external sources into database tables.