100 repositories
Practices and technologies for building and maintaining systems that process and store large-scale data.
Distinguishing note: Focuses on the end-to-end engineering of data systems in cloud environments.
Explore 100 awesome GitHub repositories matching data & databases · Data Engineering.
D3 is a modular library providing low-level primitives for creating data-driven visualizations. It functions as a flexible framework that allows for direct control over visual presentation by mapping abstract data dimensions to graphical properties, such as position, color, and size, without imposing predefined chart abstractions. The library distinguishes itself by offering specialized tools for complex data representation, including algorithmic layouts for hierarchical structures and geographic projection utilities for mapping spherical coordinates. It also includes a comprehensive suite fo
Modular components like scales, axes, and shapes enable the construction of dynamic and interactive data visualizations.
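The idea of mapping a data dimension to a graphical property such as position can be sketched in a few lines. This is a minimal illustration in Python, not D3's own API: the function name `linear_scale` and its parameters are hypothetical, chosen only to show how a value in a data domain might be interpolated into a pixel range.

```python
def linear_scale(domain, range_):
    """Return a function mapping values in `domain` linearly onto `range_`.

    Hypothetical helper for illustration; it is not part of D3.
    """
    d0, d1 = domain
    r0, r1 = range_
    if d0 == d1:
        raise ValueError("domain must span a non-zero interval")

    def scale(value):
        # Normalize the value to [0, 1] within the domain, then stretch to the range.
        t = (value - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


# Map data values between 0 and 100 onto a 500-pixel-wide axis.
x = linear_scale((0, 100), (0, 500))
x_position = x(50)
```

Keeping the scale as a standalone function is what lets the same mapping drive an axis, a shape position, or a color ramp without committing to a particular chart type.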
This project is an enterprise-grade Java framework designed for building scalable, full-stack e-commerce applications. It provides a comprehensive foundation for microservice-based distributed architectures, enabling the development of complex retail platforms that include product management, order processing, and secure user authentication. By leveraging modular service patterns and centralized API gateways, the framework supports the construction of resilient systems that decompose monolithic business logic into independent, manageable services. The platform distinguishes itself through a r
Provides tools for filtering, searching, and analyzing machine-generated logs to identify system errors.
This project is a community-maintained, open-access directory of high-quality public datasets. It serves as a centralized reference point for researchers, developers, and data scientists to locate reliable information sources across a wide spectrum of industries and scientific fields. By providing a structured index, the repository facilitates the discovery of data necessary for exploratory analysis, machine learning model training, and the development of data-intensive applications. The directory distinguishes itself through a lightweight, platform-agnostic approach to resource indexing that
Aggregates high-quality, open-access datasets to help developers populate prototypes and test data-intensive applications.
Apache ECharts is a JavaScript data visualization library used for rendering interactive charts and complex data visualizations in web browsers. It functions as a canvas-based charting engine and a statistical data visualization suite that transforms datasets into visual representations. The framework provides specialized capabilities for three-dimensional data visualization, including the generation of 3D plots and globe visualizations. It also serves as a web-based geographic mapping tool for overlaying heatmaps, routes, and data distributions onto interactive maps. The library covers a br
Provides a comprehensive library for rendering interactive charts and complex data visualizations in web browsers.
ECharts is a JavaScript data visualization library and web charting framework used to render interactive 2D and 3D data plots within a web browser. It functions as a visualization engine that transforms raw data into customizable charts and graphs. The project includes a WebGL-based hardware acceleration engine specifically for producing three-dimensional plots and globe visualizations. This allows the library to handle large and complex datasets through GPU-accelerated rendering. The framework supports both canvas-based raster rendering and SVG-based vector rendering. It provides capabiliti
Functions as a modular library for rendering interactive charts and complex data visualizations in the browser.
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Automates the generation and real-time updating of searchable vector data for artificial intelligence applications.
MinIO is a software-defined, cloud-native object storage server designed to manage large volumes of unstructured data. It functions as a distributed storage cluster that aggregates multiple independent nodes into a unified, scalable pool, providing a high-performance infrastructure compatible with standard cloud storage protocols and application programming interfaces. The system utilizes a shared-nothing architecture that eliminates central metadata servers, relying instead on a decentralized hash table to map objects across the cluster. Data availability and resilience are maintained throug
Functions as a software-defined storage layer optimized for containerized deployments and distributed architectures.
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification. By utilizing a modular architecture, the platform allows users to swap model components to balance inference speed and accuracy requirements for diverse applications. The framework distinguishes itself through its support for real-time processing and flexible deployment. It in
Refines raw image and video collections through automated annotation workflows to generate high-quality datasets for model training.
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Analyzes and transforms continuous real-time data streams for immediate insight and analytics.
This project is an open-source educational curriculum designed to provide comprehensive training in data engineering. It focuses on building scalable data pipelines and managing cloud-native infrastructure through a structured, self-paced program that combines technical explanations with hands-on practical exercises. The curriculum distinguishes itself by emphasizing industry-standard methodologies, specifically teaching students how to implement infrastructure as code and manage data workflows through orchestration tools. By utilizing container-based environment isolation and declarative con
Focuses on building scalable data pipelines and storage systems using modern cloud infrastructure.
fhevm is a full-stack blockchain framework designed to integrate Fully Homomorphic Encryption into smart contracts. It provides a platform for developing confidential smart contracts that can process encrypted data and execute private on-chain computations without decrypting the underlying information. The framework utilizes a coprocessor system to offload resource-intensive encrypted operations to an asynchronous service, improving blockchain performance and scalability. It incorporates a secure key management service based on multi-party computation and a zero-knowledge proof verifier to en
Implements an asynchronous coprocessor system to offload resource-intensive encrypted operations and maintain blockchain throughput.
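labelImg is a desktop image annotation tool and dataset preparation utility for creating labeled datasets for computer vision training. It provides a graphical interface for drawing bounding boxes around objects in images and assigning class labels to them, building ground-truth data for machine learning models. The software specifically supports the Pascal VOC XML annotation format, exporting image coordinates and class names as standard XML or text structures. It lets users load a predefined class list from a text file to standardize naming across a project. Beyond initial labeling, the tool also covers the image annotation workflow, including visualization of saved annotations and manual dataset verification. This includes the ability to mark images as verified or difficult in order to maintain dataset quality.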
Facilitates the preparation of image collections for computer vision through manual annotation and formatting.
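To make the Pascal VOC XML format concrete, here is a minimal sketch of reading bounding boxes from one annotation using only the Python standard library. The sample document follows the commonly used Pascal VOC layout, but the file name, class label, and coordinates are made up for illustration, and `read_boxes` is a hypothetical helper rather than anything shipped with labelImg.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout; all values are invented.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return one dict per annotated object: label, difficult flag, and box corners."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        corners = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append(
            {
                "label": obj.findtext("name"),
                "difficult": obj.findtext("difficult") == "1",
                "box": corners,
            }
        )
    return boxes


boxes = read_boxes(SAMPLE_XML)
```

Each returned entry holds the class label and the pixel corners of one box, which is typically what a training pipeline needs before converting to its own format.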
Deepface is a comprehensive deep learning library for facial recognition and demographic analysis. It provides a modular pipeline that handles the entire lifecycle of facial processing, including detection, geometric alignment, and the transformation of facial images into high-dimensional numerical vector embeddings for identity verification and similarity comparison. The library distinguishes itself through a model ensemble approach, which combines predictions from multiple pre-trained neural networks to improve classification accuracy and reduce bias. It also integrates advanced security fe
Handles asynchronous processing of data streams to support real-time facial analysis tasks.
Backtrader is a Python framework designed for the development, backtesting, and live execution of algorithmic trading strategies. It provides a comprehensive environment for quantitative finance, allowing users to simulate trading logic against historical market data or connect directly to brokerage platforms for automated real-time trading. The project distinguishes itself through a unified event-driven architecture that treats backtesting and live trading with the same API. This consistency is supported by a flexible data-feed abstraction layer that normalizes diverse financial sources, ena
Synchronizes data streams of varying granularities to evaluate long-term trends alongside short-term price movements.
Fx is a command-line processing suite designed for the transformation, conversion, exploration, and visualization of structured data. It functions as a terminal-based utility that handles both automated shell pipelines and interactive navigation of complex, nested data hierarchies. The tool distinguishes itself by integrating a JavaScript-based engine that executes user-provided logic to filter, map, or modify data fields within a sandboxed runtime. It maintains a responsive interface by decoupling data processing from the display loop, allowing users to explore large datasets through an inte
Converts raw text or log streams into structured formats to simplify searching and debugging.
NATS Server is a high-performance, lightweight messaging system designed for cloud-native applications, edge computing, and distributed microservices. It functions as a distributed publish-subscribe broker that routes messages using hierarchical, dot-separated subject strings, enabling decoupled communication between services without requiring centralized broker lookups. The system supports core messaging patterns including asynchronous publish-subscribe, request-reply, and load-balanced queue processing. The platform distinguishes itself through a decentralized architecture that eliminates t
Supports push and pull patterns for consuming persistent log data at scale.
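Routing on hierarchical, dot-separated subjects can be illustrated with a small matcher. This is a sketch of the matching rule only, not NATS Server code: it assumes the wildcard conventions from the NATS documentation (`*` matches exactly one token, `>` matches one or more trailing tokens), which the description above does not spell out, and `subject_matches` is a hypothetical name.

```python
def subject_matches(pattern, subject):
    """Return True if a dot-separated `subject` matches a subscription `pattern`.

    Assumed wildcard rules: '*' matches exactly one token; '>' must be the
    last token and matches one or more remaining tokens.
    """
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")

    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # Valid only as the final token, and at least one subject token must remain.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[i]:
            return False

    # Without a trailing '>', both sides must have the same number of tokens.
    return len(pattern_tokens) == len(subject_tokens)


exact = subject_matches("orders.eu.created", "orders.eu.created")
single = subject_matches("orders.*.created", "orders.eu.created")
tail = subject_matches("orders.>", "orders.eu.created")
```

Because matching depends only on the subject string, publishers and subscribers never need to know about each other, which is the decoupling the description refers to.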
Lila is an open-source chess server and multiplayer platform designed for playing, analyzing, and streaming games. It functions as a comprehensive environment for hosting competitive play and managing player profiles. The platform integrates a distributed chess engine interface to evaluate complex positions and a collaborative analysis board that allows multiple users to study and coordinate insights in real time. It also includes an online tournament platform for organizing competitive events, simultaneous exhibitions, and structured player leagues. The system maintains a searchable game da
Implements a distributed computing engine specialized for evaluating chess positions and calculating optimal moves in parallel.
Lila is a comprehensive, open-source chess gaming platform designed for real-time multiplayer interaction, competitive tournament management, and deep strategic analysis. It provides a global environment where users can engage in live matches, participate in structured competitions, and access extensive archives of historical game data for research and study. The platform distinguishes itself through a highly scalable architecture that utilizes actor-model concurrency and event-sourced game states to ensure precise match reconstruction and fault tolerance. It integrates distributed engine eva
Offloads computationally intensive move evaluations to a cluster of specialized servers for real-time tactical insights.
go-ipfs is an implementation of an IPFS node, providing a distributed filesystem and a content-addressable storage system. It enables the storage and retrieval of data based on unique cryptographic hashes rather than fixed network locations, allowing files to be shared across a peer-to-peer network without a central authority. The system utilizes a distributed hash table and a peer-to-peer gossip protocol to route requests and propagate network state and metadata. It organizes data using a Merkle DAG structure to support efficient deduplication and versioning of content. Capabilities include
Mounts remote content-addressed storage as local directories for seamless file access.
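Content addressing, where data is stored and retrieved by its cryptographic hash rather than by location, can be sketched with an in-memory store. This is a deliberately simplified illustration: it uses a plain SHA-256 hex digest as the key, whereas real IPFS nodes use their own content identifier format and split files into linked blocks. The class `ContentStore` is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA-256 digest of the data."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so it is stored once.
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # The key doubles as an integrity check on whatever is returned.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored block does not match its address")
        return data

    def __len__(self):
        return len(self._blocks)


store = ContentStore()
first = store.put(b"hello world")
second = store.put(b"hello world")  # same bytes, same address
other = store.put(b"hello ipfs")
```

Storing the same bytes twice yields one entry, which is the deduplication property mentioned above, and any peer holding the bytes can serve them because the address can be re-verified locally.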
TiKV is a cloud-native distributed transactional key-value store and storage engine. It provides a distributed database designed for horizontal scalability and strong consistency across a cluster of physical nodes. The system uses a Raft-based consensus mechanism to maintain data availability and state synchronization. It ensures ACID compliance for distributed transactions through a two-phase commit workflow and manages data distribution via multi-Raft sharding. The engine handles massive datasets using automated range splitting and cluster load balancing to distribute data across different
Implements a coprocessor for executing filtering and aggregation logic directly on storage nodes to minimize network latency.