32 repositorios
Merging columns from multiple objects based on shared keys or indexes.
Distinguishing note: General purpose joining functionality for data structures.
Explore 32 awesome GitHub repositories matching data & databases · Data Joins. Refine with filters or upvote what's useful.
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Combines columns from multiple data objects by matching on indexes or specified columns.
graphql-engine is an automated GraphQL API engine that transforms database tables and relationships into a queryable GraphQL schema. It functions as a federation gateway and mapper, instantly generating APIs with built-in filtering, pagination, and mutations from existing databases and remote schemas. The project distinguishes itself through a fine-grained access control layer that enforces row-level and field-level permissions. It further provides a real-time data subscription server that converts standard queries into live streams and a system for triggering event-driven webhooks and notifi
Creates relationships between data in a local database and fields in a remote schema.
Ramda is a functional JavaScript standard library and toolset for immutable data transformation and composition. It provides a comprehensive suite of pure utility functions designed to enable declarative data processing pipelines. The library is distinguished by its use of automatic function currying and a data-last argument order. These design patterns allow multi-argument functions to be partially applied, simplifying the construction of processing chains where data is passed through a sequence of operations. The toolkit covers broad data manipulation capabilities, including list processin
Implements SQL-style inner, outer, left, and right joins to combine lists of objects.
DataEase is an open-source, self-hosted business intelligence platform designed for building interactive data visualizations and managing analytical reporting. It provides a centralized environment where users can construct dashboards through a drag-and-drop interface, connecting to diverse data sources including relational databases, data warehouses, and external APIs. The platform distinguishes itself through its focus on embedded analytics and enterprise-grade governance. It allows for the seamless integration of charts, dashboards, and management modules into third-party web applications
Combines data from different sources using join logic to create unified analytical views.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Provides mechanisms for merging data streams and tables based on shared keys to enable complex correlations.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Rewrites join keys containing excessive null values based on historical distribution data to prevent performance bottlenecks.
Dask es un framework de computación paralela y un programador de tareas distribuido diseñado para escalar flujos de trabajo de ciencia de datos en Python desde máquinas individuales hasta grandes clústeres. Funciona como un gestor de recursos de clúster que orquesta la lógica computacional representando las tareas y sus dependencias como grafos acíclicos dirigidos. Esta arquitectura permite al sistema automatizar la distribución de cargas de trabajo a través del hardware disponible mientras gestiona requisitos de ejecución complejos. El proyecto se distingue por un motor de evaluación perezosa que difiere las operaciones de datos hasta que se solicitan explícitamente, permitiendo la optimización global del grafo y una asignación eficiente de recursos. Incorpora el volcado de datos consciente de la memoria para evitar fallos del sistema al procesar conjuntos de datos que exceden la memoria disponible, y utiliza la fusión de grafos de tareas para combinar secuencias de operaciones en pasos de ejecución únicos, minimizando la sobrecarga de programación y la comunicación entre nodos. La plataforma proporciona una superficie de capacidades integral para el análisis de datos a gran escala, incluyendo soporte para aprendizaje automático distribuido, integración de computación de alto rendimiento y procesamiento de datos en paralelo. Ofrece herramientas extensas para la gestión del ciclo de vida del clúster, perfilado de rendimiento y monitoreo en tiempo real de la ejecución de tareas. Los usuarios pueden desplegar estos entornos en diversas infraestructuras, incluyendo hardware local, proveedores de nube, sistemas en contenedores y clústeres de computación de alto rendimiento.
Combines multiple datasets by matching keys, handling index-based joins or network-wide data shuffles for non-indexed operations.
xsv is a suite of high-performance command-line utilities written in Rust for the analysis, manipulation, and statistical processing of large delimited datasets. It provides a toolkit for processing comma-separated value files through a command line interface. The project provides capabilities for statistical analysis, including the computation of column statistics, value frequencies, and descriptive metrics. It also includes data manipulation utilities for joining, slicing, sampling, and reformatting records. The toolkit covers a broad range of data operations including column selection, da
Provides high-performance inner, outer, and cross joins for combining large CSV datasets.
q is a command-line utility for the processing, filtering, and aggregation of tabular text and database files using standard SQL syntax. It functions as a query engine that treats CSV and TSV files, as well as standard input, as relational database tables. The tool distinguishes itself by providing a persistent cache layer that stores processed tabular data in a binary format to accelerate repeated queries on large datasets. It also maps individual filenames or stream identifiers to relational table names, enabling SQL joins across disparate text files. The project covers a broad range of da
Combines data from different files using join operations to link datasets based on shared columns.
cuDF is a GPU-accelerated dataframe library and data processing engine designed for manipulating and analyzing large tabular datasets. It provides a high-level API for executing filtering, joining, and aggregating operations directly on GPU hardware. The project integrates the Apache Arrow memory format to enable zero-copy data transfers and includes a just-in-time compiler for executing custom user-defined functions on the GPU. The library features specialized acceleration for existing workflows by redirecting standard Pandas dataframe calls and Polars query plans to a GPU backend. It also p
Supports merging dataframes using various join types including inner, left, right, full, semi, and anti joins.
VisiData is a terminal-based interactive data analysis tool and browser designed for exploring, filtering, and sorting large tabular datasets. It functions as a structured data inspector that loads and flattens complex formats like JSON, XML, and PCAP into interactive sheets, as well as a terminal file manager for navigating directories and performing staged filesystem operations. The project distinguishes itself by rendering data visualizations, such as scatter plots and histograms, directly in the terminal using Unicode Braille characters. It provides a Python-based data wrangling environme
Implements inner, outer, and full joins to merge multiple datasets based on shared keys.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Merges large datasets without materializing the result in memory to minimize resource consumption.
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Merges multiple JavaScript arrays by matching common keys or properties to create combined data sets.
ToyDB is a distributed SQL database that provides a system for storing and querying data across multiple nodes. It focuses on maintaining strong consistency and fault tolerance through the implementation of a distributed consensus algorithm. The project distinguishes itself by supporting historical data versioning, enabling time-travel queries to retrieve the state of the database from a specific point in the past. It utilizes multi-version concurrency control to manage ACID transactions and ensure data integrity during concurrent operations. The system covers relational data modeling with t
Implements general purpose joining functionality to merge columns from multiple tables based on shared keys.
Faust is a Python library for building distributed stream processing applications that integrate with Kafka. It functions as an asynchronous stream processor designed to handle high-throughput event streams and real-time data analysis using asynchronous functions. The system operates as a distributed stream processor and state store, utilizing sharding and partitioned topics to scale processing workloads horizontally across multiple worker nodes. It maintains state through a replicated key-value storage system backed by local databases to ensure high availability and fast recovery. The frame
Combines multiple real-time streams or tables using join operations to correlate event data.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Correlates data from multiple sources using hash joins for memory-efficient lookups or co-grouping for key-based aggregation.
TanStack Form is a cross-framework form state management library that provides typed fields, validation, and submission across React, Vue, Angular, Solid, Lit, Svelte, and Preact. It uses a shared form model that adapts to different UI frameworks while preserving the same validation and submission logic, and offers headless form controls that impose no UI markup, letting developers bring their own inputs and design system. The library distinguishes itself through granular state subscription, where components subscribe to narrow slices of form or field state using reactive primitives, so only
Combines data from multiple collections in live queries to denormalize and unify data from different sources.
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
Joins CSV files on common columns using command-line operations.
Perfetto is a platform for system-level performance tracing and analysis on Linux and Android. It combines a high-throughput trace recorder, a SQL-based query engine, and a browser-based visualizer into a single toolchain. The platform covers CPU scheduling and call-stack profiling, native and Java heap memory allocation tracking, GPU and graphics events, and system-wide counters such as CPU frequency and power consumption. The architecture decouples trace recording from offline analysis, using a compact protobuf format for event encoding and columnar storage for efficient SQL queries. The we
Provides SQL operators for correlating trace events through span joins and flow functions.
This repository is a curated study resource of interview questions and answers for data science roles. It covers the core domains of machine learning, statistics, Python programming, SQL databases, deep learning, and algorithmic problem solving. The content is organized as static Markdown files with a structured question-and-answer format, making it easy to read and navigate without any server-side processing. The material distinguishes itself by pairing each question with a detailed explanation and often a code example, covering both conceptual knowledge and practical application. Topics ran
Describes inner, left, right, full, and self joins for combining data from multiple tables.