5 Repos
Performing quality checks and exploratory analysis on distributed tabular datasets.
Distinct from Dataframe Processing: this topic focuses specifically on analyzing Spark DataFrames, whereas Dataframe Processing covers general programmatic manipulation.
Explore 5 awesome GitHub repositories matching data & databases · Distributed Dataframe Analysis.
This project is a data profiling and exploratory data analysis tool designed to generate automated quality reports for Pandas and Spark dataframes. It serves as a system for computing descriptive statistics, identifying correlations, and analyzing univariate and multivariate data patterns. The tool provides specialized capabilities for comparing different versions of datasets to identify changes in data quality and distributions. It includes a dedicated profiler for time-dependent data to extract statistical information such as seasonality and auto-correlation. The software covers a broad an
Implements large-scale data quality checks and exploratory analysis specifically for Spark DataFrames.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Exposes in-memory Polars DataFrames to browser clients over a WebSocket connection for remote analysis.
Pixie is an open-source observability platform for Kubernetes that uses eBPF to automatically capture telemetry data from clusters without requiring any manual instrumentation or code changes. It functions as an eBPF telemetry collector, a continuous application profiler, a network traffic analyzer, and a scriptable telemetry query engine, all within a single Kubernetes-native tool. The platform distinguishes itself through several integrated capabilities. It continuously samples stack traces from compiled-language code to identify CPU performance bottlenecks, visualizing the results as inter
Transforms tabular telemetry data through immutable dataframe operations for observability analysis.
This project is a Python library for data analysis and a framework for exploratory data analysis, designed for processing raw datasets. It provides a suite of tools for examining data, identifying anomalies, and applying statistical methods to uncover patterns. The repository serves as a machine learning modeling toolkit and a statistical data modeling suite. It contains predictive algorithms and mathematical models used to analyze relationships between data variables and derive insights from complex datasets. The project covers a broad range of capabilities, including data science, machine learning modeling, and exploratory data analysis. These are implemented through data manipulation, numerical computation, and data visualization.
Provides capabilities to perform numerical transformations and filtering on tabular data structures to derive insights.
This is a comprehensive Python programming course and technical curriculum designed to take users from foundational syntax to advanced development patterns. It serves as a multi-disciplinary educational suite covering programming fundamentals, object-oriented design, and data analysis. The project provides specialized guides on professional development techniques, including the use of decorators, generators for memory management, and dunder-method operator overloading. It also includes instructional material on executing parallel tasks through concurrency and multiprocessing to reduce executi
Provides a suite for loading structured datasets and performing numerical transformations using DataFrames.