5 repositories
Performing quality checks and exploratory analysis on distributed tabular datasets.
Distinct from Dataframe Processing: Focuses on the analysis of Spark DataFrames specifically, whereas Dataframe Processing is general programmatic manipulation.
5 GitHub repositories matching data & databases · Distributed Dataframe Analysis.
This project is a data profiling and exploratory data analysis tool designed to generate automated quality reports for Pandas and Spark dataframes. It serves as a system for computing descriptive statistics, identifying correlations, and analyzing univariate and multivariate data patterns. The tool provides specialized capabilities for comparing different versions of datasets to identify changes in data quality and distributions. It includes a dedicated profiler for time-dependent data to extract statistical information such as seasonality and auto-correlation. The software covers a broad an
Implements large-scale data quality checks and exploratory analysis specifically for Spark DataFrames.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Exposes in-memory Polars DataFrames to browser clients over a WebSocket connection for remote analysis.
Pixie is an open-source observability platform for Kubernetes that uses eBPF to automatically capture telemetry data from clusters without requiring any manual instrumentation or code changes. It functions as an eBPF telemetry collector, a continuous application profiler, a network traffic analyzer, and a scriptable telemetry query engine, all within a single Kubernetes-native tool. The platform distinguishes itself through several integrated capabilities. It continuously samples stack traces from compiled-language code to identify CPU performance bottlenecks, visualizing the results as inter
Transforms tabular telemetry data through immutable dataframe operations for observability analysis.
This project is a Python data analysis library and exploratory data analysis framework designed to process raw datasets. It provides a suite of tools for examining data, identifying anomalies, and applying statistical methods to uncover patterns. The repository functions as a machine learning modeling toolkit and a statistical data modeling suite. It includes predictive algorithms and mathematical models used to analyze relationships between data variables and draw insights from complex datasets. The project covers a broad range of capabilities, including data science, machine learning modeling, and exploratory data analysis. These are implemented through data manipulation, numerical computation, and data visualization.
Provides capabilities to perform numerical transformations and filtering on tabular data structures to derive insights.
This is a comprehensive Python programming course and technical curriculum designed to take users from foundational syntax to advanced development patterns. It serves as a multi-disciplinary educational suite covering programming fundamentals, object-oriented design, and data analysis. The project provides specialized guides on professional development techniques, including the use of decorators, generators for memory management, and dunder-method operator overloading. It also includes instructional material on executing parallel tasks through concurrency and multiprocessing to reduce executi
Provides a suite for loading structured datasets and performing numerical transformations using DataFrames.