31 repositories
Programmatic manipulation of tabular datasets for statistical and machine learning workflows.
Distinct from general Data Processing: focuses specifically on the dataframe abstraction for tabular data manipulation.
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Provides practical guides for manipulating tabular datasets using dataframe abstractions for statistical and machine learning workflows.
Pandas AI is a data analysis library and natural language interface that uses large language models to perform conversational querying on structured datasets. It functions as a retrieval-augmented generation framework designed to translate plain text questions into executable code for extracting insights from dataframes and structured files. The system includes a dedicated sandbox execution environment that runs AI-generated analysis code within an isolated container to prevent security risks and system compromise. It employs a natural language translation layer and contextual retrieval to ma
Provides a common API to allow uniform querying across various data structures and table formats.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Manipulates tabular datasets through programmatic transformations for statistical analysis.
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across available hardware while handling complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It incorporates memory-aware data spilling to prevent system crashes when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a comprehensive capability surface for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tools for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on a variety of infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Converts tabular data structures into unordered collections to facilitate flexible processing patterns.
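The lazy, graph-based evaluation described for Dask above can be illustrated with a toy sketch in plain Python. This is not Dask's implementation or API: the `compute` function, the graph layout (a dict mapping a key to a tuple of a function followed by dependency keys), and all task names are assumptions made for illustration. It only shows the general idea that nothing runs until a result is requested, and that only the tasks the requested key depends on are executed.

```python
from typing import Any, Dict, List, Optional

# Hypothetical graph layout: key -> (function, *dependency_keys) or a literal value.
Graph = Dict[str, Any]


def compute(graph: Graph, key: str, cache: Optional[Dict[str, Any]] = None) -> Any:
    """Evaluate `key`, recursively evaluating only the tasks it depends on."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *deps = task
        # Every remaining tuple element is treated as the key of another task.
        args = [compute(graph, dep, cache) for dep in deps]
        result = func(*args)
    else:
        result = task  # literal value, nothing to run
    cache[key] = result
    return result


calls: List[str] = []  # records which tasks actually ran


def load() -> List[int]:
    calls.append("load")
    return [1, 2, 3, 4]


def double(xs: List[int]) -> List[int]:
    calls.append("double")
    return [2 * x for x in xs]


def total(xs: List[int]) -> int:
    calls.append("total")
    return sum(xs)


def unused(xs: List[int]) -> int:
    calls.append("unused")
    return max(xs)


# Building the graph runs nothing; it only describes the work.
graph: Graph = {
    "raw": (load,),
    "doubled": (double, "raw"),
    "sum": (total, "doubled"),
    "max": (unused, "raw"),  # never requested, so never executed
}

result = compute(graph, "sum")
print(result)  # 20
print(calls)   # ['load', 'double', 'total']
```

Because the whole graph is known before anything executes, a real scheduler has room to optimize it (for example by fusing `double` and `total` into one step) and to place independent tasks on different workers; the sketch above does neither.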
This project is an exploratory data analysis framework and profiling tool designed to generate comprehensive statistical reports from Pandas and Spark DataFrames. It functions as a data quality profiler that identifies missing values, duplicates, and high correlations within tabular datasets. The tool distinguishes itself through specialized capabilities for time-series analysis, extracting temporal statistics, seasonality, and auto-correlation plots. It also includes a dataset comparison utility to identify structural or content changes between different versions of a dataset. The analysis
Implements a unified interface that allows the same analysis logic to run on both Pandas and Spark dataframes.
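As a rough illustration of the kind of data quality checks a profiler of this sort reports (missing values and duplicate rows), here is a minimal standard-library sketch. It is not the tool's code or API: the `profile` function, the row format (a list of dicts), and the sample data are all hypothetical, and a missing key is treated the same as a `None` value.

```python
from collections import Counter
from typing import Any, Dict, List


def profile(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count rows, missing values per column, and fully duplicated rows."""
    columns = sorted({col for row in rows for col in row})
    # A value is "missing" if the key is absent or explicitly None.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    # Rows that match in every column count as duplicates of the first one seen.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return {"n_rows": len(rows), "missing": missing, "duplicate_rows": duplicates}


# Hypothetical sample data.
rows = [
    {"id": 1, "city": "Oslo", "temp": 4.5},
    {"id": 2, "city": None, "temp": 7.0},
    {"id": 1, "city": "Oslo", "temp": 4.5},  # exact duplicate of the first row
    {"id": 3, "city": "Rome"},               # "temp" is absent
]

report = profile(rows)
print(report["n_rows"])          # 4
print(report["missing"])         # {'city': 1, 'id': 0, 'temp': 1}
print(report["duplicate_rows"])  # 1
```

A full profiler would add per-column statistics, correlations, and type inference on top of counts like these.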
This project is a data profiling and exploratory data analysis tool designed to generate automated quality reports for Pandas and Spark dataframes. It serves as a system for computing descriptive statistics, identifying correlations, and analyzing univariate and multivariate data patterns. The tool provides specialized capabilities for comparing different versions of datasets to identify changes in data quality and distributions. It includes a dedicated profiler for time-dependent data to extract statistical information such as seasonality and auto-correlation. The software covers a broad an
Implements a unified API to execute profiling logic across both Pandas and Spark data structures.
This project is an exploratory data analysis library and profiling tool for Pandas and Spark DataFrames. It automates the initial investigation of datasets by generating comprehensive descriptive analysis reports, statistical summaries, and data quality warnings. The system functions as a data quality profiler to detect missing values, duplicate rows, and type inconsistencies. It includes a dataset comparison tool for identifying structural and content shifts between different versions of the same data, as well as specialized tools for time-series analysis to calculate auto-correlation and se
Implements a structured pipeline that processes Pandas and Spark dataframes through sequential statistical and type-inference stages.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Exposes in-memory Polars DataFrames to browser clients over a WebSocket connection for remote analysis.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Distributes data and computations across all available CPU cores to accelerate processing speeds.
gs-quant is a quantitative finance library and financial data analytics toolkit. It serves as a framework for analyzing financial data, developing systematic trading strategies, and managing risk exposure for derivative products in global markets. The project provides tools for quantitative financial analysis, quantitative portfolio modeling, and the development of systematic trading strategies. It enables the calculation of risk for derivative products to structure and hedge positions across markets.
Provides tabular data manipulation capabilities for processing financial time series and risk metrics.
This project is a Lua-based completion engine for Neovim that aggregates real-time text suggestions from multiple data sources into a single interface. It functions as a modular framework for extending the editor with custom completion logic, acting as both a fuzzy text suggestion tool and an interface for the Language Server Protocol. The engine utilizes a source-agnostic provider interface to standardize how disparate data sources feed candidates into a central logic engine. It employs asynchronous candidate fetching and a non-blocking architecture to retrieve suggestions from external serv
Standardizes how disparate data sources feed completion candidates into the central engine via a common Lua API.
BlockNote is a block-based rich text editor and a real-time collaborative workspace. It uses a JSON-based data model to organize content into draggable, nestable blocks rather than a single flat document. The system functions as a high-level interface built on ProseMirror that abstracts document state into discrete, manipulatable content blocks. The project serves as a framework for integrating large language models into document editors, enabling context-aware text generation and AI-driven workflows. It also acts as a document export engine capable of converting structured block data into fo
Provides a pluggable synchronization layer that abstracts the communication between the editor and external sync services.
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Manipulates data using a tabular API to execute common transformations at scale.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Offers a programmatic dataframe abstraction for high-performance manipulation of billion-row datasets.
This project is a Python education repository and programming tutorial designed to teach language fundamentals, from basic syntax and variables to advanced concepts. It serves as a data science starter kit and a guide for REST API integration. The repository provides instructional scripts and sample code covering object-oriented programming patterns and asynchronous programming. It includes practical demonstrations for fetching and processing JSON data from external web services using HTTP requests. The materials cover a broad capability surface including data analysis workflows with interac
Demonstrates programmatic manipulation of tabular datasets using DataFrames for analytical workflows.
This project is a Python machine learning education kit that provides curated datasets and visualization scripts to teach fundamental machine learning concepts. It functions as both a machine learning visualization library and a collection of educational datasets designed for demonstrating and testing common models and patterns. The toolkit focuses on illustrating the internal logic and operational patterns of machine learning algorithms. It generates figures and datasets that visualize how different models behave and operate on data to aid in the learning process. The implementation utilize
Uses dataframe abstractions for the programmatic manipulation and cleaning of tabular educational datasets.
This project is a Python data science curriculum and programming tutorial collection. It provides a structured set of educational notebooks and scripts designed to teach data analysis, machine learning, and deep learning. The repository serves as a learning path for building and tuning predictive models, including regression, decision trees, and neural networks. It includes a data visualization guide for creating financial time-series plots and a multiprocessing reference for implementing parallel task execution and shared memory synchronization. The curriculum covers broader capability area
Provides instruction and scripts for programmatic manipulation of tabular datasets using the dataframe abstraction.
Pixie is an open-source observability platform for Kubernetes that uses eBPF to automatically capture telemetry data from clusters without requiring any manual instrumentation or code changes. It functions as an eBPF telemetry collector, a continuous application profiler, a network traffic analyzer, and a scriptable telemetry query engine, all within a single Kubernetes-native tool. The platform distinguishes itself through several integrated capabilities. It continuously samples stack traces from compiled-language code to identify CPU performance bottlenecks, visualizing the results as inter
Processes telemetry data through a chain of immutable dataframe operations with automatic optimization.
This project is a Python data analysis library and exploratory data analysis framework designed for processing raw datasets. It offers a suite of tools for examining data, identifying anomalies, and applying statistical methods to discover patterns. The repository functions as a machine learning modeling toolkit and a statistical data modeling suite. It includes predictive algorithms and mathematical models used to analyze relationships between data variables and derive insights from complex datasets. The project covers a broad range of capabilities, including data science, machine learning modeling, and exploratory data analysis. These are implemented through data manipulation, numerical computing, and data visualization.
Provides capabilities to perform numerical transformations and filtering on tabular data structures to derive insights.
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
Provides a distributed dataframe library for processing large-scale structured and unstructured data across local cores or Kubernetes clusters.