Why is donnemartin/data-science-ipython-notebooks a recommended Dataframe Processing repository?

Provides practical guides for manipulating tabular datasets using dataframe abstractions for statistical and machine learning workflows.

Why is gventuri/pandas-ai a recommended Dataframe Processing repository?

Provides a common API to allow uniform querying across various data structures and table formats.

Why is vonng/ddia a recommended Dataframe Processing repository?

Manipulates tabular datasets through programmatic transformations for statistical analysis.

Why is dask/dask a recommended Dataframe Processing repository?

Converts tabular data structures into unordered collections to facilitate flexible processing patterns.

Why is ydataai/pandas-profiling a recommended Dataframe Processing repository?

Implements a unified interface that allows the same analysis logic to run on both Pandas and Spark dataframes.

Why is data-centric-ai-community/fg-data-profiling a recommended Dataframe Processing repository?

Implements a unified API to execute profiling logic across both Pandas and Spark data structures.

Why is pandas-profiling/pandas-profiling a recommended Dataframe Processing repository?

Implements a structured pipeline that processes Pandas and Spark dataframes through sequential statistical and type-inference stages.

Why is perspective-dev/perspective a recommended Dataframe Processing repository?

Exposes in-memory Polars DataFrames to browser clients over a WebSocket connection for remote analysis.

Why is modin-project/modin a recommended Dataframe Processing repository?

Distributes data and computations across all available CPU cores to accelerate processing speeds.

Why is goldmansachs/gs-quant a recommended Dataframe Processing repository?

Provides tabular data manipulation capabilities for processing financial time series and risk metrics.

31 repositories

Awesome GitHub Repositories: Dataframe Processing

Programmatic manipulation of tabular datasets for statistical and machine learning workflows.

Distinct from general data processing: focuses specifically on the dataframe abstraction for tabular data manipulation.

Explore 31 awesome GitHub repositories matching data & databases · Dataframe Processing. Refine with filters or upvote what's useful.


donnemartin/data-science-ipython-notebooks
29,166 stars
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Provides practical guides for manipulating tabular datasets using dataframe abstractions for statistical and machine learning workflows.
Python · aws · big-data · caffe
gventuri/pandas-ai
23,587 stars
Pandas AI is a data analysis library and natural language interface that uses large language models to perform conversational querying on structured datasets. It functions as a retrieval-augmented generation framework designed to translate plain text questions into executable code for extracting insights from dataframes and structured files. The system includes a dedicated sandbox execution environment that runs AI-generated analysis code within an isolated container to prevent security risks and system compromise. It employs a natural language translation layer and contextual retrieval to ma
Provides a common API to allow uniform querying across various data structures and table formats.
Python
Vonng/ddia
22,648 stars
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Manipulates tabular datasets through programmatic transformations for statistical analysis.
Python · book · database · ddia
dask/dask
13,746 stars
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while handling complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It incorporates memory-aware data spilling to prevent system crashes when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a comprehensive capability surface for large-scale data analysis, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It provides extensive tools for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on various infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Converts tabular data structures into unordered collections to facilitate flexible processing patterns.
Python · dask · numpy · pandas
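
The idea of representing tasks and their dependencies as a directed acyclic graph and evaluating them lazily can be illustrated with a small standard-library sketch. This is not Dask's API: the `graph` dictionary, its keys, and the `compute` helper are hypothetical names chosen for illustration, and a real scheduler adds parallel execution, optimization, and memory management on top.

```python
from operator import add, mul

# Hypothetical toy graph: each key maps either to a literal value or to a
# (function, *arguments) tuple. Building the graph runs nothing.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "scaled": (mul, "sum", 10),
}


def compute(graph, key, cache=None):
    """Evaluate `key` on demand, resolving its dependencies first."""
    cache = {} if cache is None else cache
    if key not in cache:
        task = graph[key]
        if isinstance(task, tuple) and task and callable(task[0]):
            func, *args = task
            # Arguments that name other keys are dependencies;
            # anything else is passed through as a literal.
            resolved = [
                compute(graph, a, cache) if isinstance(a, str) and a in graph else a
                for a in args
            ]
            cache[key] = func(*resolved)
        else:
            cache[key] = task
    return cache[key]


# Work happens only here, when a result is explicitly requested.
result = compute(graph, "scaled")
```

Because each intermediate result is cached by key, a value shared by several downstream tasks is computed once per `compute` call.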
ydataai/pandas-profiling
13,610 stars
This project is an exploratory data analysis framework and profiling tool designed to generate comprehensive statistical reports from Pandas and Spark DataFrames. It functions as a data quality profiler that identifies missing values, duplicates, and high correlations within tabular datasets. The tool distinguishes itself through specialized capabilities for time-series analysis, extracting temporal statistics, seasonality, and auto-correlation plots. It also includes a dataset comparison utility to identify structural or content changes between different versions of a dataset. The analysis
Implements a unified interface that allows the same analysis logic to run on both Pandas and Spark dataframes.
Python
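
The kind of checks a data quality profiler performs, such as counting missing values and duplicate rows, can be sketched with the standard library alone. This is an illustration of the concept, not the profiling library's implementation: the `rows` sample data and the `profile` function are hypothetical, and a list of dicts stands in for a dataframe.

```python
from collections import Counter

# Hypothetical sample data: a list of row dicts standing in for a dataframe.
rows = [
    {"id": 1, "city": "Oslo", "temp": 4.5},
    {"id": 2, "city": None, "temp": 7.0},
    {"id": 1, "city": "Oslo", "temp": 4.5},  # exact duplicate of the first row
    {"id": 3, "city": "Bergen", "temp": None},
]


def profile(rows):
    """Return per-column missing counts and the number of duplicate rows."""
    columns = list(rows[0]) if rows else []
    missing = {col: sum(1 for r in rows if r.get(col) is None) for col in columns}
    # Count identical rows; every occurrence beyond the first is a duplicate.
    row_counts = Counter(tuple(r.get(col) for col in columns) for r in rows)
    duplicates = sum(n - 1 for n in row_counts.values() if n > 1)
    return {"rows": len(rows), "missing": missing, "duplicate_rows": duplicates}


report = profile(rows)
```

A full profiler would also infer column types and compute correlations; this sketch covers only the two simplest checks.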
Data-Centric-AI-Community/fg-data-profiling
13,609 stars
This project is a data profiling and exploratory data analysis tool designed to generate automated quality reports for Pandas and Spark dataframes. It serves as a system for computing descriptive statistics, identifying correlations, and analyzing univariate and multivariate data patterns. The tool provides specialized capabilities for comparing different versions of datasets to identify changes in data quality and distributions. It includes a dedicated profiler for time-dependent data to extract statistical information such as seasonality and auto-correlation. The software covers a broad an
Implements a unified API to execute profiling logic across both Pandas and Spark data structures.
Python
pandas-profiling/pandas-profiling
13,609 stars
This project is an exploratory data analysis library and profiling tool for Pandas and Spark DataFrames. It automates the initial investigation of datasets by generating comprehensive descriptive analysis reports, statistical summaries, and data quality warnings. The system functions as a data quality profiler to detect missing values, duplicate rows, and type inconsistencies. It includes a dataset comparison tool for identifying structural and content shifts between different versions of the same data, as well as specialized tools for time-series analysis to calculate auto-correlation and se
Implements a structured pipeline that processes Pandas and Spark dataframes through sequential statistical and type-inference stages.
Python
perspective-dev/perspective
10,981 stars
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Exposes in-memory Polars DataFrames to browser clients over a WebSocket connection for remote analysis.
C++ · analytics · bi · data-visualization
modin-project/modin
10,389 stars
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Distributes data and computations across all available CPU cores to accelerate processing speeds.
Python · analytics · data-science · dataframe
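
The partition-and-parallelize pattern described above can be sketched with the standard library. This is not Modin's API or implementation: `split` and `parallel_map_rows` are hypothetical helpers, and a thread pool is used only as a stand-in for the process-based or cluster-based engines a real distributed dataframe library would plug in (in standard CPython, threads do not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, n_parts):
    """Split rows into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallel_map_rows(rows, func, n_parts=4):
    """Apply func to every row, one partition per worker, preserving order."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        parts = pool.map(lambda chunk: [func(r) for r in chunk], split(rows, n_parts))
        return [row for part in parts for row in part]


values = list(range(10))
squared = parallel_map_rows(values, lambda v: v * v)
```

The caller's code does not change with the number of partitions or workers, which is the property that lets such a library keep a familiar API while changing how the work is executed.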
goldmansachs/gs-quant
9,912 stars
gs-quant is a quantitative finance library and financial data analytics toolkit. It serves as a framework for analyzing financial data, developing systematic trading strategies, and managing risk exposure for derivative products in global markets. The project provides tools for quantitative financial analysis, quantitative portfolio modeling, and the development of systematic trading strategies. It enables the calculation of risk for derivative products to structure and hedge positions across markets.
Provides tabular data manipulation capabilities for processing financial time series and risk metrics.
Jupyter Notebook · derivatives · goldman-sachs · gs-quant
hrsh7th/nvim-cmp
9,455 stars
This project is a Lua-based completion engine for Neovim that aggregates real-time text suggestions from multiple data sources into a single interface. It functions as a modular framework for extending the editor with custom completion logic, acting as both a fuzzy text suggestion tool and an interface for the Language Server Protocol. The engine utilizes a source-agnostic provider interface to standardize how disparate data sources feed candidates into a central logic engine. It employs asynchronous candidate fetching and a non-blocking architecture to retrieve suggestions from external serv
Standardizes how disparate data sources feed completion candidates into the central engine via a common Lua API.
Lua
TypeCellOS/BlockNote
9,141 stars
BlockNote is a block-based rich text editor and a real-time collaborative workspace. It uses a JSON-based data model to organize content into draggable, nestable blocks rather than a single flat document. The system functions as a high-level interface built on ProseMirror that abstracts document state into discrete, manipulatable content blocks. The project serves as a framework for integrating large language models into document editors, enabling context-aware text generation and AI-driven workflows. It also acts as a document export engine capable of converting structured block data into fo
Provides a pluggable synchronization layer that abstracts the communication between the editor and external sync services.
TypeScript · block-based · editor · javascript
apache/beam
8,612 stars
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Manipulates data using a tabular API to execute common transformations at scale.
Java
vaexio/vaex
8,506 stars
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Offers a programmatic dataframe abstraction for high-performance manipulation of billion-row datasets.
Python
microsoft/c9-python-getting-started
8,012 stars
This project is a Python education repository and programming tutorial designed to teach language fundamentals, from basic syntax and variables to advanced concepts. It serves as a data science starter kit and a guide for REST API integration. The repository provides instructional scripts and sample code covering object-oriented programming patterns and asynchronous programming. It includes practical demonstrations for fetching and processing JSON data from external web services using HTTP requests. The materials cover a broad capability surface including data analysis workflows with interac
Demonstrates programmatic manipulation of tabular datasets using DataFrames for analytical workflows.
Jupyter Notebook
amueller/introduction_to_ml_with_python
8,025 stars
This project is a Python machine learning education kit that provides curated datasets and visualization scripts to teach fundamental machine learning concepts. It functions as both a machine learning visualization library and a collection of educational datasets designed for demonstrating and testing common models and patterns. The toolkit focuses on illustrating the internal logic and operational patterns of machine learning algorithms. It generates figures and datasets that visualize how different models behave and operate on data to aid in the learning process. The implementation utilize
Uses dataframe abstractions for the programmatic manipulation and cleaning of tabular educational datasets.
Jupyter Notebook
codebasics/py
7,262 stars
This project is a Python data science curriculum and programming tutorial collection. It provides a structured set of educational notebooks and scripts designed to teach data analysis, machine learning, and deep learning. The repository serves as a learning path for building and tuning predictive models, including regression, decision trees, and neural networks. It includes a data visualization guide for creating financial time-series plots and a multiprocessing reference for implementing parallel task execution and shared memory synchronization. The curriculum covers broader capability area
Provides instruction and scripts for programmatic manipulation of tabular datasets using the dataframe abstraction.
Jupyter Notebook · jupyter · jupyter-notebook · jupyter-notebooks
pixie-io/pixie
6,467 stars
Pixie is an open-source observability platform for Kubernetes that uses eBPF to automatically capture telemetry data from clusters without requiring any manual instrumentation or code changes. It functions as an eBPF telemetry collector, a continuous application profiler, a network traffic analyzer, and a scriptable telemetry query engine, all within a single Kubernetes-native tool. The platform distinguishes itself through several integrated capabilities. It continuously samples stack traces from compiled-language code to identify CPU performance bottlenecks, visualizing the results as inter
Processes telemetry data through a chain of immutable dataframe operations with automatic optimization.
C++
WillKoehrsen/Data-Analysis
5,543 stars
This project is a Python data analysis library and exploratory data analysis framework designed for processing raw datasets. It provides a suite of tools for examining data, identifying anomalies, and applying statistical methods to uncover patterns. The repository functions as a machine learning modeling toolkit and a statistical data modeling suite. It includes predictive algorithms and mathematical models used to analyze relationships between data variables and derive insights from complex datasets. The project covers a wide range of capabilities, including data science, machine learning modeling, and exploratory data analysis. These are implemented through data manipulation, numerical computing, and data visualization.
Provides capabilities to perform numerical transformations and filtering on tabular data structures to derive insights.
Jupyter Notebook
Eventual-Inc/Daft
5,225 stars
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
Provides a distributed dataframe library for processing large-scale structured and unstructured data across local cores or Kubernetes clusters.
Rust · ai-engineering · ai-pipeline · arrow