What are the best Awesome Data Processing GitHub Repositories?

Tools and frameworks that perform computational operations, transformations, and analysis on raw data sets. Explore 314 awesome GitHub repositories matching data & databases · Data Processing. Refine with filters or upvote what's useful. Top picks: vinta/awesome-python, macrozheng/mall, elastic/elasticsearch, abi/screenshot-to-code, josephmisiti/awesome-machine-learning, hiyouga/llama-factory, fffaraz/awesome-cpp, openbb-finance/openbb, unclecode/crawl4ai, unslothai/unsloth.

Why is vinta/awesome-python a recommended Data Processing repository?

Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.

Why is macrozheng/mall a recommended Data Processing repository?

Offloads complex query operations to a distributed cluster to provide high-performance full-text retrieval.

Why is elastic/elasticsearch a recommended Data Processing repository?

Delivers high-performance full-text search capabilities with advanced relevance ranking and complex filtering on unstructured datasets.

Why is abi/screenshot-to-code a recommended Data Processing repository?

Extracts temporal and spatial information from video recordings to reconstruct interaction flows and dynamic UI states in generated code.

Why is josephmisiti/awesome-machine-learning a recommended Data Processing repository?

Enables large-scale computation through distributed frameworks designed for parallelized data processing and analytics.

Why is hiyouga/llama-factory a recommended Data Processing repository?

Manages training data pipelines that integrate cloud/local storage with synthetic data generation.

Why is fffaraz/awesome-cpp a recommended Data Processing repository?

Organizes data compression and archiving utilities supporting formats like 7-zip and Brotli.

Why is openbb-finance/openbb a recommended Data Processing repository?

Enforces standardized data structures to ensure information from heterogeneous financial APIs remains consistent throughout the research pipeline.

Why is unclecode/crawl4ai a recommended Data Processing repository?

Maps unstructured web content into predefined data structures using automated path selection or intelligent language model analysis.

Why is unslothai/unsloth a recommended Data Processing repository?

Structures and generates synthetic training data via visual workflows to improve model learning efficacy.

314 repositories

vinta/awesome-python
303,207 stars
This project is a comprehensive, community-curated directory that organizes a vast landscape of Python libraries, frameworks, and software tools. It serves as a centralized knowledge base designed to ease navigation of the ecosystem and speed up discovery for developers across the entire software development lifecycle. The directory distinguishes itself by providing a structured index of resources categorized by technical domain, ranging from fundamental development utilities to specialized engineering fields. It covers high-level capabilities, including artificial intelligence, data science, web development, and infrastructure management, allowing developers to identify vetted solutions for specific technical challenges. The project spans a broad surface of capabilities, including tools for dependency management, static code analysis, and automated testing. It also catalogs resources for persistent data storage, cloud infrastructure orchestration, and interface development, offering a unified reference for building and maintaining complex software systems.
Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.
Python, awesome, collections, python
macrozheng/mall
83,878 stars
This project is an enterprise-grade Java framework designed for building scalable, full-stack e-commerce applications. It provides a comprehensive foundation for microservice-based distributed architectures, enabling the development of complex retail platforms that include product management, order processing, and secure user authentication. By leveraging modular service patterns and centralized API gateways, the framework supports the construction of resilient systems that decompose monolithic business logic into independent, manageable services. The platform distinguishes itself through a r
Offloads complex query operations to a distributed cluster to provide high-performance full-text retrieval.
Java, docker, elasticsearch, elk
elastic/elasticsearch
77,012 stars
Elasticsearch is a distributed search engine and document store designed for the high-performance indexing and retrieval of massive volumes of unstructured data. It functions as a centralized analytics platform, providing a schema-flexible architecture that organizes information into searchable indices while maintaining global cluster state through a distributed consensus mechanism. The platform distinguishes itself through its integrated approach to observability, security, and advanced analytics. It combines full-text, vector, and hybrid search capabilities with machine learning-driven insi
Delivers high-performance full-text search capabilities with advanced relevance ranking and complex filtering on unstructured datasets.
Java, elasticsearch, java, search-engine
abi/screenshot-to-code
72,926 stars
This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming language syntax. The tool distinguishes itself by supporting both static design mockups and dynamic video recordings. It processes temporal and spatial information from screen captures to reconstruct interaction flows and state transitions, enabling the creation of functional software prototypes from vis
Extracts temporal and spatial information from video recordings to reconstruct interaction flows and dynamic UI states in generated code.
Python
josephmisiti/awesome-machine-learning
72,867 stars
This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical domain to simplify discovery across the artificial intelligence ecosystem. The collection distinguishes itself by providing a cross-language development index that spans diverse programming environments, including C, C++, Rust, Clojure, and Python. It covers a wide range of specialized capabilities, fr
Enables large-scale computation through distributed frameworks designed for parallelized data processing and analytics.
Python
hiyouga/LLaMA-Factory
72,241 stars
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Manages training data pipelines that integrate cloud/local storage with synthetic data generation.
Python
fffaraz/awesome-cpp
71,817 stars
This project is a comprehensive, curated directory of high-quality libraries, tools, and educational resources for C and C++ development. It serves as an ecosystem discovery index, helping developers navigate the vast landscape of third-party components, frameworks, and technical documentation available for the language. The collection is distinguished by its focus on high-performance systems programming and technical mastery. It provides deep coverage of specialized domains including SIMD-accelerated data processing, compile-time template metaprogramming, and asynchronous event-driven archit
Organizes data compression and archiving utilities supporting formats like 7-zip and Brotli.
awesome, awesome-list, c
OpenBB-finance/OpenBB
69,583 stars
OpenBB is a financial data platform and investment research terminal designed to aggregate, normalize, and distribute market data across analytical workflows. It functions as a comprehensive ecosystem that bridges disparate financial data providers with custom applications, spreadsheets, and internal modeling infrastructure. The platform distinguishes itself through a provider-based data abstraction layer that normalizes heterogeneous financial APIs into a consistent, schema-driven format. This architecture supports quantitative research automation and the construction of interactive, widget-
Enforces standardized data structures to ensure information from heterogeneous financial APIs remains consistent throughout the research pipeline.
Python, ai, crypto, derivatives
unclecode/crawl4ai
68,644 stars
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Maps unstructured web content into predefined data structures using automated path selection or intelligent language model analysis.
Python
unslothai/unsloth
66,628 stars
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
Structures and generates synthetic training data via visual workflows to improve model learning efficacy.
Python, agent, deepseek, deepseek-r1
scikit-learn/scikit-learn
66,344 stars
Scikit-learn is a machine learning library for predictive data analysis that provides a collection of algorithms for supervised and unsupervised learning. It functions as a comprehensive toolkit for data preprocessing, dimensionality reduction, and model selection, allowing users to classify data objects, predict continuous values, and cluster similar items based on historical patterns. The project is defined by a unified interface design where objects either learn from data, transform data, or chain these operations into sequential workflows. To ensure performance on large or high-dimensiona
Extracts and scales features to ensure raw data meets the strict input requirements of machine learning models.
Python, data-analysis, data-science, machine-learning
sindresorhus/awesome-nodejs
65,973 stars
This project is a community-driven directory that aggregates essential software projects and educational content for the Node.js ecosystem. It functions as a centralized knowledge base and discovery index, designed to simplify the navigation of a fragmented technical landscape by providing a structured collection of high-quality links, tools, and learning materials. The repository distinguishes itself through a decentralized, peer-reviewed curation model. By utilizing standard version control workflows and pull requests, the community ensures that all listed resources undergo human verificati
Presents efficient modules for reducing file sizes and managing archive formats within data-heavy applications.
awesome, awesome-list, javascript
fchollet/keras
64,095 stars
Keras is a high-level deep learning API used to design, build, and train neural networks for tasks such as computer vision, natural language processing, and time series forecasting. It provides a framework for defining model architectures and optimizing weights through a structured interface. The project is defined by a backend-agnostic design that allows the same model code to run across different compute engines. This multi-backend execution enables users to swap underlying engines to optimize for specific hardware or performance requirements. The system supports distributed model training
Supports various standardized dataset formats for organizing training data used in model development.
Python
keras-team/keras
64,094 stars
Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management. The project distinguishes itself as a multi-backend machine learning
Integrates utilities to load, preprocess, and format diverse data types for efficient training pipelines.
Python, data-science, deep-learning, jax
pathwaycom/pathway
62,959 stars
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Processes continuous data streams in real-time to facilitate immediate event-driven analytics.
Python, batch-processing, data-analytics, data-pipelines
docling-project/docling
61,674 stars
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Defines specific input types and file formats to ensure that documents are processed according to custom requirements.
Python, ai, convert, document-parser
xingshaocheng/architect-awesome
60,821 stars
This project serves as a comprehensive knowledge base and reference for distributed systems engineering and enterprise software architecture. It provides a structured collection of technical resources, design patterns, and methodologies intended to assist in the design, maintenance, and scaling of complex, high-performance software environments. The repository distinguishes itself by offering deep dives into core architectural concepts such as actor-based concurrency, aspect-oriented interception, and inversion-of-control containers. It emphasizes the practical application of distributed syst
Deploys enterprise-grade search platforms to provide advanced filtering, faceting, and relevance ranking for large-scale datasets.
PlexPt/awesome-chatgpt-prompts-zh
60,656 stars
This project is a community-driven library of structured text inputs designed to guide large language models into specific roles, behaviors, and operational modes. It functions as a comprehensive repository of prompt engineering resources, providing reusable templates that allow users to override default model tendencies and enforce domain-specific response patterns through instruction-following logic. The collection distinguishes itself by offering specialized persona-based directives that constrain model output to simulate professional experts or functional technical environments. By utiliz
Simulation prompts replicate search engine query syntax and indexing behaviors for testing and development purposes.
chat-gpt, chatgpt, chatgpt3
karpathy/nanoGPT
59,730 stars
nanoGPT is a lightweight engine for training and fine-tuning transformer-based language models from scratch. It provides a minimalist codebase designed for educational exploration and rapid experimentation with neural network architectures, utilizing self-attention and feed-forward layers to process sequences and predict subsequent elements. The project distinguishes itself through a focus on high-speed data ingestion and hardware-accelerated performance. It includes a dedicated pipeline for transforming raw text into memory-mapped binary files, which enables efficient streaming during traini
Stores data in memory-mapped binary structures to facilitate rapid sequential access during training.
Python
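The memory-mapped storage idea mentioned in this entry can be sketched roughly as follows. This is an illustration only, not nanoGPT's actual code: the file name `train.bin`, the 16-bit token width, the made-up token IDs, and the helpers `write_tokens` and `read_window` are all assumptions chosen for the example.

```python
import mmap
import os
import struct
import tempfile

TOKEN_FORMAT = "<H"  # assumption: every token ID fits in an unsigned 16-bit integer
TOKEN_SIZE = struct.calcsize(TOKEN_FORMAT)


def write_tokens(path, tokens):
    """Write token IDs to a flat binary file, one fixed-width integer each."""
    with open(path, "wb") as f:
        for token in tokens:
            f.write(struct.pack(TOKEN_FORMAT, token))


def read_window(path, start, length):
    """Read `length` tokens starting at index `start` through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the requested byte range is copied out of the mapping.
            chunk = mm[start * TOKEN_SIZE:(start + length) * TOKEN_SIZE]
    count = len(chunk) // TOKEN_SIZE
    return list(struct.unpack("<%dH" % count, chunk))


# Hypothetical usage with made-up token IDs.
demo_path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_tokens(demo_path, [10, 20, 30, 40, 50])
window = read_window(demo_path, 1, 3)
```

Because every token has the same width, the byte offset of any window is a simple multiplication, which is what makes sequential and random reads cheap in this layout.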
pathwaycom/llm-app
59,341 stars
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Ingests and processes information from diverse sources in real-time to ensure continuous visibility into changing data.
Jupyter Notebook, chatbot, hugging-face, llm
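
The idea of propagating only changes, rather than recomputing an entire dataset, can be illustrated with a small sketch. This is a simplified stand-in and does not use Pathway's API: the `IncrementalCount` class, its `apply` method, and the sample keys are hypothetical names invented for the example.

```python
from collections import Counter


class IncrementalCount:
    """Keeps per-key counts up to date by applying deltas only."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, changes):
        """Apply (key, diff) pairs; return the new count of each key touched."""
        touched = {}
        for key, diff in changes:
            self.counts[key] += diff
            if self.counts[key] == 0:
                del self.counts[key]  # the key has been fully retracted
            touched[key] = self.counts.get(key, 0)
        return touched


# Hypothetical usage: two small batches of insertions (+1) and retractions (-1).
state = IncrementalCount()
first = state.apply([("error", +1), ("info", +1), ("error", +1)])
second = state.apply([("info", -1), ("warn", +1)])
```

Each call does work proportional to the size of the batch, not to the size of the accumulated state, and downstream consumers only see the keys whose counts may have moved.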