What are the best Awesome Data Processing Pipelines GitHub Repositories?

Systems and workflows for ingesting, transforming, and orchestrating high-throughput data processing tasks. Explore 1,178 awesome GitHub repositories matching data & databases · Data Processing Pipelines. Refine with filters or upvote what's useful. Top picks: kamranahmedse/developer-roadmap, jwasham/coding-interview-university, donnemartin/system-design-primer, vinta/awesome-python, thealgorithms/python, vuejs/vue, tensorflow/tensorflow, n8n-io/n8n, significant-gravitas/autogpt, avelino/aweso…

Why is kamranahmedse/developer-roadmap a recommended Data Processing Pipelines repository?

Provides sequential access to elements within large data collections during processing.

Why is jwasham/coding-interview-university a recommended Data Processing Pipelines repository?

Reduces data footprint using encoding algorithms to enhance storage efficiency and transmission performance.

Why is donnemartin/system-design-primer a recommended Data Processing Pipelines repository?

Provides helper libraries and scripts that assist in the scheduling, monitoring, and management of batch processing jobs.

Why is vinta/awesome-python a recommended Data Processing Pipelines repository?

Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.

Why is TheAlgorithms/Python a recommended Data Processing Pipelines repository?

Shrinks digital information streams through encoding techniques to improve storage density and transmission speeds.

Why is vuejs/vue a recommended Data Processing Pipelines repository?

Renders filtered or sorted data sets using computed properties without modifying the original source.

Why is tensorflow/tensorflow a recommended Data Processing Pipelines repository?

Applies optimized routines to perform element-wise operations and shape manipulations on multi-dimensional data structures.

Why is n8n-io/n8n a recommended Data Processing Pipelines repository?

Eliminates redundant entries within data streams to maintain unique event records throughout automated sequences.

Why is Significant-Gravitas/AutoGPT a recommended Data Processing Pipelines repository?

Transforms unstructured keyword objects into structured, typed fields for metric analysis.

Why is avelino/awesome-go a recommended Data Processing Pipelines repository?

Streamlines reactive programming and data stream transformations using specialized toolkits.

1.2K repositories


kamranahmedse/developer-roadmap
357,434
Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project distinguishes itself through a collaborative ecosystem that lets users contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical competence, helping developers identify knowledge gaps and prepare for job interviews through targeted learning sequences. Beyond its core mapping capabilities, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a centralized space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
Provides sequential access to elements within large data collections during processing.
TypeScript · angular-roadmap · backend-roadmap · blockchain-roadmap
jwasham/coding-interview-university
353,639
This project is a comprehensive educational roadmap designed to guide software engineers through mastering computer science fundamentals and preparing for technical interviews. It provides a structured, dependency-aware learning path that organizes complex computing concepts into a hierarchical curriculum, allowing users to build a professional engineering foundation through iterative study and hands-on implementation. The curriculum distinguishes itself by integrating theoretical knowledge with career development, offering a unified index of cross-referenced resources including books, academic papers, and video tutorials. It emphasizes standardizing algorithmic efficiency through asymptotic complexity analysis and provides a granular, modular topic breakdown to facilitate focused, incremental learning across vast technical domains. Beyond core algorithms and data structures, the repository covers a broad capability surface that includes system architecture design, distributed systems, computer security, and advanced mathematical modeling. It also provides strategic guidance for the entire hiring lifecycle, from résumé optimization and behavioral interview preparation to long-term career growth. The entire knowledge base is maintained as a version-controlled, markdown-based repository, enabling a collaborative, platform-agnostic approach to technical education.
Reduces data footprint using encoding algorithms to enhance storage efficiency and transmission performance.
algorithm · algorithms · coding-interview
donnemartin/system-design-primer
353,387
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It provides a structured curriculum for mastering the scalability, reliability, and performance principles needed to design complex software systems. The repository distinguishes itself by offering a methodical approach to technical interview preparation, incorporating design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-based analysis, teaching users how to weigh competing requirements such as latency, consistency, and availability when drafting architectural designs. The content covers a broad spectrum of system design capabilities, including strategies for database scaling, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-layer caching, asynchronous communication, and service discovery, while providing frameworks for resource estimation and capacity planning. The documentation is organized as a study guide, offering a systematic path through the fundamentals of backend engineering and large-scale system design.
Provides helper libraries and scripts that assist in the scheduling, monitoring, and management of batch processing jobs.
Python · design · design-patterns · design-system
vinta/awesome-python
303,207
This project is a comprehensive, community-curated directory that organizes a vast landscape of Python libraries, frameworks, and software tools. It serves as a centralized knowledge base designed to ease navigation of the ecosystem and speed up developer discovery across the software development lifecycle. The directory distinguishes itself by providing a structured index of resources categorized by technical domain, ranging from fundamental development utilities to specialized engineering fields. It covers high-level capabilities including artificial intelligence, data science, web development, and infrastructure management, allowing developers to identify vetted solutions for specific technical challenges. The project spans a broad capability surface, including tools for dependency management, static code analysis, and automated testing. It also catalogs resources for persistent data storage, cloud infrastructure orchestration, and interface development, providing a unified reference for building and maintaining complex software systems.
Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.
Python · awesome · collections · python
TheAlgorithms/Python
221,992
This project is a comprehensive repository of verified computational implementations designed to serve as an educational resource for computer science and algorithmic problem solving. It provides a structured collection of code examples covering fundamental data structures, mathematical operations, and core programming concepts, allowing users to study the logic and complexity behind various computational methods. The repository distinguishes itself through a modular, reference-based implementation pattern that organizes code into logical namespaces. This approach facilitates independent execution and educational clarity, letting users explore the evolution of computational strategies from naive brute-force approaches to optimized, high-performance solutions. By decoupling data structure abstractions from algorithmic operations, the project ensures that implementations remain interchangeable and easy to analyze. The capability surface spans a wide range of technical domains, including machine learning, cryptography, scientific computing, and computer vision. It includes implementations for predictive modeling, neural networks, and statistical analysis, along with tools for digital signal processing, network flow management, and financial modeling. The collection also addresses specialized mathematical needs such as linear algebra, geometric calculations, and bit manipulation, providing a broad foundation for research and engineering applications.
Shrinks digital information streams through encoding techniques to improve storage density and transmission speeds.
Python · algorithm · algorithm-competitions · algorithms-implemented
vuejs/vue
209,900
Vue is a progressive, component-based JavaScript framework designed for building reactive user interfaces and single-page applications. It centers on a declarative template system that transforms HTML into efficient render functions, allowing developers to organize complex interfaces into isolated, reusable units that automatically stay in sync with application state. The framework distinguishes itself through a dependency-tracking reactivity system that monitors data access during rendering to trigger precise updates. It provides a flexible architecture that supports both incremental adoption as a lightweight library and large-scale application development. Developers can use a robust plugin-based extensibility model to inject global logic, while the framework's virtual DOM reconciliation ensures efficient interface updates by computing minimal mutations. Beyond its core rendering capabilities, the project includes a full set of tools for managing application state, URL-based routing, and server-side rendering. It offers broad support for component composition, content distribution, and animation management, along with built-in security measures such as automatic content escaping to prevent common vulnerabilities. The framework ships with official type declarations to support static analysis and can be installed through standard package managers or included directly in browser environments via script tags.
Renders filtered or sorted data sets using computed properties without modifying the original source.
TypeScript · framework · frontend · javascript
tensorflow/tensorflow
195,697
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics. The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads acr
Applies optimized routines to perform element-wise operations and shape manipulations on multi-dimensional data structures.
C++ · deep-learning · deep-neural-networks · distributed
n8n-io/n8n
192,772
n8n is a workflow automation platform that combines a visual interface with code-based extensibility to design, orchestrate, and manage automated processes. It provides a comprehensive suite of tools for data transformation, filtering, and storage, allowing users to build complex logic through conditional branching, looping, and sub-workflow execution. The platform supports both pre-built integration nodes and custom code execution in JavaScript or Python, enabling connectivity with a wide range of external services and APIs. The platform includes a suite of generative AI capabilities, such a
Eliminates redundant entries within data streams to maintain unique event records throughout automated sequences.
TypeScript · ai · apis · automation
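The n8n entry above mentions eliminating redundant entries from a data stream so that each event record stays unique. As a rough, generic illustration of that idea (this is not n8n's implementation or API; the function name `dedupe_stream` and the `id` field are assumptions made up for this sketch), order-preserving deduplication by key might look like this in Python:

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def dedupe_stream(events: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield each event whose key has not been seen before, preserving order."""
    seen = set()  # keys of events already emitted
    for event in events:
        k = key(event)
        if k in seen:
            continue  # redundant entry: skip it
        seen.add(k)
        yield event


# Hypothetical event records; the "id" field is an assumption for illustration.
events = [
    {"id": 1, "type": "created"},
    {"id": 2, "type": "updated"},
    {"id": 1, "type": "created"},
]
unique_events = list(dedupe_stream(events, key=lambda e: e["id"]))
```

Because the function is a generator, it can process a long stream one event at a time, though the set of seen keys still grows with the number of distinct events.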
Significant-Gravitas/AutoGPT
184,973
AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, including task scheduling, execution monitoring, and configuration management, while offering a marketplace for discovering and sharing community-built workflows. The project includes a legacy framework for command-line agent execution and an extensible component system for devel
Transforms unstructured keyword objects into structured, typed fields for metric analysis.
Python · ai · artificial-intelligence · autonomous-agents
avelino/awesome-go
175,576
This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains. The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
Streamlines reactive programming and data stream transformations using specialized toolkits.
Go · awesome · awesome-list · go
yt-dlp/yt-dlp
170,963
This project is a command-line media downloader designed for the systematic retrieval and organization of digital content from diverse online platforms. It functions as an extensible extraction engine that utilizes a declarative format-selection pipeline to automate the identification, merging, and downloading of specific audio and video streams based on user-defined criteria. The system distinguishes itself through a modular architecture that supports custom plugins and site-specific scripts, allowing for the bypass of platform restrictions and the handling of complex authentication challeng
Evaluates stream metadata against defined criteria to transform and restructure raw media into desired file formats.
Python · cli · downloader · python
huggingface/transformers
161,630
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
Structures keyword arguments by modality to ensure type-safe configuration and model-specific overrides during document processing.
Python · audio · deep-learning · deepseek
microsoft/markitdown
154,485
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Converts diverse document formats into structured text output by executing programmatic parsing logic to automate complex data extraction workflows.
Python · autogen · autogen-extension · langchain
langchain-ai/langchain
139,458
LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, multi-step AI workflows that manage state, memory, and tool execution. The project distinguishes itself through a durable execution runtime that maintains persistent state across long-running processes by checkpointing progress to external storage. It models agent workflows as directed graphs, allowing
Processes diverse binary and multimodal data types through unified interfaces designed for complex AI pipelines.
Python · agents · ai · ai-agents
mendableai/firecrawl
139,399
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Transforms unstructured web pages and documents into standardized, machine-readable formats using natural language prompts.
TypeScript
firecrawl/firecrawl
133,479
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Prepares raw web content for AI by converting it into clean, structured formats like markdown or JSON.
TypeScript · ai · ai-agents · ai-crawler
iptv-org/iptv
127,909
This project is a community-maintained, open-source repository that functions as a centralized directory for streaming metadata. It aggregates publicly available network stream links and organizes them into standardized, machine-readable playlist formats. By acting strictly as a metadata-only index, the platform enables users to access and organize live broadcast content across various third-party media playback applications without hosting or distributing any actual video files. The repository distinguishes itself through a collaborative, crowdsourced workflow where contributors actively mai
Merges distributed community updates into a unified, structured dataset of verified streaming links.
TypeScript · iptv · m3u · playlist
d3/d3
113,118
D3 is a modular library providing low-level primitives for creating data-driven visualizations. It functions as a flexible framework that allows for direct control over visual presentation by mapping abstract data dimensions to graphical properties, such as position, color, and size, without imposing predefined chart abstractions. The library distinguishes itself by offering specialized tools for complex data representation, including algorithmic layouts for hierarchical structures and geographic projection utilities for mapping spherical coordinates. It also includes a comprehensive suite fo
Comprehensive utilities handle the ordering, searching, summarizing, binning, and grouping of complex data sets.
Shell · chart · charts · d3
godotengine/godot
112,618
Godot is a comprehensive, node-based game engine designed for building interactive 2D and 3D applications. It provides an integrated development environment that utilizes a hierarchical scene system to organize objects, propagate spatial transformations, and manage lifecycle events. The engine functions as a cross-platform development suite, allowing developers to author, test, and export software to desktop, mobile, and web environments from a single, unified codebase. The engine distinguishes itself through a modular, component-based architecture that relies on signals-based decoupling for
Implements native data types for vectors, transforms, and arrays to enable high-performance mathematical operations.
C++ · game-development · game-engine · gamedev
mzabriskie/axios
109,096
Axios is a promise-based HTTP client used to make asynchronous network requests in both browser and Node.js environments. It functions as a multi-environment network adapter that abstracts the transport layer to ensure consistent behavior across different runtimes. The project distinguishes itself through a request lifecycle management system that allows for the cancellation of active requests, the setting of timeouts, and the monitoring of upload and download transfer progress. It includes a mechanism for intercepting network traffic, enabling the transformation of outgoing requests and inco
Implements automatic serialization of JavaScript objects into JSON, multipart form data, or URL-encoded formats for transmission.
JavaScript
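The Axios entry above describes serializing an object into JSON or URL-encoded form for transmission. To make the difference between those two encodings concrete, here is a small sketch using only the Python standard library. It is not Axios code and does not reproduce its behavior in detail; the `serialize` helper and the sample payload are hypothetical names introduced for illustration.

```python
import json
from urllib.parse import urlencode


def serialize(payload: dict, content_type: str) -> str:
    """Encode a flat dict as a request body for the given content type."""
    if content_type == "application/json":
        return json.dumps(payload)
    if content_type == "application/x-www-form-urlencoded":
        return urlencode(payload)
    raise ValueError(f"unsupported content type: {content_type}")


# Hypothetical payload used only for this example.
payload = {"name": "Ada Lovelace", "page": 2}

as_json = serialize(payload, "application/json")
# '{"name": "Ada Lovelace", "page": 2}'
as_form = serialize(payload, "application/x-www-form-urlencoded")
# 'name=Ada+Lovelace&page=2'
```

The same data produces two different wire formats, which is why an HTTP client has to pick the encoding that matches the request's content type.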
