What are the best Data Processing Pipelines GitHub repositories?

Systems and workflows for ingesting, transforming, and orchestrating high-throughput data processing tasks. Explore 1,176 awesome GitHub repositories matching data & databases · Data Processing Pipelines. Refine with filters or upvote what's useful. Top picks: kamranahmedse/developer-roadmap, jwasham/coding-interview-university, donnemartin/system-design-primer, vinta/awesome-python, thealgorithms/python, vuejs/vue, tensorflow/tensorflow, n8n-io/n8n, significant-gravitas/autogpt, avelino/aweso…

Why is kamranahmedse/developer-roadmap a recommended Data Processing Pipelines repository?

Provides sequential access to elements within large data collections during processing.

Why is jwasham/coding-interview-university a recommended Data Processing Pipelines repository?

Reduces data footprint using encoding algorithms to enhance storage efficiency and transmission performance.

Why is donnemartin/system-design-primer a recommended Data Processing Pipelines repository?

Provides helper libraries and scripts that assist in the scheduling, monitoring, and management of batch processing jobs.

Why is vinta/awesome-python a recommended Data Processing Pipelines repository?

Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.

Why is TheAlgorithms/Python a recommended Data Processing Pipelines repository?

Shrinks digital information streams through encoding techniques to improve storage density and transmission speeds.

Why is vuejs/vue a recommended Data Processing Pipelines repository?

Renders filtered or sorted data sets using computed properties without modifying the original source.

Why is tensorflow/tensorflow a recommended Data Processing Pipelines repository?

Applies optimized routines to perform element-wise operations and shape manipulations on multi-dimensional data structures.

Why is n8n-io/n8n a recommended Data Processing Pipelines repository?

Eliminates redundant entries within data streams to maintain unique event records throughout automated sequences.

Why is Significant-Gravitas/AutoGPT a recommended Data Processing Pipelines repository?

Transforms unstructured keyword objects into structured, typed fields for metric analysis.

Why is avelino/awesome-go a recommended Data Processing Pipelines repository?

Streamlines reactive programming and data stream transformations using specialized toolkits.

kamranahmedse/developer-roadmap
357,434 stars
Developer Roadmap is a community-driven platform that offers structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project stands out for a collaborative ecosystem that lets users contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical competence, helping developers identify knowledge gaps and prepare for professional interviews through targeted learning sequences. Beyond its core mapping capabilities, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a centralized space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
Provides sequential access to elements within large data collections during processing.
TypeScript · angular-roadmap · backend-roadmap · blockchain-roadmap
jwasham/coding-interview-university
353,639 stars
This project is a comprehensive educational roadmap designed to guide software engineers through mastering computer science fundamentals and preparing for technical interviews. It offers a structured, dependency-aware learning path that organizes complex computing concepts into a hierarchical curriculum, allowing users to build a professional engineering foundation through iterative study and hands-on implementation. The curriculum stands out by integrating theoretical knowledge with professional development, offering a unified index of cross-referenced resources, including books, academic papers, and video tutorials. It emphasizes standardizing algorithmic efficiency through asymptotic complexity analysis and offers a granular, modular breakdown of topics to support focused, incremental learning across vast technical domains. Beyond core algorithms and data structures, the repository covers a broad range of capabilities, including system architecture design, distributed systems, computer security, and advanced mathematical modeling. It also offers strategic guidance for the entire hiring cycle, from résumé optimization and behavioral interview preparation to long-term career growth. The entire knowledge base is maintained as a version-controlled, markdown-based repository, enabling a platform-agnostic and collaborative approach to technical education.
Reduces data footprint using encoding algorithms to enhance storage efficiency and transmission performance.
algorithm · algorithms · coding-interview
donnemartin/system-design-primer
353,387 stars
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It offers a structured curriculum for mastering the scalability, reliability, and performance principles needed to design complex software systems. The repository stands out by offering a methodical approach to technical interview preparation, incorporating design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-based analysis, teaching users how to weigh competing requirements such as latency, consistency, and availability when sketching architectural designs. The content covers a broad spectrum of system design capabilities, including strategies for database scaling, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-tier caching, asynchronous communication, and service discovery, while also offering frameworks for resource estimation and capacity planning. The documentation is organized as a study guide, offering a systematic path through backend engineering fundamentals and large-scale system design.
Provides helper libraries and scripts that assist in the scheduling, monitoring, and management of batch processing jobs.
Python · design · design-patterns · design-system
vinta/awesome-python
303,207 stars
This project is a comprehensive, community-curated directory that organizes a vast landscape of Python libraries, frameworks, and software tools. It serves as a centralized knowledge base designed to ease navigation of the ecosystem and speed up developer discovery across the entire software development lifecycle. The directory stands out by providing a structured index of resources categorized by technical domain, ranging from fundamental development utilities to specialized engineering fields. It covers high-level capabilities, including artificial intelligence, data science, web development, and infrastructure management, allowing developers to identify vetted solutions for specific technical challenges. The project spans a broad range of capabilities, including tools for dependency management, static code analysis, and automated testing. It also catalogs resources for persistent data storage, cloud infrastructure orchestration, and interface development, offering a unified reference for building and maintaining complex software systems.
Enables fast, relevant query results across datasets through high-performance indexing and full-text search capabilities.
Python · awesome · collections · python
TheAlgorithms/Python
221,992 stars
This project is a comprehensive repository of verified computational implementations, designed to serve as an educational resource for computer science and algorithmic problem solving. It offers a structured collection of code examples covering fundamental data structures, mathematical operations, and core programming concepts, allowing users to study the logic and complexity behind various computational methods. The repository stands out through a modular, reference-based implementation pattern that organizes code into logical namespaces. This approach supports independent execution and educational clarity, allowing users to explore the evolution of computational strategies from naive brute-force approaches to optimized, high-performance solutions. By decoupling data structure abstractions from algorithmic operations, the project ensures that implementations remain interchangeable and easy to analyze. The range of capabilities covers a wide array of technical domains, including machine learning, cryptography, scientific computing, and computer vision. It includes implementations for predictive modeling, neural networks, and statistical analysis, alongside tools for digital signal processing, network flow management, and financial modeling. The collection also addresses specialized mathematical needs, such as linear algebra, geometric calculations, and bit manipulation, offering a broad foundation for research and engineering applications.
Shrinks digital information streams through encoding techniques to improve storage density and transmission speeds.
Python · algorithm · algorithm-competitions · algorithms-implemented
vuejs/vue
209,900 stars
Vue is a progressive, component-based JavaScript framework designed for building reactive user interfaces and single-page applications. It centers on a declarative template system that compiles HTML into efficient render functions, allowing developers to organize complex interfaces into isolated, reusable units that automatically synchronize with application state. The framework stands out through a reactivity system based on dependency tracking, which monitors data access during rendering to trigger precise updates. It offers a flexible architecture that supports both incremental adoption as a lightweight library and large-scale application development. Developers can use a robust, plugin-based extensibility model to inject global logic, while the framework's virtual DOM reconciliation ensures efficient interface updates by computing minimal mutations. Beyond its core rendering capabilities, the project includes a comprehensive suite of tools for application state management, URL-based routing, and server-side rendering. It offers extensive support for component composition, content distribution, and animation handling, alongside built-in security measures such as automatic content escaping to prevent common vulnerabilities. The framework ships with official type declarations to support static analysis and can be installed through standard package managers or integrated directly into browser environments via script tags.
Renders filtered or sorted data sets using computed properties without modifying the original source.
TypeScript · framework · frontend · javascript
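
The computed-property idea noted for vuejs/vue, deriving a filtered or sorted view while leaving the source untouched, can be illustrated outside Vue as well. The following is a minimal Python analogy, not Vue's API; the `Inventory` class, its fields, and the sample data are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Inventory:
    """Hypothetical container; `items` plays the role of the source data."""

    items: list[dict] = field(default_factory=list)
    min_price: float = 0.0

    @property
    def visible_items(self) -> list[dict]:
        # Derived view: filter first, then sort into a NEW list.
        # sorted() does not mutate its input, so self.items keeps its order.
        kept = (item for item in self.items if item["price"] >= self.min_price)
        return sorted(kept, key=lambda item: item["price"])


inventory = Inventory(
    items=[
        {"name": "dock", "price": 120.0},
        {"name": "adapter", "price": 4.0},
        {"name": "cable", "price": 9.0},
    ],
    min_price=5.0,
)

view_names = [item["name"] for item in inventory.visible_items]
source_names = [item["name"] for item in inventory.items]
```

Unlike a Vue computed property, this sketch recomputes the view on every access rather than caching it until a dependency changes.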
tensorflow/tensorflow
195,697 stars
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics. The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads acr
Applies optimized routines to perform element-wise operations and shape manipulations on multi-dimensional data structures.
C++ · deep-learning · deep-neural-networks · distributed
n8n-io/n8n
192,772 stars
n8n is a workflow automation platform that combines a visual interface with code-based extensibility to design, orchestrate, and manage automated processes. It provides a comprehensive suite of tools for data transformation, filtering, and storage, allowing users to build complex logic through conditional branching, looping, and sub-workflow execution. The platform supports both pre-built integration nodes and custom code execution in JavaScript or Python, enabling connectivity with a wide range of external services and APIs. The platform includes a suite of generative AI capabilities, such a
Eliminates redundant entries within data streams to maintain unique event records throughout automated sequences.
TypeScript · ai · apis · automation
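
Removing redundant entries from a stream, as described for n8n-io/n8n, usually comes down to remembering which keys have already been seen. Below is a minimal Python sketch under that assumption; it is not n8n's implementation, and the `dedupe` helper and the event shape are hypothetical.

```python
from collections.abc import Hashable, Iterable, Iterator
from typing import Callable


def dedupe(events: Iterable[dict], key: Callable[[dict], Hashable]) -> Iterator[dict]:
    """Yield only the first event seen for each key, preserving arrival order."""
    seen = set()
    for event in events:
        event_key = key(event)
        if event_key in seen:
            continue  # duplicate: drop it
        seen.add(event_key)
        yield event


# Hypothetical events; the third one repeats id 1.
events = [
    {"id": 1, "type": "created"},
    {"id": 2, "type": "created"},
    {"id": 1, "type": "created"},
]
unique_ids = [event["id"] for event in dedupe(events, key=lambda e: e["id"])]
```

Because the `seen` set grows with the number of distinct keys, a long-running stream would need some eviction policy (for example a time window), which this sketch leaves out.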
Significant-Gravitas/AutoGPT
184,973 stars
AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, including task scheduling, execution monitoring, and configuration management, while offering a marketplace for discovering and sharing community-built workflows. The project includes a legacy framework for command-line agent execution and an extensible component system for devel
Transforms unstructured keyword objects into structured, typed fields for metric analysis.
Python · ai · artificial-intelligence · autonomous-agents
avelino/awesome-go
175,576 stars
This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains. The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
Streamlines reactive programming and data stream transformations using specialized toolkits.
Go · awesome · awesome-list · go
yt-dlp/yt-dlp
170,963 stars
This project is a command-line media downloader designed for the systematic retrieval and organization of digital content from diverse online platforms. It functions as an extensible extraction engine that utilizes a declarative format-selection pipeline to automate the identification, merging, and downloading of specific audio and video streams based on user-defined criteria. The system distinguishes itself through a modular architecture that supports custom plugins and site-specific scripts, allowing for the bypass of platform restrictions and the handling of complex authentication challeng
Evaluates stream metadata against defined criteria to transform and restructure raw media into desired file formats.
Python · cli · downloader · python
huggingface/transformers
161,630 stars
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
Structures keyword arguments by modality to ensure type-safe configuration and model-specific overrides during document processing.
Python · audio · deep-learning · deepseek
microsoft/markitdown
154,485 stars
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Converts diverse document formats into structured text output by executing programmatic parsing logic to automate complex data extraction workflows.
Python · autogen · autogen-extension · langchain
langchain-ai/langchain
139,458 stars
LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, multi-step AI workflows that manage state, memory, and tool execution. The project distinguishes itself through a durable execution runtime that maintains persistent state across long-running processes by checkpointing progress to external storage. It models agent workflows as directed graphs, allowing
Processes diverse binary and multimodal data types through unified interfaces designed for complex AI pipelines.
Python · agents · ai · ai-agents
mendableai/firecrawl
139,399 stars
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Transforms unstructured web pages and documents into standardized, machine-readable formats using natural language prompts.
TypeScript
firecrawl/firecrawl
133,479 stars
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Prepares raw web content for AI by converting it into clean, structured formats like markdown or JSON.
TypeScript · ai · ai-agents · ai-crawler
iptv-org/iptv
127,909 stars
This project is a community-maintained, open-source repository that functions as a centralized directory for streaming metadata. It aggregates publicly available network stream links and organizes them into standardized, machine-readable playlist formats. By acting strictly as a metadata-only index, the platform enables users to access and organize live broadcast content across various third-party media playback applications without hosting or distributing any actual video files. The repository distinguishes itself through a collaborative, crowdsourced workflow where contributors actively mai
Merges distributed community updates into a unified, structured dataset of verified streaming links.
TypeScript · iptv · m3u · playlist
d3/d3
113,118 stars
D3 is a modular library providing low-level primitives for creating data-driven visualizations. It functions as a flexible framework that allows for direct control over visual presentation by mapping abstract data dimensions to graphical properties, such as position, color, and size, without imposing predefined chart abstractions. The library distinguishes itself by offering specialized tools for complex data representation, including algorithmic layouts for hierarchical structures and geographic projection utilities for mapping spherical coordinates. It also includes a comprehensive suite fo
Comprehensive utilities handle the ordering, searching, summarizing, binning, and grouping of complex data sets.
Shell · chart · charts · d3
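
Grouping, summarizing, and binning, three of the operations listed for d3/d3, are easy to show on a small data set. The sketch below uses only the Python standard library rather than D3 itself; the sensor readings and the bin width are made-up illustration values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sensor, value) readings.
readings = [("a", 3.0), ("b", 10.0), ("a", 5.0), ("b", 14.0), ("a", 7.0)]

# Grouping: collect the values that share a sensor name.
groups = defaultdict(list)
for sensor, value in readings:
    groups[sensor].append(value)

# Summarizing: reduce each group to a single statistic.
averages = {sensor: mean(values) for sensor, values in groups.items()}

# Binning: count values per fixed-width bin, where bin i covers
# the half-open interval [i * bin_width, (i + 1) * bin_width).
bin_width = 5.0
bin_counts = defaultdict(int)
for _, value in readings:
    bin_counts[int(value // bin_width)] += 1
```

Here `averages` maps each sensor to its mean reading, and `bin_counts` maps a bin index to the number of readings that fall inside it.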
godotengine/godot
112,618 stars
Godot is a comprehensive, node-based game engine designed for building interactive 2D and 3D applications. It provides an integrated development environment that utilizes a hierarchical scene system to organize objects, propagate spatial transformations, and manage lifecycle events. The engine functions as a cross-platform development suite, allowing developers to author, test, and export software to desktop, mobile, and web environments from a single, unified codebase. The engine distinguishes itself through a modular, component-based architecture that relies on signals-based decoupling for
Implements native data types for vectors, transforms, and arrays to enable high-performance mathematical operations.
C++ · game-development · game-engine · gamedev
mzabriskie/axios
109,096 stars
Axios is a promise-based HTTP client used to make asynchronous network requests in both browser and Node.js environments. It functions as a multi-environment network adapter that abstracts the transport layer to ensure consistent behavior across different runtimes. The project distinguishes itself through a request lifecycle management system that allows for the cancellation of active requests, the setting of timeouts, and the monitoring of upload and download transfer progress. It includes a mechanism for intercepting network traffic, enabling the transformation of outgoing requests and inco
Implements automatic serialization of JavaScript objects into JSON, multipart form data, or URL-encoded formats for transmission.
JavaScript
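
The serialization step described for Axios, turning an in-memory object into JSON or URL-encoded text before sending it, can be sketched with Python's standard library. This is an illustration of the idea only, not Axios code; the `serialize` helper and the sample payload are hypothetical, and multipart form data is left out for brevity.

```python
import json
from urllib.parse import urlencode


def serialize(payload: dict, content_type: str) -> str:
    """Encode `payload` as a request body according to the content type."""
    if content_type == "application/json":
        return json.dumps(payload)
    if content_type == "application/x-www-form-urlencoded":
        return urlencode(payload)
    raise ValueError(f"unsupported content type: {content_type}")


payload = {"user": "ada", "page": 2}
as_json = serialize(payload, "application/json")
as_form = serialize(payload, "application/x-www-form-urlencoded")
```

With this payload, `as_form` is the string `user=ada&page=2`, while `as_json` round-trips back to the original dictionary through `json.loads`.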