What are the best Data Engineering and Infrastructure GitHub repositories?

Foundational tools for large-scale data collection, ingestion, storage management, and reliability. Explore 1,324 awesome GitHub repositories matching data & databases · Data Engineering and Infrastructure. Refine with filters or upvote what's useful. Top picks: openclaw/openclaw, kamranahmedse/developer-roadmap, donnemartin/system-design-primer, vinta/awesome-python, torvalds/linux, trimstray/the-book-of-secret-knowledge, affaan-m/ecc, significant-gravitas/autogpt, jackfrued/python-100-days,…

Why is openclaw/openclaw a recommended Data Engineering and Infrastructure repository?

Exports portable backups of workspace data, authentication credentials, and gateway configurations.

Why is kamranahmedse/developer-roadmap a recommended Data Engineering and Infrastructure repository?

Configures expiration policies for cached data to balance performance and data freshness.

Why is donnemartin/system-design-primer a recommended Data Engineering and Infrastructure repository?

Details mechanisms for storing frequently accessed data in memory to reduce latency and backend processing requirements.

Why is vinta/awesome-python a recommended Data Engineering and Infrastructure repository?

Boost system performance by memoizing frequently accessed data within memory-efficient storage structures.

Why is torvalds/linux a recommended Data Engineering and Infrastructure repository?

Manages filesystem operations to provide consistent data access and storage organization across physical media.

Why is trimstray/the-book-of-secret-knowledge a recommended Data Engineering and Infrastructure repository?

Navigate and manage file systems through terminal-based interfaces that simplify directory operations.

Why is affaan-m/ecc a recommended Data Engineering and Infrastructure repository?

Manages the persistent storage of session summaries and learned skills under configurable root directories.

Why is significant-gravitas/autogpt a recommended Data Engineering and Infrastructure repository?

Coordinates the full lifecycle of CSV data imports through dedicated creation, listing, and retrieval methods.

Why is jackfrued/python-100-days a recommended Data Engineering and Infrastructure repository?

Understand the fundamentals of web scraping, including ethical considerations and essential toolsets for data extraction.

Why is microsoft/markitdown a recommended Data Engineering and Infrastructure repository?

Interprets diverse file formats and generates structured, context-aware Markdown output using advanced language models.

1.3K repositories

openclaw/openclaw
380,031
Openclaw is a platform for managing agent execution environments, providing the infrastructure to control agent lifecycles, session state, and workspace persistence. It features a central gateway that handles model loops, tool invocation, and streaming events, with support for multi-agent routing and persistent memory management. The system is designed to unify tool execution signatures and provide a standard interface for compatibility across different providers. The platform includes extensive developer tooling, such as a command-line interface for workspace management, diagnostic logging, and a plugin architecture that allows registering custom tools and capabilities. It supports automated workflows through event-driven hooks, task scheduling, and integration with external services. Security is managed through execution policies, credential portability, and approval workflows for agent actions. Deployment is supported through automated infrastructure installers and container-based gateway helpers, with built-in tools for backup and configuration management. The system provides structured orchestration for organizing multi-step workflows and includes specialized tools for browser automation and structured code patching.
Exports portable backups of workspace data, authentication credentials, and gateway configurations.
TypeScript · ai · assistant · crustacean
kamranahmedse/developer-roadmap
357,434
Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project features a collaborative ecosystem that lets users contribute to roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical proficiency, helping developers identify knowledge gaps and prepare for professional interviews through targeted learning sequences. Beyond its core mapping capabilities, the platform offers practical project ideas and interactive tutorials to reinforce engineering concepts. It also provides a central space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
Configures expiration policies for cached data to balance performance and data freshness.
TypeScript · angular-roadmap · backend-roadmap · blockchain-roadmap
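The expiration policies mentioned in the entry above can be illustrated with a small time-to-live (TTL) cache. This is a generic sketch, not code from the listed repository; the `TTLCache` class, its method names, and the injectable `clock` parameter are all hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with the moment it stops being fresh.
        self._store[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it so the caller refetches from the source.
            del self._store[key]
            return None
        return value
```

A shorter TTL favors freshness at the cost of more backend lookups; a longer TTL favors performance at the risk of serving stale data.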
donnemartin/system-design-primer
353,387
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It provides a structured curriculum for mastering the scalability, reliability, and performance principles required to design complex software systems. The repository offers a systematic approach to technical interview preparation, combining design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-based analysis, teaching users how to weigh competing requirements such as latency, consistency, and availability when formulating architectural designs. The content covers a broad spectrum of system design capabilities, including database scaling strategies, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-layer caching, asynchronous communication, and service discovery, and provides frameworks for resource estimation and capacity planning. The documentation is organized as a study guide, offering a systematic path through the fundamentals of backend engineering and large-scale system design.
Details mechanisms for storing frequently accessed data in memory to reduce latency and backend processing requirements.
Python · design · design-patterns · design-system
vinta/awesome-python
303,207
This project is a comprehensive, community-curated directory that organizes a broad landscape of Python libraries, frameworks, and software tools. It serves as a central knowledge base designed to ease navigation of the ecosystem and speed up developer discovery across the entire software development lifecycle. The directory provides a structured index of resources categorized by technical domain, from core development tools to specialized engineering fields. It covers high-level capabilities including artificial intelligence, data science, web development, and infrastructure management, allowing developers to identify reliable solutions to specific technical challenges. The project spans a wide range of capabilities, including dependency management tools, static code analysis, and automated testing. It also indexes resources for persistent data storage, cloud infrastructure orchestration, and interface development, providing a unified reference for building and maintaining complex software systems.
Boost system performance by memoizing frequently accessed data within memory-efficient storage structures.
Python · awesome · collections · python
torvalds/linux
237,355
The Linux kernel is a monolithic operating system kernel that manages hardware resources, memory, and process scheduling across diverse computing architectures. It provides a standard POSIX-compliant environment for running applications while maintaining a modular driver framework that allows hardware interfaces to be loaded and removed dynamically. The project features a high-performance concurrency toolkit that uses lock-free synchronization primitives and read-copy-update mechanisms to manage access to shared data in multi-core environments. It includes a comprehensive kernel tracing and instrumentation suite that enables non-intrusive monitoring of system events, function execution, and latency metrics. Furthermore, the kernel enforces strict interface stability guarantees and lifecycle tracking to ensure backward compatibility for dependent applications. Beyond its core identity, the system includes extensive capabilities for hardware abstraction, network protocol implementation, and security policy enforcement. It supports specialized engineering requirements through power state management, embedded systems optimizations, and firmware-based boot processes. The architecture also features robust diagnostic frameworks for memory analysis, system execution verification, and validation of concurrent programming models. The source repository provides a complete build system for turning code into executable binary images, including tools for selecting kernel features and tuning the configuration to fit specific hardware requirements.
Manages filesystem operations to provide consistent data access and storage organization across physical media.
C
trimstray/the-book-of-secret-knowledge
228,641
This project serves as a central, community-driven repository of technical knowledge and administrative resources. It provides a structured taxonomy that gathers disparate information into a searchable framework, supporting continuous learning and rapid problem solving for system administrators and cybersecurity practitioners. By mapping resources across offensive security, infrastructure management, and software development, it offers a unified path for skill acquisition and professional reference. The project is defined by a command-line-first design philosophy, prioritizing terminal-based tools and scriptable interfaces to enable efficient system administration and repeatable security workflows. It takes a platform-independent approach, maintaining documentation and operational guides that remain applicable across diverse Unix and cloud-based environments. This modular toolkit integration lets users assemble custom environments tailored to specific administrative or security tasks. The repository covers a wide range of capabilities, including comprehensive toolkits for system auditing, network management, and infrastructure hardening. It provides structured learning paths for developing cybersecurity skills, ranging from ethical hacking labs and penetration testing standards to vulnerability assessment and system configuration best practices. The collection also includes a wide array of productivity tools, diagnostic utilities, and educational materials designed to streamline routine maintenance and strengthen overall security posture.
Navigate and manage file systems through terminal-based interfaces that simplify directory operations.
awesome · awesome-list · bsd
affaan-m/ECC
221,981
ECC is an LLM agent orchestration framework and cross-platform AI toolkit designed to coordinate multi-model workflows. It provides a system for managing specialized agent roles, reusable skills, and structured planning to carry out complex software development tasks across different AI-powered code editors. The project acts as a Model Context Protocol manager, providing a configuration layer for integrating external servers and auditing tool execution. It also implements an agentic security sandbox that restricts access to sensitive files and scans for secret leaks to secure autonomous workflows. The framework covers broad capability areas including AI coding workflow automation with test-driven development guardrails, model cost optimization through intelligent routing, and state-isolated memory management. It also includes tools for enforcing language-specific coding standards and managing agent behaviors across different integrated development environments. The system is managed through a command-line interface that handles tool installation, configuration repair, and deployment of tool presets.
Manages the persistent storage of session summaries and learned skills under configurable root directories.
JavaScript
Significant-Gravitas/AutoGPT
184,973
AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, including task scheduling, execution monitoring, and configuration management, while offering a marketplace for discovering and sharing community-built workflows. The project includes a legacy framework for command-line agent execution and an extensible component system for devel
Coordinates the full lifecycle of CSV data imports through dedicated creation, listing, and retrieval methods.
Python · ai · artificial-intelligence · autonomous-agents
jackfrued/Python-100-Days
183,425
This project is a comprehensive, day-by-day curriculum designed to guide learners through the Python programming language and its professional applications. The content spans from fundamental syntax and object-oriented design to advanced topics including database management, web development, data analysis, and machine learning. The curriculum is structured into distinct modules that cover practical software engineering practices, such as version control, containerization, and system architecture. It also provides resources for technical interview preparation and an analysis of career paths wi
Understand the fundamentals of web scraping, including ethical considerations and essential toolsets for data extraction.
Jupyter Notebook
microsoft/markitdown
154,485
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Interprets diverse file formats and generates structured, context-aware Markdown output using advanced language models.
Python · autogen · autogen-extension · langchain
langchain-ai/langchain
139,458
LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, multi-step AI workflows that manage state, memory, and tool execution. The project distinguishes itself through a durable execution runtime that maintains persistent state across long-running processes by checkpointing progress to external storage. It models agent workflows as directed graphs, allowing
Organize directory hierarchies to manage machine-specific state and persistent application data effectively.
Python · agents · ai · ai-agents
mendableai/firecrawl
139,399
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Navigates through entire websites to convert unstructured content into formats optimized for language models.
TypeScript
firecrawl/firecrawl
133,479
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Transforms unstructured web pages into clean, structured formats specifically optimized for language model ingestion.
TypeScript · ai · ai-agents · ai-crawler
Chalarangelo/30-seconds-of-code
128,121
30-seconds-of-code is a comprehensive knowledge base and programming snippet library designed to support software engineering education and professional development. It provides a curated collection of reusable code units and technical guides that help developers master core language mechanics, design patterns, and architectural philosophies. The project distinguishes itself by offering a wide-ranging library of algorithmic solutions and web development patterns that are organized into modular, independently testable units. It emphasizes functional programming paradigms and declarative logic,
Provides tools for serializing and persisting data to the local file system.
JavaScript · astro · awesome-list · css
excalidraw/excalidraw
125,451
This project is a virtual whiteboard component and vector graphics editor designed for creating diagrams with a hand-drawn aesthetic. It provides a canvas-based drawing engine that can be embedded directly into web applications, allowing users to manipulate shapes, upload images, and export visual data into standard formats like PNG, SVG, or JSON. The platform distinguishes itself through a real-time synchronization layer that supports multi-user collaboration across distributed environments. This engine utilizes end-to-end encryption to secure shared sessions and employs a local-first data p
Leverages browser-based storage to maintain application state locally, ensuring data availability and persistence even during offline operation.
TypeScript · canvas · collaboration · diagrams
kubernetes/kubernetes
123,197
Kubernetes is a distributed container orchestration platform that automates the deployment, scaling, and management of containerized applications across clusters of computing nodes. It functions as a declarative infrastructure controller, utilizing a control loop architecture that continuously monitors the current system state against user-defined configurations to ensure desired operational outcomes. The system relies on a centralized API-driven interface and a replicated key-value store to maintain a consistent source of truth for all cluster objects. The platform distinguishes itself throu
Maintains a consistent, replicated data store that serves as the reliable source of truth for distributed system states.
Go · cncf · containers · go
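The control loop idea described in the entry above (compare observed state against a declared desired state, then act on the difference) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Kubernetes code: `reconcile`, `control_loop`, and the replica-count dictionaries are hypothetical, and "applying" an action here only updates an in-memory dictionary.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Return the actions needed to move actual replica counts toward desired ones."""
    actions = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
    return actions


def control_loop(desired: Dict[str, int], actual: Dict[str, int],
                 max_iterations: int = 10) -> Dict[str, int]:
    """Repeat reconcile steps until the observed state matches the desired state."""
    state = dict(actual)  # work on a copy so the caller's observation is untouched
    for _ in range(max_iterations):
        if not reconcile(desired, state):
            break  # nothing left to do: state has converged
        # Simulated "apply": set declared workloads and remove undeclared ones.
        for name, want in desired.items():
            state[name] = want
        for name in list(state):
            if name not in desired:
                del state[name]
    return state
```

The useful property of this design is that the loop is driven by the declared target rather than by a sequence of commands, so running it again after any drift produces the same end state.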
comfyanonymous/ComfyUI
117,322
ComfyUI is a modular generative AI workflow orchestrator and node-based GUI for designing and executing complex diffusion model pipelines. It functions as both a visual interface for building generative logic graphs and a programmable backend API that exposes diffusion model operations for external integration. The system distinguishes itself through a graph-based execution model that supports differential workflow execution, re-running only modified nodes to reduce computation. It features dynamic model offloading to manage memory between system RAM and GPU VRAM and utilizes metadata-embedde
Enables saving and loading generation graphs as JSON files or extracting metadata from image and audio files.
Python
papers-we-love/papers-we-love
107,093
Papers We Love is a community-driven repository and learning network dedicated to the study and discussion of foundational computer science literature. It functions as a centralized educational archive, providing a structured environment where software professionals can engage with academic research to bridge the gap between theoretical concepts and practical application. The project distinguishes itself through a decentralized model of crowdsourced curation, where community members collectively maintain and categorize a vast index of technical resources. Beyond the repository itself, the ini
Parses documentation for external links to facilitate the retrieval of research documents for offline reading.
Shell · awesome · computer-science · meetup
immich-app/immich
104,236
Immich is a self-hosted media management platform designed to provide a centralized, private repository for photos and videos. It functions as a comprehensive system for organizing, backing up, and viewing personal media collections across mobile devices, web browsers, and external storage locations. By maintaining full control over data ownership and storage infrastructure, the platform ensures that users retain sovereignty over their digital assets. The system distinguishes itself through a distributed architecture that coordinates background media synchronization, real-time filesystem moni
Manages automated scheduling, retention policies, and manual triggers to protect essential system metadata and database snapshots.
TypeScript · backup-tool · flutter · google-photos
pytorch/pytorch
100,814
PyTorch is a machine learning framework centered on a GPU-ready tensor library that supports multi-dimensional array operations across both CPU and accelerator hardware. It provides a foundational infrastructure for mathematical computation and dynamic neural network construction, utilizing a tape-based automatic differentiation system that allows for flexible, non-static graph execution. The framework is designed for deep integration with Python, enabling natural usage alongside standard scientific computing ecosystems. It distinguishes itself through a comprehensive distributed training sui
Persists tensors and complex data structures to disk through native loading and saving mechanisms.
Python · autograd · deep-learning · gpu
