15 repos

Awesome GitHub Repositories: Data Engineering

Infrastructure and frameworks used to build, manage, and scale complex systems for processing and analyzing large datasets.

Explore 15 awesome GitHub repositories matching data & databases · Data Engineering.

sindresorhus/awesome
438,690 stars on GitHub
This project is a community-curated knowledge base that organizes vast technical ecosystems into a hierarchical, human-readable directory. It serves as a comprehensive index of libraries, frameworks, and methodologies, designed to facilitate discovery and professional development across the entire spectrum of software
Tags: awesome, awesome-list, lists
vinta/awesome-python
283,687 stars on GitHub
This project is a comprehensive, community-curated directory that organizes a vast landscape of Python software libraries, frameworks, and tools. It serves as a centralized knowledge base designed to facilitate ecosystem navigation and accelerate developer discovery across the entire software development lifecycle. Th
Language: Python · Tags: awesome, collections, python
tensorflow/tensorflow
193,864 stars on GitHub
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The syst
Language: C++ · Tags: deep-learning, deep-neural-networks, distributed
d3/d3
112,379 stars on GitHub
D3 is a modular library providing low-level primitives for creating data-driven visualizations. It functions as a flexible framework that allows for direct control over visual presentation by mapping abstract data dimensions to graphical properties, such as position, color, and size, without imposing predefined chart a
Language: Shell · Tags: chart, charts, d3
pytorch/pytorch
97,601 stars on GitHub
PyTorch is a machine learning framework centered on a GPU-ready tensor library that supports multi-dimensional array operations across both CPU and accelerator hardware. It provides a foundational infrastructure for mathematical computation and dynamic neural network construction, utilizing a tape-based automatic diffe
Language: Python · Tags: autograd, deep-learning, gpu
microsoft/markitdown
87,305 stars on GitHub
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine
Language: Python · Tags: autogen, autogen-extension, langchain
macrozheng/mall
82,926 stars on GitHub
This project is an enterprise-grade Java framework designed for building scalable, full-stack e-commerce applications. It provides a comprehensive foundation for microservice-based distributed architectures, enabling the development of complex retail platforms that include product management, order processing, and secu
Language: Java · Tags: docker, elasticsearch, elk
elastic/elasticsearch
76,163 stars on GitHub
Elasticsearch is a distributed search engine and document store designed for the high-performance indexing and retrieval of massive volumes of unstructured data. It functions as a centralized analytics platform, providing a schema-flexible architecture that organizes information into searchable indices while maintainin
Language: Java · Tags: elasticsearch, java, search-engine
awesomedata/awesome-public-datasets
72,846 stars on GitHub
This project is a community-maintained, open-access directory of high-quality public datasets. It serves as a centralized reference point for researchers, developers, and data scientists to locate reliable information sources across a wide spectrum of industries and scientific fields. By providing a structured index, t
Tags: aaron-swartz, awesome-public-datasets, datasets
dair-ai/Prompt-Engineering-Guide
70,526 stars on GitHub
This project is a comprehensive educational resource and knowledge base dedicated to the development and application of large language models and autonomous agentic systems. It provides a structured framework for understanding prompt engineering, context management, and the architectural patterns required to build task
Language: MDX · Tags: agent, agents, ai-agents
minio/minio
60,346 stars on GitHub
MinIO is a software-defined, cloud-native object storage server designed to manage large volumes of unstructured data. It functions as a distributed storage cluster that aggregates multiple independent nodes into a unified, scalable pool, providing a high-performance infrastructure compatible with standard cloud storag
Language: Go · Tags: amazon-s3, cloud, cloudnative
pathwaycom/pathway
59,684 stars on GitHub
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with
Language: Python · Tags: batch-processing, data-analytics, data-pipelines
pathwaycom/llm-app
56,311 stars on GitHub
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transfo
Language: Jupyter Notebook · Tags: chatbot, hugging-face, llm
ultralytics/ultralytics
53,426 stars on GitHub
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification
Language: Python · Tags: cli, computer-vision, deep-learning
unslothai/unsloth
52,461 stars on GitHub
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade
Language: Python · Tags: agent, deepseek, deepseek-r1