# ELT Data Pipeline Tools

> Search results for `ELT tool that loads raw data then transforms it in the warehouse` on awesome-repositories.com. 114 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/elt-tool-that-loads-raw-data-then-transforms-it-in-the-warehouse

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/elt-tool-that-loads-raw-data-then-transforms-it-in-the-warehouse).**

## Results

- [corentinth/it-tools](https://awesome-repositories.com/repository/corentinth-it-tools.md) (39,343 ⭐) — This project is a browser-based developer toolkit that provides a collection of offline-first utilities for common data transformations and encoding tasks. It functions as a static web application, bundling multiple independent productivity tools into a single-page interface designed for rapid technical task execution.

The suite operates entirely on the client side, ensuring that all data processing occurs locally within the user browser without requiring a backend server or external service dependencies. This architecture prioritizes privacy and security by keeping sensitive information off
- [dbt-labs/dbt-core](https://awesome-repositories.com/repository/dbt-labs-dbt-core.md) (13,051 ⭐) — dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history.

The project distinguishes itself through an adapter-based d
- [aws/aws-cdk](https://awesome-repositories.com/repository/aws-aws-cdk.md) (12,817 ⭐) — The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane.

The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
- [airbytehq/airbyte](https://awesome-repositories.com/repository/airbytehq-airbyte.md) (21,472 ⭐) — Airbyte is a data integration platform designed to synchronize information between diverse applications, databases, and data warehouses. It functions as an extract, transform, and load orchestrator that manages automated data movement workflows across cloud, on-premise, and hybrid environments. The platform provides a standardized interface for connectors, enabling the movement of structured and unstructured data while maintaining stateful checkpoints for reliable incremental syncing.

The platform distinguishes itself through a containerized architecture that isolates connectors to prevent de
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and
- [huggingface/transformers.js](https://awesome-repositories.com/repository/huggingface-transformers-js.md) (15,420 ⭐) — This library is a web-native engine designed to execute pretrained machine learning models directly within the browser. It functions as a client-side inference framework, enabling developers to run complex neural networks for natural language processing, computer vision, and audio tasks without requiring a backend server or external API calls.

The framework distinguishes itself by providing a unified pipeline-based abstraction that handles the entire lifecycle of model execution. It manages the dynamic retrieval of model weights and configurations from remote registries, while simultaneously
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [alibaba/otter](https://awesome-repositories.com/repository/alibaba-otter.md) (8,127 ⭐) — Otter is a distributed database synchronization system and change data capture tool designed to replicate data between databases across multiple geographic regions. It functions as a synchronization orchestrator and ETL data pipeline that mirrors records and associated files in real time.

The system employs incremental log parsing to capture database changes and utilizes a consistency-based convergence algorithm and loop-avoidance logic to manage bi-directional replication. It processes data through a pipeline of selection, extraction, transformation, and loading to handle joins and format co
- [dagster-io/dagster](https://awesome-repositories.com/repository/dagster-io-dagster.md) (14,974 ⭐) — Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality.

The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
- [windofshadow/that](https://awesome-repositories.com/repository/windofshadow-that.md) (121 ⭐) — This repository contains the Pytorch implementation of the THAT methods in the following paper:
- [ucbepic/docetl](https://awesome-repositories.com/repository/ucbepic-docetl.md) (3,597 ⭐) — docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.

The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
- [pathwaycom/llm-app](https://awesome-repositories.com/repository/pathwaycom-llm-app.md) (59,341 ⭐) — This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows.

The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
- [densitydesign/raw](https://awesome-repositories.com/repository/densitydesign-raw.md) (333 ⭐) — The missing link between spreadsheets and data visualization
- [datahub-project/datahub](https://awesome-repositories.com/repository/datahub-project-datahub.md) (12,141 ⭐) — DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations.

The platform distinguishes itself through its focus on grounding artificial intelligence and autono
- [getmoto/moto](https://awesome-repositories.com/repository/getmoto-moto.md) (8,550 ⭐) — Moto is a cloud service mockery framework and API mock server that simulates AWS infrastructure locally. It allows developers to test cloud-dependent code and verify infrastructure-as-code templates without deploying real resources or incurring costs.

The project functions as an SDK interceptor that can patch existing service clients to redirect requests to a local mock environment. It can also be run as a standalone HTTP server, enabling any programming language to interact with the simulated endpoints.

The framework covers a vast array of simulated capabilities, including data storage, com
- [estuary/estuary-warehouse-benchmark](https://awesome-repositories.com/repository/estuary-estuary-warehouse-benchmark.md) (2 ⭐) — 👉 Check out the full report here: https://estuary.dev/data-warehouse-benchmark-report/
- [slrbl/human-in-the-loop-machine-learning-tool-tornado](https://awesome-repositories.com/repository/slrbl-human-in-the-loop-machine-learning-tool-tornado.md) (69 ⭐) — Tornado is an open source Human-in-the-loop machine learning tool. It helps you label your dataset on the fly while training your model through a simple web user interface. It supports all data types: structured, text and image.
- [databricks/spark-the-definitive-guide](https://awesome-repositories.com/repository/databricks-spark-the-definitive-guide.md) (3,099 ⭐) — This project is an educational resource and technical manual for Apache Spark, focused on the architecture and practical application of large-scale data processing. It serves as a guide for big data engineering and distributed computing, covering the principles of parallel processing and fault-tolerant data distribution.

The material provides instructional content on designing distributed ETL pipelines and implementing data analysis workflows. It includes tutorials for polyglot data processing, offering patterns and examples for using Python, Scala, and Java within a unified environment.

The
- [then/promise](https://awesome-repositories.com/repository/then-promise.md) (0 ⭐) — This is a simple implementation of Promises. It is a super set of ES6 Promises designed to have readable, performant code and to provide just the extensions that are absolutely necessary for using promises today.
- [zipstack/unstract](https://awesome-repositories.com/repository/zipstack-unstract.md) (6,669 ⭐) — Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools and includes a Model Context Protocol server to connect AI agents to structured information retrieval.

The system ensures data accuracy through a verification tool featuring dual-model verification and human-in-the-loop review with coordinate-based document highlighting. It utilizes natural language extraction schemas
- [bigskysoftware/htmx](https://awesome-repositories.com/repository/bigskysoftware-htmx.md) (48,210 ⭐) — HTMX is a hypermedia-driven frontend library that enables the creation of dynamic, asynchronous web applications by extending standard HTML attributes. It functions as a declarative engine that intercepts browser events to trigger network requests, allowing developers to update specific regions of the document with server-rendered HTML fragments. By shifting the logic of UI composition to the server, it minimizes the need for complex client-side state management and imperative JavaScript.

The library distinguishes itself through a progressive enhancement workflow that ensures web interfaces r
- [pypa/warehouse](https://awesome-repositories.com/repository/pypa-warehouse.md) (4,068 ⭐) — The Python Package Index
- [pypi/warehouse](https://awesome-repositories.com/repository/pypi-warehouse.md) (4,068 ⭐) — The Python Package Index
- [fastai/fastai](https://awesome-repositories.com/repository/fastai-fastai.md) (27,862 ⭐) — Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models.

The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
- [apache/seatunnel](https://awesome-repositories.com/repository/apache-seatunnel.md) (9,427 ⭐) — SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance.

The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
- [randomfractals/pro-data-tools](https://awesome-repositories.com/repository/randomfractals-pro-data-tools.md) (41 ⭐) — Random Fractals Inc. Data Tools 🛠️ is a collection of public data visualization extensions, data viewers, VS Code Notebook renderers, and code snippets for devs and data scientists using VS Code IDE, published under our Random Fractals Inc. ☂️ org.
- [appsmithorg/appsmith](https://awesome-repositories.com/repository/appsmithorg-appsmith.md) (40,051 ⭐) — Appsmith is a low-code platform designed for building internal business tools, such as operational dashboards and administrative panels. It enables developers to construct dynamic user interfaces by dragging and dropping modular widgets onto a canvas and binding them directly to backend data sources. The platform utilizes a reactive framework that automatically updates interface elements and triggers functions whenever underlying data or widget properties change, eliminating the need for manual event handling.

The platform distinguishes itself through a server-side proxy architecture that exe
- [vonng/ddia](https://awesome-repositories.com/repository/vonng-ddia.md) (22,648 ⭐) — This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure.

The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
- [jaykali/maskphish](https://awesome-repositories.com/repository/jaykali-maskphish.md) (3,020 ⭐) — Maskphish is a comprehensive security toolkit that integrates capabilities for digital forensics, network vulnerability scanning, open-source intelligence, penetration testing, and social engineering. It functions as a multi-purpose framework for automating reconnaissance and executing security audits across diverse network environments.

The project features a specialized phishing and social engineering toolkit used for cloning websites, masking URLs, and deploying deceptive pages to capture user credentials. It also includes a remote access Trojan builder for generating platform-specific exe
- [markdown-it/markdown-it](https://awesome-repositories.com/repository/markdown-it-markdown-it.md) (21,575 ⭐) — markdown-it is a token-based Markdown compiler and CommonMark-compliant parser that converts structured plaintext markup into HTML. It functions as an extensible markup processor designed to transform text into browser-ready content while managing security and preventing cross-site scripting.

The project is distinguished by a modular plugin system that allows for the extension of parsing capabilities and the addition of custom syntax, such as footnotes, tables, or emojis. It utilizes a two-stage tokenization process to break documents into structural tokens before rendering them into final HT
- [jd-opensource/joyagent-jdgenie](https://awesome-repositories.com/repository/jd-opensource-joyagent-jdgenie.md) (11,350 ⭐) — Joyagent-jdgenie is an automated data orchestrator designed to centralize the retrieval and processing of information from disparate remote sources. It functions as a framework for building repeatable data pipelines that fetch, clean, and normalize raw input into consistent, structured formats.

The system utilizes a schema-driven engine to apply validation rules and structural templates to incoming data, ensuring compatibility across enterprise systems. By employing configuration-based workflow definitions, it allows for the orchestration of modular tasks into automated execution flows, separ
- [jqlang/jq](https://awesome-repositories.com/repository/jqlang-jq.md) (34,901 ⭐) — This project is a command-line processor designed for the parsing, filtering, and transformation of structured data streams. It functions as a declarative programming environment that treats data as immutable streams, allowing users to perform complex structural modifications through the composition of small, reusable functions. By utilizing a recursive tree traversal engine, the system enables the navigation, inspection, and modification of deeply nested hierarchical data structures.

The engine distinguishes itself through a stream-oriented architecture that processes input records one by on
- [dokploy/dokploy](https://awesome-repositories.com/repository/dokploy-dokploy.md) (34,901 ⭐) — Dokploy is a self-hosted platform-as-a-service designed to simplify the deployment and management of containerized applications and databases. It provides a centralized control plane that decouples administrative management from application workloads, allowing users to oversee infrastructure across multiple server nodes through a unified web interface or a command-line tool.

The platform distinguishes itself through an extensive library of pre-configured application templates, enabling the rapid deployment of databases, identity providers, and various productivity or development tools. It sup
- [mdlayher/raw](https://awesome-repositories.com/repository/mdlayher-raw.md) (0 ⭐)
- [kubeshark/kubeshark](https://awesome-repositories.com/repository/kubeshark-kubeshark.md) (11,954 ⭐) — Kubeshark is a network observability platform designed for Kubernetes environments, functioning as an eBPF-powered engine for cluster-wide traffic analysis. It captures, indexes, and visualizes network activity and API calls directly from the kernel, providing deep visibility into service-to-service communication without requiring sidecar proxies or manual code instrumentation.

The platform distinguishes itself through its ability to perform protocol-aware traffic dissection and user-space cryptographic hooking, which allows for the inspection of encrypted traffic and the reconstruction of ap
- [universaldatatool/universal-data-tool](https://awesome-repositories.com/repository/universaldatatool-universal-data-tool.md) (2,068 ⭐) — Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
- [bitfield/script](https://awesome-repositories.com/repository/bitfield-script.md) (6,991 ⭐) — This project is a Go shell scripting library and framework designed for writing automation scripts and CLI tools. It provides a concurrent data pipeline system for chaining sources, filters, and sinks to process text and JSON streams.

The library distinguishes itself through a comprehensive toolkit for shell-like operations, including a text processing engine for regular expression filtering and frequency analysis, a filesystem utility toolkit for recursive search and path manipulation, and an integrated HTTP client wrapper for building data pipelines that fetch web content.

The capability s
- [haifengl/smile](https://awesome-repositories.com/repository/haifengl-smile.md) (6,387 ⭐) — Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models.

The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
- [alteryx/featuretools](https://awesome-repositories.com/repository/alteryx-featuretools.md) (7,658 ⭐) — Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data.

The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets.

The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
- [raw-labs/mxcp](https://awesome-repositories.com/repository/raw-labs-mxcp.md) (69 ⭐) — Model eXecution + Context Protocol: Enterprise-Grade Data-to-AI Infrastructure
- [avelino/awesome-go](https://awesome-repositories.com/repository/avelino-awesome-go.md) (175,576 ⭐) — This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains.

The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
- [gokumohandas/made-with-ml](https://awesome-repositories.com/repository/gokumohandas-made-with-ml.md) (48,343 ⭐) — Made-With-ML is an automated documentation generator and developer experience platform designed to transform source code into structured, searchable reference websites. It functions as a codebase intelligence tool that parses implementation details to provide clear explanations of logic and data requirements.

The system distinguishes itself by leveraging language-level type annotations and structured code comments to generate interface specifications. By utilizing static analysis to extract metadata, it automates the transformation of docstrings into web-ready documentation, ensuring that tec
- [release-it/release-it](https://awesome-repositories.com/repository/release-it-release-it.md) (8,975 ⭐) — release-it is a Git release automation tool designed to coordinate software versioning, changelog generation, and package publishing. It functions as a semantic versioning manager that increments project versions and updates configuration files based on semantic standards or custom schemes.

The project distinguishes itself through a plugin-based extension system that allows for custom versioning and publishing logic. It supports complex project structures via monorepo versioning automation to synchronize internal dependencies across multiple workspaces.

The tool covers a broad range of capab
- [fox-it/cobaltstrike-beacon-data](https://awesome-repositories.com/repository/fox-it-cobaltstrike-beacon-data.md) (136 ⭐) — https://research.nccgroup.com/2022/03/25/mining-data-from-cobalt-strike-beacons/
- [letta-ai/letta](https://awesome-repositories.com/repository/letta-ai-letta.md) (21,168 ⭐) — Letta is a framework for building, deploying, and managing autonomous AI agents that maintain persistent state across long-term interactions. It provides a comprehensive suite of primitives for defining agents with configurable personas, modular memory blocks, and tool-use capabilities, enabling them to retain user preferences and conversation history over extended sessions.

The platform distinguishes itself through its advanced memory management and orchestration capabilities. It allows agents to autonomously update their own memory, perform retrieval-augmented generation, and coordinate com
- [plotly/plotly.py](https://awesome-repositories.com/repository/plotly-plotly-py.md) (18,270 ⭐) — Plotly.py is a comprehensive framework for building production-ready data applications and interactive dashboards directly from Python code. It functions as both a high-performance visualization library for browser-based charts and a full-stack tool for transforming analytical scripts into responsive, web-based interfaces. By abstracting away the need for manual HTML or JavaScript, it allows developers to define complex layouts and functional logic using modular, reusable components.

The framework distinguishes itself through a robust architecture that handles event orchestration and state sy
- [coffcer/vue-loading](https://awesome-repositories.com/repository/coffcer-vue-loading.md) (137 ⭐) — vue1 directive, show loading block in any element
- [harvardnlp/annotated-transformer](https://awesome-repositories.com/repository/harvardnlp-annotated-transformer.md) (7,325 ⭐) — The Annotated Transformer is an educational resource that provides annotated code implementations of the Transformer architecture for sequence-to-sequence tasks, built with PyTorch. It serves as a learning tool for understanding attention mechanisms, multi-head parallel attention, and scaled dot-product attention through executable examples that walk through each component of the model.

The project covers the full Transformer pipeline, including stacked encoder-decoder layers with residual connections and layer normalization, sinusoidal positional encoding for order-aware representation, and
- [freshos/then](https://awesome-repositories.com/repository/freshos-then.md) (996 ⭐) — :clapper: Tame async code with battle-tested promises
- [pytorch/vision](https://awesome-repositories.com/repository/pytorch-vision.md) (17,743 ⭐) — This project is a comprehensive computer vision library for the PyTorch ecosystem, providing a standardized collection of neural network architectures, datasets, and high-performance transformation utilities. It serves as a foundational framework for building, training, and deploying deep learning models, offering a centralized model registry that allows developers to instantiate architectures with pre-trained weights for tasks such as image classification, object detection, and semantic segmentation.

The library distinguishes itself through its modular approach to data and compute management