awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Data Ingestion and Preparation · Awesome GitHub Repositories

8 repos


Tools focused on the initial stages of the pipeline, including loading, formatting, and augmenting raw data for model consumption.

Eight GitHub repositories tagged Artificial Intelligence & ML · Data Ingestion and Preparation.


Awesome Data Ingestion and Preparation GitHub Repositories

  • nomic-ai/gpt4all

    77,146 · GitHub

    GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a compreh

    Converts text into vector representations locally to support semantic search and retrieval without cloud-based services.

    C++ · ai-chat · llm-inference
  • redis/redis

    73,096 · GitHub

    Redis is an in-memory, key-value database designed to provide sub-millisecond latency for read and write operations. It functions as a versatile data platform, serving as a distributed cache, a message broker, a NoSQL document store, and a vector database. The system utilizes an event-driven, single-threaded loop to pr

    Accelerates machine learning workflows by serving pre-computed features directly from high-speed memory.

    C · cache · caching · database
  • tesseract-ocr/tesseract

    72,460 · GitHub

    Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d

    Provides specialized interfaces for preparing and editing raw image data to facilitate model training.

    C++ · hacktoberfest · lstm · machine-learning
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Encodes raw text into high-dimensional vector representations to facilitate efficient machine learning model consumption and semantic search operations.

    Python
  • ultralytics/yolov5

    56,830 · GitHub

    YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning

    Applies geometric and color-based image modifications during the training pipeline to enhance model robustness.

    Python · coreml · deep-learning · ios
  • deepfakes/faceswap

    54,974 · GitHub

    Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users

    Facilitates the retrieval and ingestion of training datasets while supporting multi-input models and visual selection.

    Python · deep-face-swap · deep-learning · deep-neural-networks
  • karpathy/nanoGPT

    53,461 · GitHub

    nanoGPT is a lightweight engine for training and fine-tuning transformer-based language models from scratch. It provides a minimalist codebase designed for educational exploration and rapid experimentation with neural network architectures, utilizing self-attention and feed-forward layers to process sequences and predi

    Converts raw text corpora into optimized binary formats to accelerate data ingestion during training.

    Python
  • unslothai/unsloth

    52,461 · GitHub

    Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade

    Structures raw text into organized question-answer pairs and generates synthetic data using local resources.

    Python · agent · deepseek · deepseek-r1

Explore sub-tags

  • Data Augmentation (1 sub-tag): Techniques and pipelines used to artificially expand training datasets by creating modified versions of existing data.
  • Data Preparation Tools: Utilities designed to clean, format, and transform raw data into a structure suitable for machine learning ingestion.
  • Dataset Loaders: Software components that automate the retrieval and loading of datasets into machine learning training pipelines.
  • Dataset Preprocessing Utilities: Tools for converting raw data into optimized binary formats for efficient model ingestion.
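
To make the "optimized binary formats" idea concrete, here is a minimal sketch of one common pattern: tokenize a text corpus once, store the token IDs as fixed-width integers in a flat file, and later read arbitrary windows by seeking rather than re-parsing text. It is not taken from any repository listed above. The byte-level tokenizer and the function names (encode_bytes, write_tokens, read_window) are hypothetical and exist only for illustration; it uses only the Python standard library.

```python
import array
import os
import tempfile


def encode_bytes(text: str) -> array.array:
    """Hypothetical byte-level tokenizer: one token ID (0-255) per UTF-8 byte.

    IDs are stored as unsigned 16-bit integers ("H") so the same file layout
    would also fit a vocabulary of up to 65,535 tokens.
    """
    return array.array("H", list(text.encode("utf-8")))


def write_tokens(path: str, tokens: array.array) -> None:
    """Dump the token IDs to disk as raw fixed-width integers, with no header."""
    with open(path, "wb") as f:
        tokens.tofile(f)


def read_window(path: str, start: int, length: int) -> list[int]:
    """Read `length` tokens beginning at token index `start`.

    Because every token has the same width, the byte offset is simply
    start * itemsize, so only the requested window is read from disk.
    """
    window = array.array("H")
    with open(path, "rb") as f:
        f.seek(start * window.itemsize)
        window.fromfile(f, length)  # raises EOFError if the file is too short
    return window.tolist()


if __name__ == "__main__":
    corpus = "hello world"
    bin_path = os.path.join(tempfile.mkdtemp(), "train.bin")
    write_tokens(bin_path, encode_bytes(corpus))
    # Decode a window back to text to confirm the round trip.
    print(bytes(read_window(bin_path, 6, 5)).decode("utf-8"))
```

The fixed-width layout is the design choice that matters here: a training loop can sample random offsets without scanning the file. Real projects typically swap in a subword tokenizer and a memory-mapped array, which this sketch does not attempt to reproduce.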