
Awesome GitHub Repositories: Data Ingestion and Preparation

Tools focused on the initial stages of the pipeline, including loading, formatting, and augmenting raw data for model consumption.

8 GitHub repositories matching Artificial Intelligence & ML · Data Ingestion and Preparation.


nomic-ai/gpt4all
77,146 stars on GitHub
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a compreh
Converts text into vector representations locally to support semantic search and retrieval without cloud-based services.
C++ · ai-chat · llm-inference
redis/redis
73,096 stars on GitHub
Redis is an in-memory, key-value database designed to provide sub-millisecond latency for read and write operations. It functions as a versatile data platform, serving as a distributed cache, a message broker, a NoSQL document store, and a vector database. The system utilizes an event-driven, single-threaded loop to pr
Accelerates machine learning workflows by serving pre-computed features directly from high-speed memory.
C · cache · caching · database
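
As a rough illustration of serving pre-computed features from memory, the sketch below stores one hash per entity and reads the fields back as a model-ready vector. `InMemoryHashStore` is a hypothetical stand-in that only mimics the `hset(name, mapping=...)` and `hgetall(name)` calls of a Redis client so the example runs without a server; the key scheme `features:<entity>:<id>` and the feature names are assumptions, not anything prescribed by Redis.

```python
class InMemoryHashStore:
    """Hypothetical stand-in for a Redis client; values are stored as strings."""

    def __init__(self):
        self._data = {}

    def hset(self, name, mapping):
        # Merge the given fields into the hash stored under `name`.
        bucket = self._data.setdefault(name, {})
        bucket.update({field: str(value) for field, value in mapping.items()})

    def hgetall(self, name):
        # Return a copy of every field in the hash (empty dict if missing).
        return dict(self._data.get(name, {}))


def feature_key(entity, entity_id):
    # Assumed key layout: one hash per entity instance.
    return f"features:{entity}:{entity_id}"


def write_features(store, entity, entity_id, features):
    store.hset(feature_key(entity, entity_id), mapping=features)


def read_features(store, entity, entity_id, names, default=0.0):
    # Fetch the whole hash once, then order the values as the model expects.
    raw = store.hgetall(feature_key(entity, entity_id))
    return [float(raw.get(name, default)) for name in names]


store = InMemoryHashStore()
write_features(store, "user", 42, {"clicks_7d": 12, "avg_basket": 31.5})
vector = read_features(store, "user", 42, ["clicks_7d", "avg_basket", "returns_30d"])
```

Against a real server, a client object exposing the same two methods could presumably be passed in place of the stand-in; missing fields fall back to `default` so the vector length stays fixed.
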
tesseract-ocr/tesseract
72,460 stars on GitHub
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d
Provides specialized interfaces for preparing and editing raw image data to facilitate model training.
C++ · hacktoberfest · lstm · machine-learning
zylon-ai/private-gpt
57,116 stars on GitHub
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov
Encodes raw text into high-dimensional vector representations to facilitate efficient machine learning model consumption and semantic search operations.
Python
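
To make "encoding text into vectors for semantic search" concrete, here is a minimal sketch of the general idea. It is not private-gpt's code or API: the normalized bag-of-words vector is a deliberately simple stand-in for a learned embedding model, and the function names (`build_vocab`, `embed`, `search`) are hypothetical.

```python
import math


def build_vocab(docs):
    # Map every distinct lowercase token in the corpus to a vector index.
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    # Count known tokens, then scale to unit length so a dot product
    # between two vectors equals their cosine similarity.
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, docs, top_k=1):
    # Rank documents by similarity to the query, most similar first.
    vocab = build_vocab(docs)
    q = embed(query, vocab)
    scored = sorted(((cosine(q, embed(d, vocab)), d) for d in docs), reverse=True)
    return [doc for _, doc in scored[:top_k]]


docs = ["redis stores data in memory", "tesseract reads scanned text"]
best = search("scanned text", docs)
```

A real pipeline would swap `embed` for a trained model and keep the vectors in an index rather than re-embedding every document per query.
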
ultralytics/yolov5
56,830 stars on GitHub
YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning
Applies geometric and color-based image modifications during the training pipeline to enhance model robustness.
Python · coreml · deep-learning · ios
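
The sketch below illustrates what a geometric and a color-based augmentation look like for detection data in general. It is not YOLOv5's implementation: the image is assumed to be a nested list of `(r, g, b)` pixels, boxes are assumed to be normalized `(x_center, y_center, width, height)` tuples, and all function names and default parameters are hypothetical.

```python
import random


def hflip(image, boxes):
    # Geometric change: mirror each pixel row and mirror each box's x-center.
    flipped = [row[::-1] for row in image]
    new_boxes = [(1.0 - xc, yc, w, h) for xc, yc, w, h in boxes]
    return flipped, new_boxes


def scale_brightness(image, gain):
    # Color change: multiply every channel by `gain`, clamped to 0..255.
    return [
        [tuple(min(255, max(0, round(c * gain))) for c in px) for px in row]
        for row in image
    ]


def augment(image, boxes, rng, flip_prob=0.5, max_gain_delta=0.4):
    # Apply a random flip and a random brightness gain to one training sample.
    if rng.random() < flip_prob:
        image, boxes = hflip(image, boxes)
    gain = 1.0 + rng.uniform(-max_gain_delta, max_gain_delta)
    return scale_brightness(image, gain), boxes


image = [[(100, 200, 0), (10, 20, 30)]]   # a 1x2 image
boxes = [(0.25, 0.5, 0.5, 1.0)]           # one box covering the left pixel
aug_image, aug_boxes = augment(image, boxes, random.Random(0))
```

The point of the design is that labels must be transformed together with the pixels: a flip that does not also move the boxes silently corrupts the training data.
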
deepfakes/faceswap
54,974 stars on GitHub
Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users
Facilitates the retrieval and ingestion of training datasets while supporting multi-input models and visual selection.
Python · deep-face-swap · deep-learning · deep-neural-networks
karpathy/nanoGPT
53,461 stars on GitHub
nanoGPT is a lightweight engine for training and fine-tuning transformer-based language models from scratch. It provides a minimalist codebase designed for educational exploration and rapid experimentation with neural network architectures, utilizing self-attention and feed-forward layers to process sequences and predi
Converts raw text corpora into optimized binary formats to accelerate data ingestion during training.
Python
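
A minimal sketch of the "raw text to binary token file" idea follows. It does not reproduce nanoGPT's own preparation scripts: the character-level vocabulary, the use of the standard-library `array` module with 16-bit unsigned ids, and the function names are all assumptions chosen to keep the example dependency-free.

```python
import os
from array import array


def build_vocab(text):
    # Assign each distinct character a small integer id.
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode_to_bin(text, stoi, path):
    # Store token ids as packed unsigned 16-bit integers; a flat binary
    # file can be read back without re-tokenizing the corpus.
    ids = array("H", (stoi[ch] for ch in text))
    with open(path, "wb") as f:
        ids.tofile(f)
    return len(ids)


def load_bin(path):
    # Read the whole file back into an array of token ids.
    ids = array("H")
    count = os.path.getsize(path) // ids.itemsize
    with open(path, "rb") as f:
        ids.fromfile(f, count)
    return ids
```

Tokenizing once and saving fixed-width ids means the training loop only has to slice integers out of a file, which is typically far cheaper than parsing text on every epoch.
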
unslothai/unsloth
52,461 stars on GitHub
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade
Structures raw text into organized question-answer pairs and generates synthetic data using local resources.
Python · agent · deepseek · deepseek-r1


