23 repositories
Utilities for tokenizing, cleaning, and formatting raw datasets for machine learning model training.
Distinguishing note: Specifically tailored for preparing text and dialogue data for LLM training, distinct from general data ETL.
Explore 23 awesome GitHub repositories matching artificial intelligence & ml · Data Preprocessing Pipelines.
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Transforms raw web content into clean markdown specifically designed to serve as high-quality context for LLM prompts.
This project is a comprehensive framework for the entire lifecycle of transformer-based language models, supporting everything from foundational pretraining to specialized deployment. It provides a modular toolkit for defining neural network architectures, managing data preparation pipelines, and executing training routines across various scales. The framework is designed to handle the full model development process, including supervised fine-tuning, behavioral alignment, and the integration of agentic capabilities. What distinguishes this framework is its focus on efficient training and adva
A set of utilities for tokenizing, formatting, and structuring raw text and dialogue datasets for efficient model training.
This project is a comprehensive library of state-of-the-art neural network architectures designed for image classification and feature extraction. It provides a complete deep learning training framework that supports distributed execution, allowing users to build, train, and fine-tune vision models using optimized schedulers and pre-configured training recipes. The library distinguishes itself through a modular backbone architecture that treats neural networks as decoupled feature extractors, enabling the retrieval of multi-scale outputs for downstream tasks like object detection and segmenta
"Preprocessing operations are automatically resolved and matched to the specific input requirements of a chosen model architecture at runtime."
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Transforms raw text into training-ready formats by applying tokenization and creating language-specific dictionaries.
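As a rough illustration of what "tokenization plus a dictionary" can look like, here is a minimal sketch. It is not Fairseq's API: the function names `build_dictionary` and `encode`, the whitespace tokenization, and the `<pad>`/`<unk>` special symbols are all assumptions made for this example.

```python
from collections import Counter


def build_dictionary(lines, min_count=1, specials=("<pad>", "<unk>")):
    """Map each whitespace token to an integer id (hypothetical helper).

    Special symbols get the lowest ids; remaining tokens are ordered by
    descending frequency, with ties broken alphabetically.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {sym: i for i, sym in enumerate(specials)}
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(line, vocab, unk="<unk>"):
    """Turn a line into ids, falling back to the unknown symbol's id."""
    return [vocab.get(tok, vocab[unk]) for tok in line.split()]


corpus = ["a b a", "b c a"]
vocab = build_dictionary(corpus)
ids = encode("a c z", vocab)  # "z" is out of vocabulary
```

In a multilingual setting, one such dictionary would typically be built per language from that language's side of the corpus.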
This project is a professional development repository that provides structured learning paths for individuals pursuing careers in data-centric engineering and artificial intelligence. It functions as a competency benchmarking framework, defining the core knowledge areas and technical milestones required to achieve proficiency in specialized domains. The repository distinguishes itself through hierarchical knowledge graphing, which organizes complex technical subjects into nested tree structures to create clear, progressive learning sequences. By centralizing curated educational resources and
Provides automated workflows for cleaning, encoding, and structuring raw data to ensure compatibility with predictive modeling requirements.
InsightFace is a comprehensive deep learning framework designed for face recognition, biometric identity verification, and feature extraction. It provides a specialized engine for one-to-one verification and one-to-many identification tasks, utilizing convolutional neural networks to transform raw image pixels into high-dimensional vector embeddings. The project includes a complete toolkit for detecting, aligning, and processing facial data to ensure consistent identity discrimination. Beyond core recognition, the platform distinguishes itself through an extensive model management and optimiz
Standardizes input data through automated face detection, landmark alignment, and cropping to ensure consistent feature extraction across varying conditions.
Gitingest is a tool for extracting, converting, and estimating the token size of codebases to facilitate ingestion by large language models. It transforms GitHub repositories and local directories into a single formatted text file that serves as a structured context window for model analysis. The utility includes a codebase token estimator to calculate file structure and total token counts, helping to determine the scale of the extracted content. It supports both public and private repositories through token-based authentication and respects gitignore configurations to filter out irrelevant p
Converts repository data into a formatted text structure specifically prepared as context for LLM prompts.
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
Splits input text into sub-word units using byte-level, whitespace, or regex-based strategies for neural network consumption.
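The three splitting strategies named above can be sketched in a few lines of Python. This is an illustration only, not MNN's implementation: the function names and the default regex are assumptions, and the sketch covers only the initial split, not any later sub-word merging step.

```python
import re


def whitespace_split(text: str) -> list[str]:
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_split(text: str, pattern: str = r"\w+|[^\w\s]") -> list[str]:
    """Split with a regex: word runs and single punctuation marks (assumed pattern)."""
    return re.findall(pattern, text)


def byte_level_split(text: str) -> list[int]:
    """Represent text as its UTF-8 bytes, so every input maps to ids in 0..255."""
    return list(text.encode("utf-8"))


sample = "Hello, world!"
by_space = whitespace_split(sample)
by_regex = regex_split(sample)
by_bytes = byte_level_split("hi")
```

The byte-level variant never produces an unknown symbol, at the cost of longer sequences for non-ASCII text.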
GenericAgent is an LLM agent framework and autonomous system controller designed to manage local systems, web browsers, and hardware interfaces through action and observation loops. It functions as a tool orchestrator that routes model calls to local executors, enabling the automation of complex tasks on a host machine. The project is distinguished by its self-evolving AI agent capabilities, which convert successful execution paths into reusable procedural scripts and skill trees to reduce future reasoning overhead. It employs a context optimization engine that utilizes layered memory hierarc
Tracks conversation length using a character-domain formula to manage token limits independently of specific model tokenizers.
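A character-based length estimate might look like the sketch below. The project's actual formula is not given here, so the 4-characters-per-token ratio, the `estimate_tokens` name, and the `trim_history` policy (keep the most recent messages that fit) are all assumptions chosen for illustration.

```python
import math

CHARS_PER_TOKEN = 4  # assumed ratio; real tokenizers vary by language and content


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate token count from character count alone (no tokenizer needed)."""
    return math.ceil(len(text) / chars_per_token)


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages whose estimated total fits within max_tokens."""
    kept = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 8, "c" * 8]
recent = trim_history(history, max_tokens=4)
```

Because the estimate depends only on string length, the same budget logic works unchanged when the underlying model, and therefore its tokenizer, is swapped.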
This project is an LLM knowledge base builder and personal knowledge management tool. It is a desktop application designed to transform diverse documents into a persistent, interlinked wiki through LLM analysis and incremental ingestion. The system distinguishes itself with a knowledge graph visualizer that uses community detection algorithms to map relationships between concepts and identify topical clusters. It features a hybrid retrieval system that combines keyword matching, vector embeddings, and graph relevance to locate information. The platform covers a wide range of capabilities inc
Includes a configuration system to manage token budgets for wiki pages, chat history, and system prompts.
Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies. The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
Tokenizes entire datasets in memory before training begins to optimize performance for smaller datasets.
SpeechBrain is an all-in-one deep learning toolkit designed for speech and audio processing. Built as a modular library, it provides a structured environment for developing, training, and deploying neural network models across a wide range of tasks, including automatic speech recognition, speaker identification, and audio enhancement. The framework distinguishes itself through a configuration-driven approach that separates model architecture and training hyperparameters from application logic. By utilizing externalized configuration files and standardized recipes, it enables reproducible rese
Standardizes audio dataset loading, augmentation, and preprocessing through unified interfaces for machine learning training.
Reader is an AI data ingestion pipeline and web content parser designed to convert websites and documents into clean markdown for use with large language models. It functions as a headless browser content extractor and web-to-markdown converter, transforming URLs and PDF files into structured text formats while removing irrelevant web clutter. The system optimizes retrieval augmented generation by acting as a search optimizer that retrieves web results and applies re-ranking to improve context relevance. It further enhances content accessibility by using vision models to generate descriptive
Converts unstructured web and document data into clean markdown to provide high-quality context for LLMs.
Trax is a deep learning framework and hardware-agnostic tensor engine built for designing and training neural networks. It serves as a research tool providing high-level combinators for composing complex architectures, alongside a dedicated library for building transformer models and a toolkit for reinforcement learning. The framework is distinguished by its support for reversible and sparse transformer architectures, which reduce memory and computational overhead. It enables a single set of model instructions to execute across different hardware backends without changing the underlying co
Provides utilities for tokenizing, shuffling, and formatting raw datasets for machine learning training.
Omniparse is a multimodal content parser and generative AI ingestion engine designed to convert documents, images, and multimedia into a uniform format. It functions as a data preprocessing pipeline that transforms diverse raw data sources into structured markdown to improve the performance of large language model workflows. The system extracts text and structural data from PDFs, images, audio, and video files. It includes a web crawler that converts dynamic website content into clean markdown and a multimodal transformation process that maps disparate input formats into a unified data schema
Provides a comprehensive pipeline to clean and format diverse raw data sources specifically for large language model workflows.
Glass is an AI desktop assistant and screen-to-LLM interface that processes visual and auditory context from a computer to automate tasks. It functions as a tool for screen analysis, bridging real-time desktop captures with large language models to extract semantic meaning and data insights. The system enables AI-assisted desktop interaction by recording live screen and audio data to provide a persistent digital memory for processing. This allows the application to analyze visible screen information and trigger automation workflows through global keyboard shortcuts.
Extracts semantic meaning from captured screen and audio buffers using large language models.
This project is an educational implementation guide and framework for building Retrieval Augmented Generation systems. It provides a workflow for constructing a knowledge base pipeline that partitions documents, indexes them as vectors, and provides external context for language model prompts. The system features a document chunking framework that uses recursive character splitting to fit text into model context windows. It includes an in-memory vector store and a similarity search system that retrieves relevant text segments by calculating the mathematical distance between dense embedding ve
Dynamically prepares and inserts relevant information into LLM prompts to improve response quality.
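Recursive character splitting, mentioned in the description above, can be sketched as follows. This is a simplified version written for illustration, not the project's code: the `recursive_split` name, the separator order, and the absence of chunk overlap are assumptions.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first (paragraphs), and only falls back to
    finer ones (lines, words, raw characters) for pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


words = recursive_split("aaa bbb ccc", 7)
mixed = recursive_split("one two\n\nthree four five", 5)
```

Each resulting chunk would then be embedded and stored, so that the retrieval step can insert only the most similar chunks into the prompt.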
code2prompt is a codebase-to-prompt converter and LLM context generator that transforms source code and directory structures into formatted text blocks for large language models. It functions as both a utility for generating prompts and an AI agent context server that exposes codebase files and metadata to coding assistants via a standardized server protocol. The tool distinguishes itself through git-aware capabilities, integrating commit messages and branch diffs to provide version control context for AI-generated code changes. It also utilizes the Model Context Protocol to allow external AI
Aggregates source code and directory structures into a single formatted text block for LLM prompts.
Claude-context is a retrieval-augmented generation pipeline and semantic code search tool. It functions as an LLM codebase indexer and RAG context provider, designed to index local directories and retrieve relevant code files to provide context for large language models. The system operates as a hybrid search engine that combines keyword matching with dense vector search. This allows for the retrieval of code snippets and logic using natural language queries based on meaning rather than exact text matches. The project covers codebase indexing and search index management, utilizing asynchrono
Prepares and filters relevant code files to provide AI assistants with precise context for accurate responses.
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Converts raw datasets into structured formats by generating prompts and parsing ground truth answers for model training.