awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Pretraining Data Pipelines · Awesome GitHub Repositories

1 repo

Frameworks for processing and organizing large-scale raw text corpora for foundational model training.

Distinguishing note: Focuses on raw text-to-text sequence organization for foundational pretraining, distinct from fine-tuning or alignment tasks.

Explore 1 awesome GitHub repository matching Artificial Intelligence & ML · Pretraining Data Pipelines. Refine with filters or upvote what's useful.

Awesome Pretraining Data Pipelines GitHub Repositories

  • jingyaogong/minimind

    39,663 · View on GitHub↗

    This project is a comprehensive framework for the entire lifecycle of transformer-based language models, supporting everything from foundational pretraining to specialized deployment. It provides a modular toolkit for defining neural network architectures, managing data preparation pipelines, and executing training routines across various scales. The framework is designed to handle the full model development process, including supervised fine-tuning, behavioral alignment, and the integration of agentic capabilities. What distinguishes this framework is its focus on efficient training and adva

    The framework enables the organization of raw text corpora into text-to-text sequences, ensuring consistent data distribution and controlled lengths for foundational language model pretraining.

    Python · artificial-intelligence · large-language-model
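
The idea of organizing raw text into length-controlled sequences for pretraining can be illustrated with a minimal sketch. This is not minimind's actual code or API: the function name `pack_sequences`, the `<eos>` marker, and the whitespace tokenizer are all hypothetical stand-ins chosen for illustration, and a real pipeline would use a trained tokenizer and integer token IDs.

```python
from typing import Iterable, Iterator, List

EOS = "<eos>"  # hypothetical end-of-document marker, not taken from any library


def pack_sequences(docs: Iterable[str], max_len: int) -> Iterator[List[str]]:
    """Concatenate documents and cut the stream into chunks of at most max_len tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    buffer: List[str] = []
    for doc in docs:
        tokens = doc.split()
        if not tokens:
            continue  # skip empty documents
        buffer.extend(tokens)
        buffer.append(EOS)  # mark the document boundary inside the stream
        # Emit full-length chunks as soon as enough tokens have accumulated.
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]
    if buffer:
        yield buffer  # the final chunk may be shorter than max_len


if __name__ == "__main__":
    corpus = ["a b c", "", "d e"]
    for chunk in pack_sequences(corpus, max_len=4):
        print(chunk)
```

Packing documents back to back like this keeps every chunk (except possibly the last) at the same length, which is one common way to get the controlled sequence lengths the description mentions. Whether to pad, drop, or keep the short final chunk is a design choice left open here.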