What are the best open-source alternatives to RedPajama Data?

30 open-source projects similar to togethercomputer/redpajama-data, ranked by shared features. Top picks: datajuicer/data-juicer, esbatmop/mnbvc, eleutherai/gpt-neox, facebookresearch/metaseq, huggingface/peft, langchain-ai/langchain, lm-sys/fastchat, langchain-ai/langsmith-sdk, linksoul-ai/chinese-llama-2-7b, huggingface/accelerate.

Is datajuicer/data-juicer a good alternative to RedPajama Data?

Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of…

Is esbatmop/mnbvc a good alternative to RedPajama Data?

MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms,…

Is eleutherai/gpt-neox a good alternative to RedPajama Data?

gpt-neox is a distributed training system and framework for building large-scale autoregressive language models. It implements the transformer architecture and provides a toolkit for training models with billions of parameters by distributing weights across compute clusters. The framework distingu…

Is facebookresearch/metaseq a good alternative to RedPajama Data?

Metaseq is a transformer sequence modeling toolkit designed for training, fine-tuning, and deploying sequence-to-sequence models using open pre-trained weights. It provides a comprehensive framework for large language model training, including dedicated tools for sequence dataset processing and a s…

Is huggingface/peft a good alternative to RedPajama Data?

This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requ…

Is langchain-ai/langchain a good alternative to RedPajama Data?

LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, mult…

Is lm-sys/fastchat a good alternative to RedPajama Data?

FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications. The…

Is langchain-ai/langsmith-sdk a good alternative to RedPajama Data?

This repository contains the Python and JavaScript SDKs for interacting with the LangSmith platform. See the LangSmith documentation for details on using the platform and the client SDK.

Is linksoul-ai/chinese-llama-2-7b a good alternative to RedPajama Data?

The first Chinese LLaMA2 model in the open-source community that can be downloaded and run!

Is huggingface/accelerate a good alternative to RedPajama Data?

Accelerate is a PyTorch distributed training library that abstracts the boilerplate required to run models across multiple GPUs, TPUs, and CPUs. It functions as a deep learning model scaler and distributed hardware orchestrator, allowing the same training script to run on different hardware backend…

Open-source alternatives to RedPajama Data

30 open-source projects similar to togethercomputer/redpajama-data, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best RedPajama Data alternative.

datajuicer/data-juicer (6,574 stars)
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys…
Tags: Python, data, data-analysis, data-pipeline

esbatmop/MNBVC (4,123 stars)
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS…
Tags: chinese, chinese-language, chinese-nlp

EleutherAI/gpt-neox (7,392 stars)
gpt-neox is a distributed training system and framework for building large-scale autoregressive language models. It implements the transformer architecture and provides a toolkit for training models with billions of parameters by distributing weights across compute clusters. The framework distinguishes itself through extensive support for distributed model parallelism, including pipeline and sequence parallelism, to overcome single-device memory limits. It further supports sparse model architectures using a mixture of experts system with Sinkhorn-based routing. The project covers a broad ran…
Tags: Python, deepspeed-library, gpt-3, language-model

Open-source alternatives to RedPajama Data

datajuicer/data-juicer

esbatmop/MNBVC

EleutherAI/gpt-neox

facebookresearch/metaseq

huggingface/peft

langchain-ai/langchain

lm-sys/FastChat

langchain-ai/langsmith-sdk

LinkSoul-AI/Chinese-Llama-2-7b

huggingface/accelerate

higgsfield-ai/higgsfield

facebookresearch/llama-recipes

intel-analytics/BigDL

jzhang38/TinyLlama

hpcaitech/ColossalAI

hppRC/llm-lora-classification

bigscience-workshop/petals

chroma-core/chroma

facebookresearch/fairscale

bigcode-project/starcoder2

hiyouga/LLaMA-Factory

dvmazur/mixtral-offloading

facebookresearch/llama

huggingface/optimum

databrickslabs/dolly

lxe/simple-llm-finetuner

artidoro/qlora

Facico/Chinese-Vicuna

BerriAI/litellm

masa3141/japanese-alpaca-lora