30 open-source projects similar to codefuse-ai/mftcoder, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best MFTCoder alternative.
Open Llama is an open source large language model and pre-trained transformer designed as a permissively licensed alternative to proprietary weights. It serves as a base model reproduction of the Llama architecture, providing a set of weights for a decoder-only transformer. The project provides a transparently trained model based on the RedPajama dataset, supporting unrestricted commercial and research use. It includes systems for serving pre-trained weights in various sizes. The project covers natural language processing research and performance benchmarking through text quality evaluation
labelImg is a computer vision labeling tool and image bounding box annotator used to create training datasets for machine learning models. It functions as a desktop utility for drawing rectangular labels on images and saving object coordinates and class names in common machine learning formats. The tool is specifically designed to generate and edit PascalVOC formatted XML files and create image labels in the text-based format required by YOLO object detection pipelines. The software covers object detection annotation and training data preparation, including the ability to manage label catego
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
3FS is a distributed file system and RDMA storage cluster designed for high-performance AI training and inference workloads. It functions as a strongly consistent storage layer that utilizes a disaggregated architecture to pool SSDs and memory resources across multiple nodes. The system provides specialized storage implementations including an AI training checkpoint store for parallel state preservation and a distributed key-value cache store for decoder layer vectors to optimize inference processing. It ensures data integrity through chain replication and apportioned query distribution. The
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Vendor-agnostic orchestration for training, inference and agentic workloads across NVIDIA, AMD, TPU, and Tenstorrent on clouds, Kubernetes, and bare metal.
Code Llama is a large language model based on Llama 2 trained specifically for programming tasks and software development. It provides specialized model types optimized for general code generation, instruction following, and context-aware infilling. The project includes an instruction-tuned programming model for executing technical tasks via natural language prompts and a code infilling model that predicts missing sections based on surrounding source context. A large context code model is also provided to analyze extensive blocks of source code for improved coherence. The system covers capab
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Skaffold is a command-line tool that automates the build, push, and deployment lifecycle for containerized applications on Kubernetes. It functions as a continuous development engine, monitoring source code for changes to trigger incremental updates, manifest hydration, and automated deployments to a cluster. By abstracting the underlying build and deployment tools, it provides a unified interface for managing the inner development loop. The platform distinguishes itself through its environment-aware configuration and flexible build orchestration. It supports diverse build strategies, includi
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
This project is a multimodal model trainer and machine learning fine-tuning tool that provides a containerized workflow for adapting pre-trained models to specific tasks. It features a no-code web interface and a dashboard for training large language models and other machine learning datasets without writing code. The system distinguishes itself by integrating a no-code interface with remote GPU orchestration, allowing users to deploy containerized training environments on cloud infrastructure or local hardware. It includes a dedicated integrator for uploading trained model weights and config
Minimalistic large language model 3D-parallelism training
CML is a pipeline automation tool for training and evaluating machine learning models, functioning as a CI/CD system for machine learning. It serves as a cloud compute orchestrator and Git-based workflow manager that automates model training cycles through branch management, automated commits, and integrated reporting. The project distinguishes itself by provisioning ephemeral cloud instances or Kubernetes nodes to provide specialized hardware for compute-heavy tasks. It also manages remote compute runners, allowing the connection of self-hosted GPU clusters or on-premise machines to execute
This repository contains the code for our paper “ZeroGen: Efficient Zero-shot Learning via Dataset Generation”. Our implementation is built on the source code from dino. Thanks for their work.
Kubeflow is a Kubernetes machine learning platform and containerized toolkit designed to orchestrate the entire machine learning lifecycle. It functions as an MLOps workflow orchestrator and infrastructure layer for building, training, and deploying models within containerized environments. The project provides specialized infrastructure for scaling compute resources and managing GPU workloads for large-scale distributed training. It automates the transition of models from experimental development to production through workflow orchestration and model deployment services. The platform covers
Hopsworks - Data-Intensive AI platform with a Feature Store
Ludwig is a multimodal machine learning platform and low-code framework designed for building, training, and deploying neural networks. It enables the construction of models that process text, images, audio, and tabular data through a unified interface using declarative configuration files rather than custom code. The system features a specialized low-code framework for large language models, supporting supervised fine-tuning, preference alignment, and a constrained decoding tool to force structured data output via logit extraction. It also includes an automated model architecture search to i
NeMo is a multimodal AI framework and toolkit designed for the development, training, and scaling of large language models, generative AI systems, and speech-based models. It functions as an automatic speech recognition toolkit, a text-to-speech engine, and a framework for building models that process and generate combinations of text, image, and audio data. The project serves as a conversational AI orchestrator capable of managing real-time, interruptible voice interactions. It provides specialized workflows for speech translation, converting spoken audio from one language into text or speec
Official CLI and Python SDK for Prime Intellect - access GPU compute, remote sandboxes, RL environments, and distributed training infrastructure for AI development at scale.
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
Ignite is a high-level training framework for PyTorch neural networks that serves as a training engine and deep learning lifecycle manager. It provides a structured system for organizing and automating training and evaluation loops, managing data iterators and triggering event handlers at specific milestones during the model training process. The project distinguishes itself through a comprehensive suite of tools for distributed training and model evaluation. It includes utilities for synchronizing gradients and coordinating collective communication across multiple GPUs or nodes, as well as a
This is the repo for the Code Alpaca project, which aims to build and share an instruction-following LLaMA model for code generation. This repo is fully based on Stanford Alpaca, and only changes the data used for training. The training approach is the same.
This repository contains the code and models of the paper "AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation"
This is the official code for the paper "Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation" (accepted to EMNLP 2023).
This repository contains the code for our paper “SunGen: Self-Guided High-Quality Data Generation in Efficient Zero-Shot Learning”.
🏕️ Reproducible development environment for humans and agents
TFX is an end-to-end platform for deploying production ML pipelines