30 open-source projects similar to nebuly-ai/nebullvm, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Nebullvm alternative.
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model executi
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
Cortex is a Kubernetes-based machine learning infrastructure platform designed for deploying, scaling, and managing models and workloads. It functions as a serverless inference engine and GPU cluster orchestrator, providing the tools necessary to execute real-time, asynchronous, and batch model predictions. The platform utilizes declarative infrastructure-as-code for provisioning model clusters and environments. It optimizes operational costs by elastically scaling CPU and GPU resources through the use of spot instances. The system covers a broad set of operational capabilities, including wo
This project is a quantized fine-tuning framework for large language models. It combines four-bit quantization with low-rank adapters to reduce the GPU memory needed for training, which enables fine-tuning on consumer-grade hardware. It further reduces the memory footprint through double quantization and a paged optimizer that offloads states to system RAM. The system supports distributed training across multiple GPUs to handle larger parameter scales and includes utilities for custom dataset
This project provides a foundational framework and reference implementation for executing causal language modeling and multimodal reasoning on local systems. It includes a set of core components for managing model assets, a fine-tuning framework, and structural definitions required to instantiate transformer-based architectures. The system is distinguished by its ability to process combined text and image inputs through multimodal transformer models for visual reasoning and document analysis. It also supports the deployment of quantized models, reducing memory footprints through low-precision
SmolLM is a project dedicated to the development of small language models. It focuses on training and fine-tuning compact models that maintain high performance while utilizing fewer parameters. The project emphasizes efficient AI inference and on-device text generation, aiming to enable the deployment of lightweight models on edge devices with limited memory and processing power. It utilizes synthetic data generation to produce artificial datasets that improve the reasoning and training of these AI systems. The system supports a variety of optimization and training capabilities, including we
PyTorch Lightning is a high-level deep learning framework for PyTorch that automates training loops and removes repetitive engineering boilerplate. It functions as a structured pipeline for managing machine learning experiments, providing a distributed training orchestrator and tools for mixed-precision training. The framework decouples scientific model architecture from the engineering required for infrastructure and scaling. This separation allows the same model code to execute across CPUs, GPUs, or TPUs through a hardware-agnostic execution engine and a centralized trainer that manages the
TensorRT is a deep learning inference engine and software development kit designed to optimize and deploy neural networks for high-performance execution on NVIDIA GPUs. It functions as a GPU acceleration framework that reduces latency and increases throughput for trained models during production deployment. The toolkit imports models from the Open Neural Network Exchange format and transforms them into optimized engines. It utilizes graph-based model optimization, layer-fusion kernel generation, and precision-based quantization to convert floating point weights into lower precision formats.
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and fine-tuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient fine-tuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
HAMi is a hardware orchestration and virtualization system designed to manage accelerators within Kubernetes. It functions as a device plugin that partitions physical hardware into isolated virtual slices, enabling multiple containers to share a single device through enforced memory limits and compute quotas. The project provides a virtualization manager and a heterogeneous compute scheduler that distributes tasks across diverse accelerator types. It uses packing and topology policies to optimize workload placement and allows for specific hardware targeting using unique device identifiers. T
DiffusionBee is a Stable Diffusion desktop client for macOS that functions as an AI image generator and editor. It allows for the local generation of images from text prompts and the management of diffusion models without requiring external cloud services or technical setup. The application includes a local diffusion model manager for importing and switching between custom trained model files to achieve specific artistic styles. It also features a system for tracking generation history and uploading assets to a public gallery. The software covers several image synthesis and manipulation work
LMOps is a research-driven operations framework for optimizing the deployment, fine-tuning, and performance of large language models. It provides a specialized toolkit for foundation model adaptation, inference acceleration, prompt optimization, and context orchestration. The framework distinguishes itself through an inference accelerator that reduces token generation latency by verifying and copying overlapping text spans from reference documents. It also features a prompt engineering optimizer that employs reinforcement learning, beam search, and non-natural language markers to automaticall
ai-toolkit is a diffusion model training toolkit designed for fine-tuning image and video generation models. It functions as a containerized model trainer and GPU training job manager, providing the infrastructure to orchestrate dependencies and manage training processes on remote GPU hardware. The system utilizes low-rank adaptation techniques, including LoRA and LoKr weight optimization, to reduce the hardware requirements for model training. It distinguishes itself through a web-based training controller that allows for the monitoring and modification of hyperparameters, secured by token-b
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
DeepSpeedExamples is a collection of reference implementations and scripts for training, fine-tuning, and executing inference on large-scale AI models using DeepSpeed optimization. It provides a distributed model training guide and practical workflows for adapting large language models through memory-efficient techniques. The repository includes specialized implementations for pipeline parallelism to handle models exceeding single GPU memory and a suite of examples for ZeRO memory optimization to reduce per-device overhead. It also features standardized test suites for benchmarking the throug
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production via scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system distinguishes itself through advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and employing dynamic batching to improve throughput and latency. It provides specialized support for generat
FLAML is an automated machine learning framework, hyperparameter optimization tool, and large language model agent orchestrator. It provides a system for model selection and tuning across various learners and datasets, while also offering a toolkit for optimizing the inference parameters and fine-tuning settings of large language models. The project features a meta-learning tuning system that analyzes historical task data to generate data-dependent default configurations, accelerating model convergence. It further enables the design of collaborative multi-agent systems through conversational
This project is a framework for fine-tuning large language models using parameter-efficient training techniques. It provides a structured pipeline for adapting pre-trained transformer models to specific tasks while minimizing the computational resources and memory required during the training process. The system distinguishes itself by utilizing low-rank adaptation, which injects trainable rank-decomposition matrices into frozen transformer layers. By updating only this small subset of injected parameters rather than the entire model, the framework reduces the overhead associated with gradien
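As a rough illustration of the low-rank adaptation idea described in this entry, the following minimal sketch keeps a weight matrix frozen and adds a trainable product of two small matrices to it. It is not taken from the listed project: the class name `LoRALinear`, the `alpha` scaling parameter, and the zero initialization of `B` are assumptions chosen for illustration, and the code uses plain Python lists rather than any deep learning library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update.

    The effective weight is W + (alpha / rank) * B @ A, where only A and B
    would be updated during fine-tuning.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight                      # frozen, shape d_out x d_in
        self.d_out, self.d_in = len(weight), len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        # A starts with small random values, B with zeros (an assumption of
        # this sketch), so the low-rank update is initially zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)            # shape d_out x d_in
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x of length d_in."""
        return [sum(w * v for w, v in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameter_count(self):
        return self.rank * (self.d_in + self.d_out)

    def full_parameter_count(self):
        return self.d_in * self.d_out


if __name__ == "__main__":
    base = [[float(i + j) for j in range(8)] for i in range(8)]
    layer = LoRALinear(base, rank=2)
    print(layer.trainable_parameter_count(), "trainable of", layer.full_parameter_count())
```

In this toy 8 by 8 case with rank 2, the two injected matrices hold 32 values against 64 in the frozen weight; the gap grows with layer size, since the trainable count scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.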
This repository is a collection of frameworks and guides for Llama models, functioning as a fine-tuning framework, an inference pipeline, and an AI workflow orchestrator. It provides tools for adapting large language models to specific datasets and domains. The project includes a parameter-efficient fine-tuning toolkit that utilizes techniques like low-rank adaptation to reduce memory and compute requirements. It also serves as an implementation guide for retrieval-augmented generation, combining model inference with external data retrieval to improve response accuracy. The capability surfac
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
ART is a platform for agentic training, providing a reinforcement learning framework, training environment, and compute orchestrator. It enables the improvement of multi-step agent reasoning and tool usage through group relative policy optimization and a judge-based reward modeling system. The project features tools for model distillation to transfer capabilities from large teacher models to smaller architectures, as well as a system for capturing execution trajectories to generate synthetic training data. It supports specialized training workflows including supervised fine-tuning for baselin
This project is a study guide and reference for a technical course on the architectures and training methods of Transformers and large language models. It gives an overview of how neural networks process data and how to align model behavior with specific performance goals. The repository provides specialized guides on several key areas of model development. These include detailed references for transformer architectures, implementation frameworks for retrieval-augmented generation and agentic workflows, and technical guides for model optimization
OpenNMT-py is a PyTorch framework for training and deploying neural machine translation models and large language models. It functions as a distributed model training system, an inference engine, and a toolkit for fine-tuning large language models. The framework distinguishes itself with a dedicated toolkit for adapting large language models through low-rank adaptation, quantization, and instruction tuning. It also includes a neural machine translation server that allows trained models to be hosted and exposed via REST API endpoints. The project covers a broad range
Starcoder is a large language model and associated framework designed to generate, complete, and evaluate source code across multiple programming languages. It functions as a source code model that can produce complete function implementations and predict subsequent characters in a line of code based on provided prompts. The project provides a specialized toolkit for adapting base models to specific coding tasks and instruction-following behaviors. This includes a conversational code assistant framework for training models to generate code via natural language chat, as well as a parameter-eff
This repository provides a collection of reference implementations and code examples for training and deploying machine learning models using the MLX framework. It serves as a practical guide for executing distributed training, fine-tuning large language models, converting model weights, and implementing multimodal generative workflows. The project distinguishes itself through specialized examples for local hardware execution, featuring weight quantization to reduce memory usage and low-rank adaptation for parameter-efficient fine-tuning. It also includes scripts for transforming external mod
OpenMythos is a framework for implementing recurrent large language model architectures. It utilizes recurrent transformer blocks to enable compute-adaptive reasoning and variable processing depth through multiple iterative passes over the same weights. The system features a mixture of experts framework that routes tokens between shared and specialized layers to optimize parameter usage. It also includes parameter-efficient fine-tuning tools using low-rank adaptation modules to modify model behavior with minimal weight updates. The framework covers distributed training pipelines using data p
LLM4Decompile is a toolset and framework for binary-to-source code translation. It uses large language models to transform machine code into readable source code and recover the original logic of compiled executables. The project includes a specialized pipeline for generating synthetic training datasets by converting source code into assembly pairs. It provides a fine-tuning framework to optimize deep learning models on these binary-to-source datasets, increasing the accuracy of code recovery. The system also features capabilities for refining decompiled pseudo-code. This process focuses on