We curate 18 open-source GitHub repositories matching "open source llms". Results are ranked by relevance to your query.
Yi is a bilingual foundation language model designed for natural language processing, reasoning, and reading comprehension in both English and Chinese. It uses a transformer-based architecture and handles general-purpose text generation and conversational tasks. The model is distinguished by its long-context support, processing and analyzing input sequences of up to 200k tokens. It also offers quantized versions that use low-bit precision to reduce memory footprint, enabling execution on consumer-grade hardware. The project covers a broad rang
Yi is a publicly available bilingual language model with up to 200k-token context, quantization support for consumer hardware, and fine-tuning capabilities, directly meeting the request for a self-hostable LLM for text generation with several of the sought-after features.
The official PyTorch implementation of Google's Gemma models
This repository provides the official PyTorch implementation of Google's open-weight Gemma models, letting you self-host, fine-tune via LoRA, and run text generation with a permissive license and Hugging Face integration.
Qwen2.5 is a suite of foundation large language models designed for natural language generation, code production, and complex mathematical reasoning. The project encompasses a multilingual language model capable of processing dozens of languages and a specialized code generation model for technical problem solving and debugging. The suite is distinguished by its long-context capabilities, enabling the analysis of massive inputs ranging from 256K up to 1 million tokens. It further functions as an agentic framework, utilizing standardized templates and parsers to execute autonomous wo
Qwen2.5 offers open-weight decoder models with long-context handling (256K–1M tokens), multilingual support, and standard fine-tuning compatibility, squarely meeting the need for a self-hosted, publicly available LLM for text generation.
ChatGLM2-6B is a bilingual chat large language model designed for natural conversation and text generation in both English and Chinese. It functions as a fine-tunable language model that supports updating weights via specialized scripts to adapt to specific datasets and tasks. The project serves as a quantized inference engine and multi-GPU model orchestrator, enabling the execution of large models on consumer-grade hardware. It is capable of processing long context sequences up to 32K tokens to maintain understanding across extended documents. The system covers capabilities for multilingual
ChatGLM2-6B is an open-source bilingual LLM with publicly available weights that can be self-hosted and fine-tuned, featuring inference optimization via quantization and multi-GPU support, a long context window of 32K tokens, and integration with the Hugging Face ecosystem, making it a comprehensive fit for your text generation and fine-tuning needs.
InternLM is a large language model released with a comprehensive suite of weights for text generation and complex reasoning. The project provides an inference engine for serving responses, a fine-tuning framework for adjusting model weights, and a platform for building autonomous AI agents. The model can process long-context input sequences of up to one million tokens for document analysis. It employs chain-of-thought reasoning to solve knowledge-intensive tasks by generating intermediate logic steps before producing a final answer. The project covers model weight optimization through s
InternLM is a publicly available, self-hostable large language model with support for long-context processing (up to 1M tokens), fine-tuning, and inference optimization, making it a comprehensive fit for text generation and advanced reasoning tasks.
GLM-4 is a large language model and fine-tuning framework designed for human-like text production, complex reasoning, and multilingual conversation. It functions as a multimodal system capable of processing high-resolution visual content and as a long-context model designed to analyze documents with a context window of up to one million tokens. The project differentiates itself through a function calling interface that enables AI agent development by connecting the model to external APIs and real-time web browsing. It includes specialized capabilities for generating functional programming cod
GLM-4 is an openly available large language model with a built-in fine-tuning framework that supports parameter-efficient adapters (like LoRA), inference optimization, and a context window of up to one million tokens, making it a comprehensive choice for self-hosted text generation and customization.
This project provides a foundational framework and reference implementation for executing causal language modeling and multimodal reasoning on local systems. It includes a set of core components for managing model assets, a fine-tuning framework, and structural definitions required to instantiate transformer-based architectures. The system is distinguished by its ability to process combined text and image inputs through multimodal transformer models for visual reasoning and document analysis. It also supports the deployment of quantized models, reducing memory footprints through low-precision
This is the official Meta repository for the Llama family of LLMs, providing publicly available weights, a fine-tuning framework, inference with quantization, and long-context support — a flagship open-source model that exactly matches the request for a self-hostable, fine-tunable text generation model.
ChatGLM2-6B is an open-weight large language model designed for natural language conversations and text generation in both English and Chinese. It functions as a bilingual chat model capable of processing and maintaining coherence across text sequences up to 32K tokens. The model is optimized for local deployment through precision quantization, which reduces memory requirements to allow execution on consumer-grade hardware. It supports distributing model weights across multiple graphics cards to handle parameters that exceed the memory of a single device. The project covers capabilities for
ChatGLM2-6B is an open-weight bilingual LLM with up to 32K token context and quantization for local deployment, making it a solid fit for self-hosted text generation and fine-tuning, though the description does not explicitly confirm a permissive license or Hugging Face ecosystem integration.
This project is a natural language processing framework focused on a generalized autoregressive pretrainer designed for unsupervised language representation. It implements a language model that combines permutation-based training with a Transformer-XL backbone to function as a long-context text processor. The system is distinguished by its ability to handle text sequences that exceed standard length limits through the use of segment-level recurrence and relative positional encoding. It scales high-performance pretraining across multiple GPUs and TPU clusters using distributed training impleme
XLNet is a transformer-based language model with publicly available pretrained weights that you can self-host and fine-tune, and its Transformer-XL backbone gives it a long context window—but its parameter scale is smaller than modern LLMs and it does not prominently cover inference optimization or LoRA specifics.
This project provides a Chinese large language model based on the LLaMA architecture. It is an instruction-tuned model optimized for natural language processing and multi-turn conversations in Chinese. The system includes a framework for parameter-efficient fine-tuning using low-rank adaptation and quantization to reduce memory requirements. It also implements retrieval-augmented generation for local document question answering and supports long-context processing for sequences of up to 64K tokens. The project covers a broad set of capabilities including supervised instruction tuning, reinforce
This repository provides a Chinese-optimized LLaMA-2 large language model with publicly available weights, supporting LoRA fine-tuning, quantization, and 64K-token contexts, so it fits the search for a self-hostable, fine-tunable LLM for text generation.
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
MiniCPM is a collection of small, self-hostable LLMs with publicly available weights and heavy inference optimization for consumer hardware, making it a direct fit for local text generation — fine-tuning details are less emphasised but the category is right.
OLMo is an open-source large language model from AI2 with publicly available weights, supports fine-tuning including LoRA, integrates with Hugging Face, and offers various sizes and quantization options, making it a comprehensive answer for self-hosted text generation.
SmolLM is a project dedicated to the development of small language models. It focuses on training and fine-tuning compact models that maintain high performance while utilizing fewer parameters. The project emphasizes efficient AI inference and on-device text generation, aiming to enable the deployment of lightweight models on edge devices with limited memory and processing power. It utilizes synthetic data generation to produce artificial datasets that improve the reasoning and training of these AI systems. The system supports a variety of optimization and training capabilities, including we
SmolLM from Hugging Face provides openly available small language model weights that can be self-hosted, fine-tuned, and used for text generation, fitting your request for an open-source LLM with accessible weights, though its compact scale is narrower than a flagship large model.
Qwen3 is a transformer-based large language model designed as a generative AI foundation for understanding, reasoning, and generating human language. It functions as a comprehensive ecosystem for model training, fine-tuning, and production-ready inference, providing the underlying architecture and weights necessary to build diverse artificial intelligence applications. The project distinguishes itself through extensive support for model quantization and distributed inference, enabling efficient execution across a wide range of hardware from consumer-grade devices to scalable cloud infrastruct
Qwen3 is an open-source transformer-based large language model with publicly available weights that supports fine-tuning, quantization, and distributed inference, making it a solid fit for self-hosted text generation — though the provided description does not confirm permissive licensing or full feature details.
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as w
ChatGLM-6B is an open-source large language model with 6 billion parameters, publicly available weights, and a repository that provides inference and fine-tuning tooling, fitting your need for a self-hostable, fine-tunable text generation model.
ChatGLM-6B is an open-source bilingual large language model designed for natural dialogue and text generation in both English and Chinese. It is structured as a dialogue model capable of tasks such as role-playing and information extraction. The project provides implementations for quantized language models, using low-precision weights to reduce GPU memory requirements for local inference. It also supports parameter-efficient fine-tuning, allowing model behavior to be optimized for specific tasks without requiring full retraining. The model includes capabilities for local execution on GPUs a
ChatGLM-6B is an open-source bilingual LLM that can be self-hosted and fine-tuned for text generation, with support for quantized inference and parameter-efficient fine-tuning, making it a solid fit for this search—though its license is not fully permissive and its context window is moderate.
DeepSeek-V3 is a large language model that provides comprehensive resources for model utilization, including technical specifications, pre-trained weights, and evaluation benchmarks. The project details the core transformer architecture, including parameter counts and multi-token prediction modules, while supporting native 8-bit floating-point quantization. The repository offers extensive support for local and distributed inference through integration with multiple frameworks and engines. It includes documentation for deploying the model across various hardware configurations, such as GPUs an
DeepSeek-V3 is a large language model with publicly available pre-trained weights and support for local inference and deployment, making it a strong fit for self-hosted text generation and fine-tuning, though the provided description does not explicitly mention LoRA or Hugging Face integration.
LongLLaMA is a transformer-based language model and fine-tuning framework designed to process and maintain logical coherence across input sequences that significantly exceed standard length limits. By utilizing a focused transformer architecture, the project enables models to handle massive documents or entire books by training attention layers to track distant tokens. The framework distinguishes itself through specialized attention mechanisms that allow for the processing of hundreds of thousands of tokens. It incorporates memory-efficient inference techniques, such as key-value caching and
LongLLaMA is an open-source large language model with publicly available weights, fine-tuned specifically for long contexts. It meets the core requirement for a self-hostable text generation model, though the description does not explicitly cover LoRA fine-tuning or inference optimization.