11 repositories
Mechanisms for importing model weights from local storage.
Distinguishing note: Focuses on local file-based model loading.
Explore 11 awesome GitHub repositories matching data & databases · Local Model Loading.
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as w
Imports pre-trained model weights from local storage to perform inference without external hosting.
Open CLIP is an open source framework for training and deploying Contrastive Language-Image Pre-training models. It serves as a vision-language training framework and multimodal embedding engine that maps images and text into a shared vector space for similarity searches and zero-shot classification. The project provides a toolkit for distributed training of contrastive models and includes an image-to-text generative model for producing natural language descriptions. It supports custom text encoder integration and utilizes teacher-student model distillation to transfer knowledge from large pr
Provides mechanisms to initialize model architectures using weights stored on a local disk.
This project is an on-device AI SDK providing a framework for running large language models, vision models, and speech models locally. It serves as an orchestration layer for local LLM execution, ensuring data privacy and offline availability by utilizing hardware acceleration on the device. The SDK is distinguished by its comprehensive voice and multimodal capabilities, including a coordinated voice pipeline for activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation and tools for processing co
Handles the downloading of model files from remote URLs and loading them into device memory.
MochiDiffusion is a local client for Stable Diffusion that functions as an AI image generation studio. It provides a workspace for performing text-to-image, image-to-image, and inpainting tasks, enabling the production of high-resolution images offline using local hardware and neural engine acceleration. The project includes a local model manager for importing, organizing, and converting machine learning models into compatible formats for offline execution. It features a ControlNet integration tool to guide structural composition and spatial layout, alongside a dedicated image upscaler that u
Loads machine learning weights by scanning local filesystem paths for compatible model files.
OpenPlayground is a web-based comparison playground and multi-provider client used to test and evaluate outputs from multiple large language models and local inference engines side-by-side. It serves as a local testing environment for routing prompts to various external APIs and on-device models through a single interface. The project enables concurrent request dispatching, allowing a single prompt to be sent to multiple models simultaneously for comparative analysis. It includes a parameter tuning interface for refining model behavior via generation settings and provides a system for detecti
Includes mechanisms for importing model weights from local file system storage for on-device inference.
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
Loads GGUF-format model files from a curated list of supported open-source families for local inference.
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy-to-use API.
Loads GGUF-format models from a curated list for local execution through a structured pipeline.
Shimmy is a local large language model inference engine and server that loads and serves weights in GGUF format. It is distributed as a single binary runtime written in Rust, providing a standalone environment for running models without external runtime dependencies. The project uses WebGPU for hardware acceleration, allowing the model's compute kernels to run across diverse graphics hardware through a standardized interface. It has a local server that implements an OpenAI-compatible API layer, letting applications interact with local models through standardized REST endpoints. Memory and performance are managed through quantized key-value cache compression to reduce GPU VRAM usage and rotary embedding scaling to extend the model's context window. The system also includes automatic model file discovery to scan and register compatible weights from local storage. The server is managed through a dedicated command-line interface for controlling operations and verifying model generation.
Provides a pipeline for discovering and loading GGUF-format models for local serving.
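As an illustration of what discovering GGUF weights on local storage can involve, the sketch below walks a directory and keeps only files that begin with the 4-byte `GGUF` magic followed by a little-endian version number. It is a minimal Python sketch, not Shimmy's Rust implementation; the function name `discover_gguf_models` and the layout of the returned dictionary are assumptions made for this example.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def discover_gguf_models(root):
    """Return {model_name: {"path": Path, "version": int}} for GGUF files under root.

    Hypothetical helper for illustration only.
    """
    found = {}
    for path in sorted(Path(root).rglob("*.gguf")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            header = fh.read(8)
        # A usable header holds the magic plus a little-endian uint32 version.
        if len(header) < 8 or header[:4] != GGUF_MAGIC:
            continue
        (version,) = struct.unpack("<I", header[4:8])
        found[path.stem] = {"path": path, "version": version}
    return found
```

Checking the magic bytes rather than trusting the file extension keeps renamed or partially downloaded files out of the registry.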
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production through scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system is distinguished by its advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows via execution graphs and using dynamic batching to improve throughput and latency. It provides specialized support for generative AI and large language models (LLMs) through continuous batching and tensor parallelism. Capability areas include GPU resource management across diverse hardware such as NVIDIA, AMD, and Apple Silicon, as well as comprehensive model lifecycle management for registration, versioning, and worker scaling. It also integrates observability tools to track system health and model performance through Prometheus-compatible metrics. The server is managed through a command-line interface used for lifecycle control and runtime parameter configuration.
Imports model files and handlers into the runtime environment from a specified directory.
picoGPT is a lightweight, low-level runtime environment and inference engine designed to load pre-trained checkpoints and execute generative transformer model inference. It provides a minimal implementation of the generative pre-trained transformer architecture to facilitate local language model execution. The project includes a C++ machine learning library for converting model parameters and executing greedy token generation without heavy external dependencies. It handles remote asset synchronization by downloading pre-trained weights, hyperparameters, and vocabulary files from remote server
Maps saved parameter tensors from local storage directly into the active model structure.
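One way to picture mapping saved tensors into a model structure: checkpoints often store parameters under flat, separator-delimited names, and the loader rebuilds the nested layout that the model code expects. The sketch below is a generic Python illustration under that assumption, not picoGPT's code; `nest_parameters`, its helper, and the example parameter names are hypothetical, and plain lists stand in for tensors.

```python
def nest_parameters(flat_params, sep="/"):
    """Turn {"blocks/0/attn/w": tensor, ...} into nested dicts and lists.

    Hypothetical helper: dicts whose keys are all digits become lists.
    """
    tree = {}
    for name, value in flat_params.items():
        node = tree
        parts = name.split(sep)
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _lists_from_indexed_dicts(tree)


def _lists_from_indexed_dicts(node):
    """Convert dicts keyed "0", "1", ... into lists, recursively."""
    if not isinstance(node, dict):
        return node
    converted = {k: _lists_from_indexed_dicts(v) for k, v in node.items()}
    if converted and all(k.isdigit() for k in converted):
        return [converted[k] for k in sorted(converted, key=int)]
    return converted


# Usage with made-up names; lists stand in for tensors.
flat = {"wte": [[0.1]], "blocks/0/attn/w": [1.0], "blocks/1/attn/w": [2.0]}
params = nest_parameters(flat)
```

After this runs, `params["blocks"]` is a two-element list, so model code can iterate over layers in order instead of parsing names.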
LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain compliant with organizational policies. The system distinguishes itself through a modular scanner pipeline that utilizes local model orchestration to eliminate external network dependencies. It supports real-time security filtering via streaming chunk analysis and implements a fail-fast execution model
Supports loading model weights from local directories to eliminate the need for downloading assets during startup.
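A minimal sketch of the local-first idea, assuming a model directory counts as usable when it contains a set of expected files. The function `resolve_local_model` and the file names in `REQUIRED_FILES` are illustrative assumptions, not LLM Guard's API.

```python
from pathlib import Path

# Assumed file names for illustration; real models vary.
REQUIRED_FILES = ("config.json", "tokenizer.json")


def resolve_local_model(model_dir, required=REQUIRED_FILES):
    """Return the directory as a Path if every required file exists, else raise.

    Failing fast here avoids a silent fallback to a network download at startup.
    """
    root = Path(model_dir)
    missing = [name for name in required if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{root} is missing required model files: {', '.join(missing)}"
        )
    return root
```

Raising on an incomplete directory, rather than quietly fetching the gaps, keeps startup behavior predictable in environments without network access.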