11 repositories
Mechanisms for importing model weights from local storage.
Distinguishing note: Focuses on local file-based model loading.
Explore 11 awesome GitHub repositories matching data & databases · Local Model Loading.
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as w
Imports pre-trained model weights from local storage to perform inference without external hosting.
Open CLIP is an open source framework for training and deploying Contrastive Language-Image Pre-training models. It serves as a vision-language training framework and multimodal embedding engine that maps images and text into a shared vector space for similarity searches and zero-shot classification. The project provides a toolkit for distributed training of contrastive models and includes an image-to-text generative model for producing natural language descriptions. It supports custom text encoder integration and utilizes teacher-student model distillation to transfer knowledge from large pr
Provides mechanisms to initialize model architectures using weights stored on a local disk.
This project is an on-device AI SDK providing a framework for running large language models, vision models, and speech models locally. It serves as an orchestration layer for local LLM execution, ensuring data privacy and offline availability by utilizing hardware acceleration on the device. The SDK is distinguished by its comprehensive voice and multimodal capabilities, including a coordinated voice pipeline for activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation and tools for processing co
Handles the downloading of model files from remote URLs and loading them into device memory.
MochiDiffusion is a local client for Stable Diffusion that functions as an AI image generation studio. It provides a workspace for performing text-to-image, image-to-image, and inpainting tasks, enabling the production of high-resolution images offline using local hardware and neural engine acceleration. The project includes a local model manager for importing, organizing, and converting machine learning models into compatible formats for offline execution. It features a ControlNet integration tool to guide structural composition and spatial layout, alongside a dedicated image upscaler that u
Loads machine learning weights by scanning local filesystem paths for compatible model files.
OpenPlayground is a web-based comparison playground and multi-provider client used to test and evaluate outputs from multiple large language models and local inference engines side-by-side. It serves as a local testing environment for routing prompts to various external APIs and on-device models through a single interface. The project enables concurrent request dispatching, allowing a single prompt to be sent to multiple models simultaneously for comparative analysis. It includes a parameter tuning interface for refining model behavior via generation settings and provides a system for detecti
Includes mechanisms for importing model weights from local file system storage for on-device inference.
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
Loads GGUF-format model files from a curated list of supported open-source families for local inference.
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy-to-use API.
Loads GGUF-format models from a curated list for local execution through a structured pipeline.
Shimmy is a local inference engine and server for large language models that loads and serves GGUF-format weights. It is distributed as a single binary runtime written in Rust, providing a self-contained environment for running models without external runtime dependencies. The project uses WebGPU for hardware acceleration, allowing model compute kernels to run on a variety of graphics hardware through a standardized interface. It includes a local server that implements an OpenAI-compatible API layer, letting applications talk to local models through standardized REST endpoints. Memory and performance are managed through quantized key-value cache compression, which reduces GPU VRAM usage, and rotary embedding scaling, which extends the model's context window. The system also includes automatic model file discovery to scan local storage and register compatible weights. The server is managed through a dedicated command-line interface for controlling operations and verifying model generation.
Provides a pipeline for discovering and loading GGUF-format models for local serving.
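As an illustration of the discovery step described above, here is a minimal Python sketch, not Shimmy's actual Rust implementation. The function name `discover_gguf_models` and the directory layout are assumptions made for the example; the only format detail relied on is that GGUF files begin with the four ASCII bytes `GGUF`.

```python
from pathlib import Path

# GGUF files start with this 4-byte magic value.
GGUF_MAGIC = b"GGUF"


def discover_gguf_models(root):
    """Return sorted paths under `root` that look like GGUF model files.

    Hypothetical helper: a file qualifies if it has a .gguf suffix
    and its first four bytes match the GGUF magic value.
    """
    found = []
    for path in Path(root).rglob("*.gguf"):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(4) == GGUF_MAGIC:
                found.append(path)
    return sorted(found)
```

Checking the magic bytes as well as the suffix avoids registering a renamed or truncated file as a model; a real server would likely go on to parse the header metadata before serving.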
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production through scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system distinguishes itself through advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and using dynamic batching to improve throughput and latency. It offers specialized support for generative AI and large language models (LLMs) through continuous batching and tensor parallelism. Broader capability areas include GPU resource management across hardware such as NVIDIA, AMD, and Apple Silicon, as well as full model lifecycle management for registration, versioning, and worker scaling. It also integrates observability tools for tracking system health and model performance through Prometheus-compatible metrics. The server is managed through a command-line interface used for lifecycle control and runtime parameter configuration.
Imports model files and handlers into the runtime environment from a specified directory.
picoGPT is a lightweight, low-level runtime environment and inference engine designed to load pre-trained checkpoints and execute generative transformer model inference. It provides a minimal implementation of the generative pre-trained transformer architecture to facilitate local language model execution. The project includes a C++ machine learning library for converting model parameters and executing greedy token generation without heavy external dependencies. It handles remote asset synchronization by downloading pre-trained weights, hyperparameters, and vocabulary files from remote server
Maps saved parameter tensors from local storage directly into the active model structure.
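The mapping step described above can be sketched roughly as follows. This is an illustrative Python example, not picoGPT's own code: the helper names, the `/` separator, and the parameter names in the usage note are all assumptions, and plain lists stand in for tensors. It also assumes no parameter name is a prefix of another.

```python
def set_nested(tree, keys, value):
    """Store `value` in nested dict `tree` at the path given by `keys`."""
    for key in keys[:-1]:
        tree = tree.setdefault(key, {})
    tree[keys[-1]] = value


def build_param_tree(flat_params, sep="/"):
    """Turn a flat {name: tensor} checkpoint into a nested structure.

    Hypothetical helper: names such as "blocks/0/attn/w" are split on
    `sep`, and each tensor is placed at the matching nested location.
    """
    tree = {}
    for name, tensor in flat_params.items():
        set_nested(tree, name.split(sep), tensor)
    return tree
```

For example, `build_param_tree({"wte": [1], "blocks/0/attn/w": [2]})` places the second value under `tree["blocks"]["0"]["attn"]["w"]`, so model code can address parameters by structure instead of by flat checkpoint name.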
LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain compliant with organizational policies. The system distinguishes itself through a modular scanner pipeline that utilizes local model orchestration to eliminate external network dependencies. It supports real-time security filtering via streaming chunk analysis and implements a fail-fast execution model
Supports loading model weights from local directories to eliminate the need for downloading assets during startup.