LLM Guard

LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain compliant with organizational policies.

The system distinguishes itself through a modular scanner pipeline that utilizes local model orchestration to eliminate external network dependencies. It supports real-time security filtering via streaming chunk analysis and implements a fail-fast execution model to reduce latency by terminating the pipeline immediately upon detecting a security violation.
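The fail-fast behavior described above can be illustrated with a minimal Python sketch. It is not LLM Guard's actual API: the `Scanner` signature, the two example scanners, and `run_pipeline` are hypothetical names chosen for illustration, and simple string checks stand in for the local models a real scanner would load.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed scanner contract for this sketch: text in, (sanitized_text, is_valid) out.
Scanner = Callable[[str], Tuple[str, bool]]


def ban_substring_scanner(banned: str) -> Scanner:
    """Hypothetical scanner that rejects text containing a banned phrase."""
    def scan(text: str) -> Tuple[str, bool]:
        return text, banned.lower() not in text.lower()
    return scan


def redact_digits_scanner(text: str) -> Tuple[str, bool]:
    """Hypothetical scanner that masks long digit runs and never rejects."""
    return re.sub(r"\d{6,}", "[REDACTED]", text), True


@dataclass
class PipelineResult:
    text: str          # text after every scanner that ran
    is_valid: bool     # False if any scanner reported a violation
    scanners_run: int  # how many scanners executed before stopping


def run_pipeline(scanners: List[Scanner], text: str, fail_fast: bool = True) -> PipelineResult:
    """Run scanners in order; with fail_fast, stop at the first violation."""
    is_valid = True
    ran = 0
    for scanner in scanners:
        text, ok = scanner(text)
        ran += 1
        if not ok:
            is_valid = False
            if fail_fast:
                break  # skip the remaining (possibly expensive) scanners
    return PipelineResult(text, is_valid, ran)
```

In this sketch each scanner receives the output of the previous one, so sanitization steps compose, and a violation in an early, cheap scanner avoids paying for the later ones.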

The project covers a broad range of capability areas, including prompt security, output moderation, and redaction of personally identifiable information (PII). It provides tools for detecting adversarial attacks, validating output consistency and relevance, and preventing data leakage in retrieval-augmented generation (RAG) workflows. It also includes resource management features that limit token counts to prevent denial-of-service attacks.
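As a rough illustration of token limitation, the sketch below rejects and truncates prompts that exceed a budget. It assumes a crude whitespace word count in place of a real tokenizer, and the function names `estimate_tokens` and `enforce_token_limit` are hypothetical rather than taken from the project.

```python
from typing import Tuple


def estimate_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def enforce_token_limit(prompt: str, limit: int) -> Tuple[str, bool]:
    """Return (prompt, True) if within budget, else (truncated prompt, False)."""
    if estimate_tokens(prompt) <= limit:
        return prompt, True
    # Keep only the first `limit` words so downstream work stays bounded.
    return " ".join(prompt.split()[:limit]), False
```

A production check would count tokens with the same tokenizer the target model uses, since word counts can differ substantially from token counts.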

The security interface can be deployed as a containerized Docker image and exposes its scanning capabilities over HTTP for integration with external services.
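To make the idea of exposing a scan over HTTP concrete, here is a small sketch built only on the Python standard library. Everything specific in it is an assumption for illustration: the `/scan` route, the JSON field names, the port, and the single banned-phrase rule are not taken from LLM Guard's real API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

BANNED_PHRASE = "ignore previous instructions"  # illustrative rule only


def handle_scan(body: bytes) -> Tuple[int, dict]:
    """Pure request logic: returns (HTTP status, JSON-serializable payload)."""
    try:
        prompt = json.loads(body)["prompt"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'prompt' field"}
    if not isinstance(prompt, str):
        return 400, {"error": "'prompt' must be a string"}
    is_valid = BANNED_PHRASE not in prompt.lower()
    return 200, {"is_valid": is_valid, "sanitized_prompt": prompt if is_valid else ""}


class ScanHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/scan":  # hypothetical route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_scan(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the sketch server; blocks until interrupted."""
    HTTPServer((host, port), ScanHandler).serve_forever()
```

Keeping the scan logic in a plain function separate from the handler class makes it testable without opening a socket; calling `serve()` would start a blocking server on the assumed port.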

Features

Prompt Injection Detectors - Detects and blocks prompt injection and jailbreak attempts to prevent malicious hijacking of model behavior.
AI Content Filters - Scans and filters model inputs and outputs for toxicity, bias, and harmful content to ensure safety.
LLM Prompt Injection Prevention - Scans and sanitizes user inputs to prevent prompt injections and adversarial attacks.
Local Model Execution - Executes security scans using model files stored on local disk paths to remove external network dependencies.
Local Model Orchestrators - Orchestrates the loading and execution of machine learning models from local disk to eliminate network dependencies.
Model Output Scanning - Inspects generated responses to ensure they are safe and sanitized before delivery to the end user.
Named Entity Recognition - Uses named entity recognition to identify and classify entities for the redaction of personally identifiable information.
Topic Classifiers - Blocks model outputs that address predefined sensitive subjects using a zero-shot classifier.
Text Toxicity Detection - Analyzes text for harmful or offensive language to block toxic content based on configurable thresholds.
Bias Detection Guardrails - Analyzes responses for bias and sensitivity using guardrails to prevent data leakage and maintain safety.
Sensitive Data Redaction - Replaces personally identifiable information or custom patterns in output to prevent data leakage.
Sensitive Data Leakage - Scans responses for sensitive information and removes it to prevent accidental data exposure.
Local Model Loading - Supports loading model weights from local directories to eliminate the need for downloading assets during startup.
Streaming Output Modifiers - Analyzes generated text in real-time chunks to filter streaming responses before delivery.
Content Moderation - Filters model responses for toxicity, bias, and hallucinations to ensure safe content delivery.
Content Sanitization - Uses regular expression patterns to remove or modify undesirable text within model responses.
Data Anonymization - Redacts personally identifiable information and sensitive data in prompts using NER and regular expressions.
Data Leakage Detections - Inspects generated text and streaming chunks to detect and prevent the leakage of sensitive information.
Adversarial Input Sanitization - Analyzes user input for malicious content and returns a sanitized version to ensure safe model interaction.
LLM Safety Enforcers - Acts as an HTTP API that moderates interactions by enforcing security and compliance policies.
LLM Governance and Compliance - Enforces organizational policies by blocking prohibited topics, competitor mentions, and unauthorized languages.
LLM Guardrail Frameworks - Provides a set of scanners for validating the consistency, relevance, and safety of LLM content.
LLM Input Security Scanning - Analyzes user prompts for security risks to prevent attacks or invalid inputs before they reach the model.
PII Data Leakage Prevention - Identifies and redacts sensitive personal information in prompts and responses.
Request Guards - Acts as a security guard that validates requests and blocks unsafe prompts before they reach the AI model.
Secret Detection - Scans inputs for sensitive keys and tokens and replaces them with redactions to prevent data leakage.
LLM Security - Functions as a security layer that scans inputs and outputs to block injections and toxicity.
RAG Pipeline Scanning - Scans retrieved documents in RAG workflows for sensitive information and indirect prompt injections.
Security Scanning Pipelines - Employs a modular scanner pipeline to process text through a sequence of independent security checks.
AI Output Validation - Screens AI-generated content for bias, hallucinations, and malicious URLs to ensure safety and consistency.
Agent Input Defenses - Defends autonomous agents from prompt injections to prevent unauthorized data access or harmful interactions.
Model Output Verifications - Checks for factual consistency and banned topics in model responses to ensure reliable and safe outputs.
Output Consistency Verifications - Verifies that model responses do not contradict the provided prompt to ensure logical consistency.
Output Guardrails - Identifies and flags gibberish or incoherent text using output guardrails to ensure outputs remain intelligible.
Sentiment Analysis Tools - Evaluates the emotional tone of generated text to flag content exceeding intensity thresholds.
Sensitive Word Filters - Provides filters that detect and replace banned or sensitive terms in user prompts with redaction markers.
Zero-Shot Classification Models - Uses pre-trained zero-shot classification models to categorize text into sensitive topics without task-specific training.
Regex Pattern Matching - Applies regular expression rules to sanitize text and identify prohibited substrings or specific data formats.
Real-Time Text Streaming - Inspects generated text in small sequential fragments to provide real-time security filtering for streaming responses.
Input Token Estimators - Verifies input prompt length against token thresholds to prevent resource exhaustion and denial-of-service attacks.
Pipeline Execution Controls - Implements a fail-fast model that terminates the security pipeline immediately after the first violation to reduce latency.
Programming Language Detectors - Identifies programming languages within model outputs to permit or block specific languages.
AI Agent Firewalls - Protects autonomous agents from malicious input manipulations to prevent unauthorized tool execution.
Code Execution Prevention - Removes programming code from user inputs to prevent the execution of potentially harmful scripts.
Competitor Filtering - Identifies and handles references to rival businesses using a customizable list of names to prevent inadvertent promotion.
Custom Security Scan Extensions - Provides a pluggable architecture for defining custom scanning methods to identify specific security risks.
Denial of Service Prevention - Limits the number of tokens in a request to prevent resource exhaustion and ensure system availability.
Topic-Based AI Restrictions - Restricts AI interactions to specific permitted topics using zero-shot classification to block sensitive subjects.
Input Sanitization - Filters input text using regular expression patterns to block prohibited content or enforce specific formats.
Malicious Site Blockers - Analyzes links within generated text for phishing or harmful content to prevent users from visiting dangerous sites.
Retrieval-Augmented Generation Security - Scans retrieved documents and generated answers in RAG workflows for security risks and sensitive data.
Content Moderation Filters - Filters inputs and outputs across multiple providers through a unified proxy to maintain security standards.
Named Entity Recognition Redactions - Uses named entity recognition and regular expressions to anonymize sensitive information.
Pipeline Execution Optimizations - Reduces latency and resource load via fail-fast exits, asynchronous processing, and request sampling.
Input Token Limiting - Validates input length against token thresholds to prevent resource exhaustion and denial-of-service attacks.
Invisible Character Sanitization - Identifies and sanitizes hidden unicode tag characters used to bypass security filters.
Guardrails and AI Safety - Listed in the “Guardrails and AI Safety” section of The Incredible PyTorch awesome list.
Red Teaming and Security - Security toolkit for validating LLM input and output.
Safety and Security - Security toolkit for sanitizing and monitoring model interactions.
Security & Privacy - Comprehensive security toolkit for anonymizing and sanitizing LLM inputs/outputs.
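
The Invisible Character Sanitization entry in the list above refers to hidden Unicode tag characters. A minimal sketch of that idea follows, assuming the characters in question are those of the Unicode Tags block (U+E0000 to U+E007F); the function names are hypothetical and this is not the project's implementation.

```python
import re
from typing import Tuple

# Unicode "Tags" block (U+E0000 to U+E007F): not rendered by most interfaces,
# so text encoded with it can smuggle instructions past visual review.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")


def strip_tag_characters(text: str) -> Tuple[str, bool]:
    """Return (cleaned_text, is_valid); is_valid is False if anything was removed."""
    cleaned = TAG_CHARS.sub("", text)
    return cleaned, cleaned == text


def decode_tag_payload(text: str) -> str:
    """Recover ASCII hidden as tag characters (tag code point minus 0xE0000)."""
    return "".join(
        chr(ord(ch) - 0xE0000) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )
```

Decoding the hidden payload is optional, but logging it can help when reviewing what an attacker tried to inject.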

facebookresearch/PurpleLlama

4,239 stars on GitHub

PurpleLlama is a collection of security toolsets and frameworks designed to audit large language model vulnerabilities and implement runtime input-output guardrails. It provides a security evaluation framework and benchmark suite to quantify risks associated with prompt injections and the generation of malicious code. The project includes a content moderator and input-output filters that use a standardized taxonomy to identify and block harmful content, jailbreaking attempts, and insecure commands. It also features capabilities for sensitive document classification to prevent the unauthorized …

NVIDIA/NeMo-Guardrails

6,453 stars on GitHub

NeMo-Guardrails is a toolkit for adding programmable safety constraints and dialogue boundaries to large language model conversational systems. It functions as security middleware that intercepts inputs and outputs to block prompt injections, jailbreaks, and sensitive data leaks, while providing a conversational dialogue manager to define structured interaction flows through configuration files. The framework includes a hallucination filter to screen model outputs for factual accuracy and a specialized modeling language for defining conversational flows and constraints. It provides …

RunanywhereAI/runanywhere-sdks

8,781 stars on GitHub

This project is an on-device AI SDK providing a framework for running large language models, vision models, and speech models locally. It serves as an orchestration layer for local LLM execution, ensuring data privacy and offline availability by utilizing hardware acceleration on the device. The SDK is distinguished by its comprehensive voice and multimodal capabilities, including a coordinated voice pipeline for activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation and tools for processing …

protectai/llm-guard

Open-source alternatives to LLM Guard

facebookresearch/PurpleLlama

NVIDIA/NeMo-Guardrails

RunanywhereAI/runanywhere-sdks

vllm-project/semantic-router
