OBLITERATUS

Obliteratus is a weight ablation framework and refusal removal tool designed to identify and delete the internal representations responsible for content refusals in large language models without retraining. It functions as a circuit analysis suite that maps the geometric structure of model guardrails to isolate the specific layers and attention heads that enforce refusals.

The project enables the removal of these behaviors through geometric projection, rank-1 adapter ablation for reversible modifications, and the application of steering vectors to alter behavior during inference. It includes automation for configuring projection strengths and layer selection based on real-time analysis of model geometry.
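One common way to picture a reversible rank-1 projection is as a low-rank update that is added to a weight matrix and can later be subtracted again. The sketch below is a minimal illustration under that assumption, not the project's confirmed implementation. The toy matrix, the `direction` vector, and the helper names (`rank1_delta`, `add`, `matvec`) are all hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank1_delta(W, direction, strength=1.0):
    """Rank-1 update -strength * r (r^T W) for a matrix W given as a list of rows.

    r is the unit direction in W's output space. Adding the update to W
    removes (at strength 1.0) the component of every output along r.
    """
    r = unit(direction)
    in_dim = len(W[0])
    # r^T W: how strongly each input column writes along r.
    rtW = [sum(r[i] * W[i][j] for i in range(len(W))) for j in range(in_dim)]
    return [[-strength * r[i] * rtW[j] for j in range(in_dim)] for i in range(len(W))]

def add(A, B, sign=1.0):
    """Elementwise A + sign * B for two equally shaped matrices."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]        # toy weights, not from a real model
direction = [1.0, 0.0, 1.0]                      # hypothetical direction in output space
delta = rank1_delta(W, direction)
W_ablated = add(W, delta)                        # apply the rank-1 update
W_restored = add(W_ablated, delta, sign=-1.0)    # subtract it again to revert

y = matvec(W_ablated, [0.5, -1.0])
component = sum(a * b for a, b in zip(unit(direction), y))  # output component along the direction
```

Keeping `delta` separate from `W` is what makes the change reversible in this sketch: the base weights are recovered by subtracting the same update, and a `strength` below 1.0 removes only part of the component.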

The system supports distributed GPU processing with weight sharding, and remote pipeline execution via SSH. It also provides observability tools that evaluate model coherence by measuring perplexity and refusal rates, and that benchmark removal strategies using topology charts and angular drift.
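The metrics named above have standard textbook definitions, sketched here in plain Python. This is an illustration under stated assumptions rather than the project's own code: the function names are hypothetical, perplexity is computed from per-token natural-log probabilities, angular drift is taken to be the angle between a representation before and after modification, and refusal rate is taken to be the fraction of responses that some external check has already flagged as refusals.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def angular_drift(u, v):
    """Angle in radians between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)

def refusal_rate(flags):
    """Fraction of responses flagged as refusals (flags is a list of booleans)."""
    return sum(1 for f in flags if f) / len(flags)

ppl = perplexity([math.log(0.5)] * 4)            # every token predicted with p = 0.5
drift = angular_drift([1.0, 0.0], [0.0, 1.0])    # orthogonal vectors
rate = refusal_rate([True, False, False, True])
```

A modification that leaves general capabilities intact would be expected to keep perplexity on held-out text close to the original model's value, which is why it is paired with refusal rate instead of being reported alone.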

Features

Censorship Removal - Identifies and deletes internal representations that trigger content refusals to produce unrestricted responses.

Weight Ablation Frameworks - Provides a comprehensive framework for modifying model weights to remove guardrails using rank-1 adapters and projections.

Activation Steering Vectors - Modifies output behavior in real time by applying direction-specific vectors to internal model activations.

Guardrail Geometric Analysis - Maps the geometric structure of safety mechanisms to identify specific layers and components that enforce guardrails.

Attention Circuit Analysis - Provides mechanistic analysis of attention circuits to locate the specific layers and heads governing model refusals.

Directional Ablations - Uses directional ablation workflows to remove specific model behaviors without altering base weights.

Reversible Ablations - Uses rank-1 adapters to remove refusals in a reversible manner without permanent weight changes.

Model Refusal Detections - Identifies the internal representations responsible for content refusal so that they can be deleted to produce unrestricted responses.

Refusal Behavior Deletion - Deletes internal representations responsible for content refusal using weight projection and decomposition without retraining.

Geometric Projection Methods - Identifies and deletes refusal representations by projecting model activations onto a subspace that isolates guardrail mechanisms.

Model Steering Tools - Provides utilities for adjusting model behavior during inference via activation interventions to bypass guardrails.

Ablation Strategy Automation - Analyzes model geometry in real time to automatically configure projection strengths and layer selection.

Guardrail Geometry Mapping - Maps the geometric structure of refusal mechanisms to identify the specific layers and components that enforce guardrails.

Distributed Model Execution - Executes model weight modification workloads across multiple distributed compute devices.

Model Coherence Evaluations - Measures perplexity, coherence, and refusal rates to ensure the model retains general capabilities after modification.

Reversible Weight Ablations - Uses rank-1 adapters to remove refusals, allowing the process to be reversed without permanent weight changes.

Component Ablation Studies - Ships a framework for disabling specific layers and attention heads to identify circuits governing model behaviors.

Multi-GPU Distribution - Implements multi-GPU sharding to overcome memory limitations when ablating weights in large models.

Weight Sharding - Distributes large model weights across multiple GPU devices to manage memory footprints during ablation.

Representation Topology Benchmarking - Evaluates modification success by comparing angular drift and representation heatmaps against the original model geometry.

SSH-Based Remote Execution - Coordinates model modification processes across distributed GPU nodes using secure shell connections.

Model Coherence Evaluation - Measures perplexity and refusal rates to ensure the model retains general capabilities after internal modification.
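The weight sharding entries in the list above describe spreading model weights across several GPUs to stay within memory limits. The listing does not say how placement is decided, so the sketch below assumes a simple greedy strategy (largest layer first, always onto the least-loaded device). The function name `shard_layers`, the layer names, and the sizes are all hypothetical.

```python
def shard_layers(layer_sizes, num_devices):
    """Assign each layer to a device, balancing total size per device.

    layer_sizes maps a layer name to its size (any consistent unit).
    Returns (placement, loads): layer name -> device index, and the
    resulting total size on each device.
    """
    loads = [0] * num_devices
    placement = {}
    # Place the largest layers first so smaller ones can even out the loads.
    for name, size in sorted(layer_sizes.items(), key=lambda kv: kv[1], reverse=True):
        device = loads.index(min(loads))  # least-loaded device so far
        placement[name] = device
        loads[device] += size
    return placement, loads

# Toy sizes, not measurements from a real model.
placement, loads = shard_layers({"a": 4, "b": 3, "c": 2, "d": 1}, num_devices=2)
```

A real pipeline would more likely keep consecutive layers on the same device to limit transfers between GPUs. The greedy version is shown only because it makes the memory-balancing idea easy to see.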
