Heretic is a toolkit for removing safety alignment and refusal behavior from transformer-based language models. It uses directional ablation to suppress refusals, so that the model answers prompts it would otherwise decline.
The project provides a framework for quantifying the effect of these modifications by measuring refusal rates and how far the modified model's behavior diverges from the original. It also includes tools for residual vector analysis, which compute geometric relationships between prompts and visualize hidden states across model layers.
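The measurements described above can be illustrated with a small, self-contained sketch. It is not Heretic's implementation: the function names, the marker list, and the choice of KL divergence and cosine similarity are assumptions made for illustration, since the text does not say which divergence or geometric measures the project uses.

```python
import math

# Hypothetical marker list (an assumption); real refusal detection may differ.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry")


def refusal_rate(responses, markers=REFUSAL_MARKERS):
    """Fraction of responses that contain any refusal marker (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in markers) for r in responses)
    return hits / len(responses)


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. two residual vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy inputs standing in for real model outputs and hidden states.
original = softmax([2.0, 1.0, 0.1])
modified = softmax([1.8, 1.1, 0.3])
print("refusal rate:", refusal_rate(["I cannot help with that.", "Sure."]))
print("KL(original || modified):", kl_divergence(original, modified))
print("cosine similarity:", cosine_similarity([1.0, 0.5], [0.9, 0.6]))
```

In this sketch a lower refusal rate indicates a stronger effect of the modification, while a KL divergence close to zero indicates that the modified model's next-token distribution stays close to the original on the prompts measured.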
Additional capabilities include optimizing model output to filter out stylistic clichés and using contrastive dataset analysis to refine ablation parameters.