# OBLITERATUS Pipeline Efficiency Audit

**Date:** 2026-03-03
**Scope:** All obliteration methods in `abliterate.py` (5,076 lines), `bayesian_optimizer.py`, `informed_pipeline.py`, and 4 ablation strategies.

---

## Executive Summary

The 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH) is architecturally sound with good separation of concerns. Memory hygiene between stages is correct. The rank-1 projection math is efficient. Quantization handling is robust.

**8 concrete efficiency issues found.** Estimated cumulative impact: **~40-60% wall-clock reduction** on typical runs (8B model, advanced/surgical methods). Issues are listed in order of ROI (ease × impact).

---

## HIGH PRIORITY (Fix This Week)

### 1. PROBE runs 1,536 prompts with zero batching

**Location:** `abliterate.py:1074-1088`
**Impact:** Largest single wall-clock bottleneck (~77s on 8B model, reducible to ~10s)

The activation collection loop processes each prompt individually, with a full forward pass and a GC cycle per prompt. With 512 harmful + 512 harmless + 512 jailbreak prompts, that is 1,536 serial forward passes.

The `_free_gpu_memory()` call at line 1086 is **inside the per-prompt loop**, adding ~20ms × 1,536 ≈ 30s of pure garbage collection overhead.

```python
# CURRENT (serial)
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt", ...)
    model(**inputs)
    del inputs
    self._free_gpu_memory()  # <-- 30s wasted
```

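**Fix:** Batch prompts (batch_size=8-16) and move `_free_gpu_memory()` out of the per-prompt loop so it runs every N batches. The hooks' `hidden[:, -1, :]` indexing already handles the batch dimension, but it only reads each prompt's last real token when the batch is left-padded (`tokenizer.padding_side = "left"`). With right padding, position -1 is a pad token for every prompt shorter than the longest one in the batch.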

**Speedup:** ~7-8x on PROBE stage.
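
The loop structure of this fix can be sketched as follows. This is a minimal illustration, not the pipeline's code: `forward_batch` and `free_memory` are hypothetical callables standing in for the tokenize + `model(**inputs)` step and for `_free_gpu_memory()`, and the `batch_size` / `free_every` defaults are illustrative rather than tuned.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def probe_in_batches(
    prompts: Sequence[str],
    forward_batch: Callable[[List[str]], None],  # hypothetical: tokenize + model(**inputs)
    free_memory: Callable[[], None],             # hypothetical stand-in for _free_gpu_memory()
    batch_size: int = 8,
    free_every: int = 16,
) -> int:
    """Run one forward pass per batch and free memory every `free_every` batches."""
    n_batches = 0
    for batch in iter_batches(prompts, batch_size):
        forward_batch(batch)
        n_batches += 1
        if n_batches % free_every == 0:
            free_memory()
    free_memory()  # one final cleanup at the stage boundary
    return n_batches
```

With 1,536 prompts and the defaults above, this makes 192 forward passes and 13 cleanup calls, instead of 1,536 of each.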

---

### 2. VERIFY generates 30 completions sequentially — no batching

**Location:** `abliterate.py:4622-4670`
**Impact:** Second-largest wall-clock cost (~57s on 8B model, reducible to ~15s)

Each of the 30 refusal-test prompts gets an independent `model.generate(max_new_tokens=128)` call. At ~15ms/token on an 8B model, that's 30 × 128 × 15ms ≈ 57s.

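**Fix:** Batch the generation calls (batch_size=4-8). `model.generate()` supports batched inputs natively. The tokenizer already handles padding, but decoder-only models need left padding (`padding_side="left"`) and a defined pad token for batched generation, so both should be confirmed before switching.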

**Speedup:** ~4x on VERIFY stage.
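
The arithmetic behind these estimates can be reproduced with a small cost model. It is an idealized sketch: it assumes every sequence generates the full `max_new_tokens` and that a decode step costs the same `ms_per_token` regardless of batch size, which is optimistic for real hardware. The function name and parameters are illustrative.

```python
import math


def generation_seconds(n_prompts: int, max_new_tokens: int,
                       ms_per_token: float, batch_size: int = 1) -> float:
    """Idealized decode time: one decode step per new token per batch.

    Assumes no early EOS and a per-step latency independent of batch size.
    """
    n_batches = math.ceil(n_prompts / batch_size)
    return n_batches * max_new_tokens * ms_per_token / 1000.0


serial = generation_seconds(30, 128, 15.0)                   # 57.6 s
batched_4 = generation_seconds(30, 128, 15.0, batch_size=4)  # 15.36 s
batched_8 = generation_seconds(30, 128, 15.0, batch_size=8)  # 7.68 s
```

Under this model the ~15s target corresponds to batch_size=4. Larger batches help only as far as per-step latency stays flat.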

---

### 3. SAE training is forced to CPU with no early stopping

**Location:** `abliterate.py:1579-1583`
**Impact:** Moderate — adds ~20-40s per run when SAE features are enabled (surgical, nuclear methods)

SAE training runs 30 fixed epochs per strong layer on CPU. With 15-20 strong layers, that's 450-600 CPU training epochs. No convergence check, no early stopping.

The `device="cpu"` is overly conservative: the memory-aware cap at lines 1570-1578 already validates GPU headroom, and a typical SAE encoder (expansion=2, hidden_dim=4096) is only ~128MB (4096 × 8192 weights, assuming fp32).

**Fix:**
1. Add early stopping when reconstruction loss plateaus (< 0.1% improvement over 3 epochs)
2. Use GPU when `free_mb > sae_mem_mb + 1024` (1GB headroom)
3. Reduce default epochs from 30 to 15 with convergence guard
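
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `run_epoch` is a hypothetical callable that trains one SAE epoch and returns its reconstruction loss, and all function names here are invented for the example.

```python
from typing import Callable, List, Sequence


def choose_sae_device(free_mb: float, sae_mem_mb: float,
                      headroom_mb: float = 1024.0) -> str:
    """Use the GPU only when the SAE fits with `headroom_mb` to spare."""
    return "cuda" if free_mb > sae_mem_mb + headroom_mb else "cpu"


def has_plateaued(losses: Sequence[float], patience: int = 3,
                  min_rel_improvement: float = 1e-3) -> bool:
    """True when the loss improved by less than 0.1% over the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    reference = losses[-patience - 1]
    if reference <= 0.0:
        return True  # nothing left to improve
    return (reference - losses[-1]) / reference < min_rel_improvement


def train_with_early_stopping(run_epoch: Callable[[], float],
                              max_epochs: int = 15) -> List[float]:
    """Call `run_epoch` (hypothetical: one SAE epoch, returns loss) until plateau."""
    losses: List[float] = []
    for _ in range(max_epochs):
        losses.append(run_epoch())
        if has_plateaued(losses):
            break
    return losses
```

The epoch cap of 15 acts as the upper bound, and the plateau check ends training earlier on layers that converge quickly.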

---

## MEDIUM PRIORITY (Fix This Sprint)

### 4. `_distill_inner()` is a degraded copy of `_distill()` — drops half the SOTA techniques

**Location:** `abliterate.py:2958-3055` vs `1102-1750`
**Impact:** Quality regression on refinement passes 2+, not pure compute waste

The iterative refinement path calls `_distill_inner()` which is a simplified ~100-line copy that skips: Wasserstein-optimal extraction, layer-adaptive strength, float layer interpolation, SAE features, EGA, CoT-aware orthogonalization, and RDO refinement.

This means "true iterative refinement" actually produces **worse directions on later passes** because it drops the analysis-guided enhancements.

**Fix:** Extract shared SVD/direction logic into `_extract_directions(full_features=True/False)` and call from both paths. At minimum, keep whitened SVD and jailbreak-contrastive blending in the inner path.

---

### 5. Bayesian optimizer clones ALL weight tensors — ~7GB memory overhead

**Location:** `bayesian_optimizer.py:300-341`
**Impact:** Memory pressure on GPU-constrained setups; 50× full-restore cycles

The optimizer saves a complete clone of every weight tensor across all strong layers. For a 7B model with 32 layers, that's ~7GB of clones sitting in memory during all 50 trials.

After each trial, `_restore_all()` copies all clones back — 50 trials × full-model memcpy.

**Fix (easy):** Only clone weights in `_strong_layers` (already partially done, but `named_parameters()` crawl still catches everything). Drop the `seen_data_ptrs` set once the loop is tightened.

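**Fix (better):** Store the rank-1 factors of the projection per layer instead of cloning the full weight: the direction `d` and `coeff = d^T @ W`, from which `Δ = scale * d @ coeff` can be rebuilt on demand. Rollback = `W += Δ`. This reduces storage from O(hidden_dim²) to O(hidden_dim) per direction per layer. Storing `Δ` itself would not help, since it is as large as `W`.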
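
A minimal NumPy sketch of rollback from rank-1 factors, assuming `W` has shape `(out_dim, in_dim)` and `d` is a unit vector of length `out_dim`. NumPy is used only to keep the example dependency-light, and the function names are illustrative, not the optimizer's API.

```python
import numpy as np


def project_out(W: np.ndarray, d: np.ndarray, scale: float = 1.0):
    """Apply W -= scale * d (d^T W) in place; return the factors needed to undo it."""
    coeff = d @ W                      # d^T W, shape (in_dim,)
    W -= scale * np.outer(d, coeff)    # rank-1 update
    return d.copy(), coeff, scale      # O(out_dim + in_dim) storage


def rollback(W: np.ndarray, record) -> None:
    """Add the recorded rank-1 term back onto W in place."""
    d, coeff, scale = record
    W += scale * np.outer(d, coeff)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
W_original = W.copy()                  # kept only to check the round trip
d = rng.standard_normal(64)
d /= np.linalg.norm(d)

record = project_out(W, d, scale=0.8)
rollback(W, record)
max_error = float(np.abs(W - W_original).max())  # small, but not bit-exact
```

The round trip is accurate only up to floating-point rounding. With fp16/bf16 weights that error could accumulate over 50 trials, so it may be worth keeping the factors in fp32 or refreshing from a single clone periodically.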

---

### 6. Norm computation in `_project_out_advanced()` traverses the full matrix twice

**Location:** `abliterate.py:3477-3486`
**Impact:** ~4,800 unnecessary full-matrix norm computations per run (8-direction surgical)

When `norm_preserve=True`, the code computes `W.norm()` before projection and `W.norm()` after projection. Each norm traverses the full weight matrix (16M elements for 4096×4096).

With 8 directions × 30 layers × 10 weight matrices = 2,400 projections → 4,800 norm calls → 77 billion unnecessary FLOPs.

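**Fix:** Let `coeff = d^T @ W`. After the rank-1 update `W' = W - scale * d @ coeff`, the squared Frobenius norm satisfies:
`||W'||² = ||W||² - 2·scale·||coeff||² + scale²·||coeff||²·||d||²`

Since `||d|| = 1`: `||W'||² = ||W||² - scale·(2 - scale)·||coeff||²`

This replaces a 16M-element norm with a single `coeff.pow(2).sum()` call (~4K FLOPs). `||W||²` only has to be computed once per matrix; it can then be carried forward analytically through every subsequent direction.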
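
The identity can be checked numerically. The sketch below uses NumPy with small illustrative shapes and a hypothetical helper `updated_norm_sq`. It computes the full-matrix norm once, updates it analytically across 8 directions, and compares the result against a direct recomputation.

```python
import numpy as np


def updated_norm_sq(norm_sq: float, coeff: np.ndarray, scale: float) -> float:
    """Squared Frobenius norm after W' = W - scale * d (d^T W), with ||d|| = 1."""
    new = norm_sq - scale * (2.0 - scale) * float(np.sum(coeff * coeff))
    return max(new, 0.0)  # guard against tiny negative values from rounding


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
norm_sq = float(np.sum(W * W))         # the only full-matrix pass

for _ in range(8):                     # e.g. 8 directions on one matrix
    d = rng.standard_normal(64)
    d /= np.linalg.norm(d)
    coeff = d @ W
    W -= 0.7 * np.outer(d, coeff)
    norm_sq = updated_norm_sq(norm_sq, coeff, 0.7)

direct = float(np.sum(W * W))          # recomputed only to check the identity
```

In low precision the subtraction can lose accuracy when most of the norm is removed, so accumulating `norm_sq` in fp32 or higher is a reasonable precaution.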

---

## LOW PRIORITY (Backlog)

### 7. Gram-Schmidt appears 3 times as O(n²) nested loops

**Location:** `abliterate.py:1168-1173`, `1361-1367`, `3038-3044`
**Impact:** Minimal compute but code quality issue

Three separate implementations of the same Gram-Schmidt orthogonalization with nested Python loops. With n_directions=8, it's 28 dot products per call — trivial compute but (a) DRY violation, (b) numerically inferior to `torch.linalg.qr()`.

**Fix:** Extract to `_orthogonalize_subspace(sub: Tensor) -> Tensor` using QR decomposition. Single implementation, single test, better numerics.
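
A minimal sketch of such a helper, written with NumPy to stay dependency-light. `torch.linalg.qr` returns `(Q, R)` in the same reduced form, so the structure should carry over. It assumes the rows of `sub` are linearly independent and `n_dirs <= hidden_dim`. The sign correction is an addition for this example, so that the first direction keeps its orientation as it would under Gram-Schmidt.

```python
import numpy as np


def orthogonalize_subspace(sub: np.ndarray) -> np.ndarray:
    """Return an orthonormal basis for the rows of `sub`, shape (n_dirs, hidden_dim)."""
    q, r = np.linalg.qr(sub.T)          # reduced QR: q is (hidden_dim, n_dirs)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return (q * signs).T                # flip columns so diag(R) >= 0, as in Gram-Schmidt


rng = np.random.default_rng(0)
sub = rng.standard_normal((8, 64))      # 8 directions in a 64-dim space
basis = orthogonalize_subspace(sub)
gram = basis @ basis.T                  # close to the 8x8 identity
```

Because QR is determined only up to column signs, the sign step matters for any caller that depends on direction orientation (for example reflection rather than projection).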

---

### 8. Pre-EXCISE baseline KL capture re-forward-passes 100 prompts already seen in PROBE

**Location:** `abliterate.py:2313-2366`
**Impact:** ~700ms wasted (minor)

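`_capture_baseline_kl_logits()` runs 100 harmless prompts through the model to capture pre-EXCISE logits. PROBE already ran those same prompts and captured hidden states at every layer, so the logits could be recomputed from the last layer's hidden state with a single `lm_head` matmul.

**Fix:** During PROBE, keep the per-prompt last-layer activations for the 100 harmless KL prompts, then compute `baseline_logits = model.lm_head(...)` on those cached activations and skip the 100-prompt forward pass entirely. Per-class means such as `harmful_means[last_layer]` cannot substitute, because KL is measured per prompt and on the harmless set. If the hooks capture the layer output before the model's final norm, apply that norm before `lm_head`.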
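
A rough NumPy sketch of recomputing baseline log-probabilities from cached activations. Several things here are assumptions, not facts about the codebase: that per-prompt last-token hidden states are available as an `(n_prompts, hidden)` array, that the model ends in a Llama-style RMSNorm followed by a bias-free `lm_head`, and all function names.

```python
import numpy as np


def rms_norm(h: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Llama-style RMSNorm over the last axis (assumed final norm)."""
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps) * weight


def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def baseline_log_probs(last_hidden: np.ndarray, norm_weight: np.ndarray,
                       lm_head_weight: np.ndarray) -> np.ndarray:
    """last_hidden: (n_prompts, hidden); lm_head_weight: (vocab, hidden)."""
    logits = rms_norm(last_hidden, norm_weight) @ lm_head_weight.T
    return log_softmax(logits)


rng = np.random.default_rng(0)
cached = rng.standard_normal((100, 32))          # 100 harmless prompts, toy hidden size
log_probs = baseline_log_probs(cached, np.ones(32), rng.standard_normal((50, 32)))
```

The result has one row per prompt, which is what a per-prompt KL comparison against post-EXCISE logits needs.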

---

## What's Done Well

| Area | Assessment |
|------|------------|
| **Stage-boundary memory cleanup** | Correct — `_free_gpu_memory()` + explicit dict clearing between stages |
| **Rank-1 projection math** | Efficient — `W @ d` then `d.T * coeff` instead of materializing `I - dd^T` |
| **Quantization dequant/requant** | Robust — handles bitsandbytes NF4, GPTQ, AWQ; fails loudly on unsupported formats |
| **Incremental expert mean** | Smart — Welford running mean in `_transplant_expert_weights()` avoids stacking all expert weights |
| **Router stabilization** | Defensive — `_stabilize_router_weights()` after MoE projection prevents CUDA crashes |
| **Large model mode** | Pragmatic — caps directions, SAE features, refinement passes for 120B+ models |
| **Event emission** | Clean — `_emit()` / `_on_stage()` / `_on_log()` callbacks for UI integration without coupling |

---

## Method Efficiency Comparison

| Method | PROBE Cost | DISTILL Cost | EXCISE Cost | VERIFY Cost | Primary Bottleneck |
|--------|-----------|-------------|-------------|-------------|-------------------|
| **basic** | 1x (1,024 prompts) | 1x (diff-in-means) | 1x (~10 projections) | 1x | PROBE |
| **advanced** | 2x (re-probe on pass 2) | 2x (re-distill) | 2x (2 passes) | 1x | PROBE × 2 |
| **aggressive** | 3x (re-probe on passes 2,3) | 3x (re-distill) | 3x (3 passes, 8 dirs) | 1x | PROBE × 3 |
| **surgical** | 1.5x (+jailbreak prompts) | 2x (SAE training) | 2x (head surgery + EGA) | 1x | SAE on CPU |
| **optimized** | 1.5x (+jailbreak) | 1x | 50x (Bayesian trials) | 1x | Bayesian optimizer |
| **inverted** | 1.5x (+jailbreak) | 1x | 2x (reflection math) | 1x | PROBE |
| **nuclear** | 1.5x (+jailbreak) | 2x (SAE) | 3x (all techniques) | 1x | SAE + PROBE |
| **informed** | 1x | 1.5x (analysis modules) | 1x-3x (dynamic) | 1.5x (Ouroboros check) | Analysis modules |

---

## Prioritized Action Plan

1. **Batch PROBE forward passes** — immediate 7-8x speedup on largest bottleneck
2. **Batch VERIFY generation** — immediate 4x speedup on second bottleneck
3. **Add SAE early stopping + GPU path** — 2-3x speedup on SAE training in SAE-enabled methods
4. **Unify `_distill` / `_distill_inner`** — quality fix, prevents direction degradation
5. **Optimize Bayesian rollback storage** — memory fix for GPU-constrained users
6. **Analytical norm computation** — eliminates 77B unnecessary FLOPs
7. **DRY Gram-Schmidt** — code quality
8. **Cache KL baseline from PROBE** — minor speedup