# OBLITERATUS Pipeline Efficiency Audit

**Date:** 2026-03-03

**Scope:** All obliteration methods in `abliterate.py` (5,076 lines), `bayesian_optimizer.py`, `informed_pipeline.py`, and 4 ablation strategies.

---

## Executive Summary

The 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH) is architecturally sound with good separation of concerns. Memory hygiene between stages is correct. The rank-1 projection math is efficient. Quantization handling is robust.

**8 concrete efficiency issues found.** Estimated cumulative impact: **~40-60% wall-clock reduction** on typical runs (8B model, advanced/surgical methods). Issues are ordered by ROI (ease × impact).

---

## HIGH PRIORITY (Fix This Week)

### 1. PROBE runs 1,536 prompts with zero batching

**Location:** `abliterate.py:1074-1088`

**Impact:** Largest single wall-clock bottleneck (~77s on 8B model, reducible to ~10s)

The activation collection loop processes each prompt individually, with a full forward pass and a GC cycle per prompt. With 512 harmful + 512 harmless + 512 jailbreak prompts, that is 1,536 serial forward passes. The `_free_gpu_memory()` call at line 1086 is **inside the per-prompt loop**, adding ~20ms × 1,536 ≈ 30s of pure garbage collection overhead.

```python
# CURRENT (serial)
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, return_tensors="pt", ...)
    model(**inputs)
    del inputs
    self._free_gpu_memory()  # <-- 30s wasted
```

**Fix:** Batch prompts (batch_size=8-16). Hooks already handle the batch dimension via `hidden[:, -1, :]`, provided the tokenizer pads on the left (`padding_side="left"`); with right padding, position `-1` would read a pad token for the shorter prompts in a batch. Move `_free_gpu_memory()` to run every N batches, not every prompt.

**Speedup:** ~7-8x on PROBE stage.

---

### 2. VERIFY generates 30 completions sequentially, with no batching

**Location:** `abliterate.py:4622-4670`

**Impact:** Second-largest wall-clock cost (~57s on 8B model, reducible to ~15s)

Each of the 30 refusal-test prompts gets an independent `model.generate(max_new_tokens=128)` call. At ~15ms/token on an 8B model, and assuming each completion runs to the full 128 tokens, that's 30 × 128 × 15ms ≈ 57s.

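
The 57s figure is a plain product of prompt count, token budget, and per-token latency. A minimal sketch of that arithmetic makes the assumptions explicit. `estimate_generate_seconds` and its `step_slowdown` parameter are hypothetical names introduced here for illustration, not part of the pipeline, and the value `2.0` used for the batched case is an arbitrary placeholder rather than a measurement.

```python
import math


def estimate_generate_seconds(
    n_prompts: int,
    max_new_tokens: int,
    sec_per_token: float,
    batch_size: int = 1,
    step_slowdown: float = 1.0,
) -> float:
    """Rough wall-clock estimate for autoregressive generation.

    Assumes every completion runs to max_new_tokens (an upper bound) and that
    one decoding step over a whole batch costs sec_per_token * step_slowdown.
    step_slowdown is a hypothetical fudge factor, not a measured value.
    """
    n_batches = math.ceil(n_prompts / batch_size)
    return n_batches * max_new_tokens * sec_per_token * step_slowdown


# Serial case from the audit: 30 prompts x 128 tokens x 15 ms/token.
serial = estimate_generate_seconds(30, 128, 0.015)

# Batched case under an assumed 2x per-step slowdown (illustrative only).
batched = estimate_generate_seconds(30, 128, 0.015, batch_size=8, step_slowdown=2.0)

print(f"serial:  {serial:.1f}s")
print(f"batched: {batched:.1f}s")
```

The real per-step cost of a batched decode depends on hardware and sequence lengths, so the batched number should be measured rather than trusted from this model.
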

**Fix:** Batch the generation calls (batch_size=4-8). `model.generate()` supports batched inputs natively. The tokenizer already handles padding; for decoder-only models it should pad on the left so that each completion continues from real tokens rather than pad tokens.

**Speedup:** ~4x on VERIFY stage.

---

### 3. SAE training is forced to CPU with no early stopping

**Location:** `abliterate.py:1579-1583`

**Impact:** Moderate: adds ~20-40s per run when SAE features are enabled (surgical, nuclear methods)

SAE training runs 30 fixed epochs per strong layer on CPU. With 15-20 strong layers, that's 450-600 CPU training epochs, with no convergence check and no early stopping. The `device="cpu"` setting is overly conservative: the memory-aware cap at lines 1570-1578 already validates GPU headroom, and a typical SAE encoder (expansion=2, hidden_dim=4096) is only ~128MB (4096 × 8192 weights in fp32).

**Fix:**

1. Add early stopping when reconstruction loss plateaus (< 0.1% improvement over 3 epochs)
2. Use GPU when `free_mb > sae_mem_mb + 1024` (1GB headroom)
3. Reduce default epochs from 30 to 15 with convergence guard

---

## MEDIUM PRIORITY (Fix This Sprint)

### 4. `_distill_inner()` is a degraded copy of `_distill()` that drops half the SOTA techniques

**Location:** `abliterate.py:2958-3055` vs `1102-1750`

**Impact:** Quality regression on refinement passes 2+, not pure compute waste

The iterative refinement path calls `_distill_inner()`, a simplified ~100-line copy that skips Wasserstein-optimal extraction, layer-adaptive strength, float layer interpolation, SAE features, EGA, CoT-aware orthogonalization, and RDO refinement. As a result, "true iterative refinement" actually produces **worse directions on later passes**, because it drops the analysis-guided enhancements.

**Fix:** Extract shared SVD/direction logic into `_extract_directions(full_features=True/False)` and call from both paths. At minimum, keep whitened SVD and jailbreak-contrastive blending in the inner path.

---

### 5. Bayesian optimizer clones ALL weight tensors (~7GB memory overhead)

**Location:** `bayesian_optimizer.py:300-341`

**Impact:** Memory pressure on GPU-constrained setups; 50× full-restore cycles

The optimizer saves a complete clone of every weight tensor across all strong layers. For a 7B model with 32 layers, that's ~7GB of clones sitting in memory during all 50 trials. After each trial, `_restore_all()` copies all clones back: 50 trials × full-model memcpy.

**Fix (easy):** Only clone weights in `_strong_layers` (already partially done, but the `named_parameters()` crawl still catches everything). Drop the `seen_data_ptrs` set once the loop is tightened.

**Fix (better):** Store the projection delta `Δ = scale * d @ (d^T @ W)` per layer instead of cloning the full weight, keeping it in factored form: the direction `d` and the row vector `coeff = d^T @ W`. A dense `Δ` would be as large as `W` itself. Rollback = `W += scale * d @ coeff`, which restores the weight up to floating-point rounding. This reduces storage from O(hidden_dim²) to O(hidden_dim) per direction per layer.

---

### 6. Norm computation in `_project_out_advanced()` traverses the full matrix twice

**Location:** `abliterate.py:3477-3486`

**Impact:** ~4,800 unnecessary full-matrix norm computations per run (8-direction surgical)

When `norm_preserve=True`, the code computes `W.norm()` before projection and `W.norm()` after projection. Each norm traverses the full weight matrix (16M elements for 4096×4096). With 8 directions × 30 layers × 10 weight matrices = 2,400 projections → 4,800 norm calls → ~77 billion unnecessary FLOPs.

**Fix:** After the rank-1 update `W' = W - scale * d @ (d^T @ W)`, the new (Frobenius) norm satisfies:

`||W'||² = ||W||² - 2·scale·||d^T @ W||² + scale²·||d^T @ W||²·||d||²`

Since `||d|| = 1`, and writing `coeff = d^T @ W`:

`||W'||² = ||W||² - scale·(2 - scale)·||coeff||²`

This replaces a 16M-element norm with a single `coeff.pow(2).sum()` call (~4K FLOPs). `||W||²` needs to be computed only once per matrix; after that it can be updated analytically from one direction to the next.

---

## LOW PRIORITY (Backlog)

### 7. Gram-Schmidt appears 3 times as O(n²) nested loops

**Location:** `abliterate.py:1168-1173`, `1361-1367`, `3038-3044`

**Impact:** Minimal compute but code quality issue

Three separate implementations of the same Gram-Schmidt orthogonalization use nested Python loops. With n_directions=8, that is 28 dot products per call: trivial compute, but (a) a DRY violation and (b) numerically inferior to `torch.linalg.qr()`.

**Fix:** Extract to `_orthogonalize_subspace(sub: Tensor) -> Tensor` using QR decomposition. Single implementation, single test, better numerics.

---

### 8. Pre-EXCISE baseline KL capture re-forward-passes 100 prompts already seen in PROBE

**Location:** `abliterate.py:2313-2366`

**Impact:** ~700ms wasted (minor)

`_capture_baseline_kl_logits()` runs 100 harmless prompts through the model to capture pre-EXCISE logits. But PROBE already ran those same prompts and captured hidden states at every layer. The logits could be computed as `lm_head(last_hidden_state)`, a single matmul.

**Fix:** After PROBE, compute the baseline logits by applying `model.lm_head` to the cached last-layer activations of those 100 harmless prompts. This needs the per-prompt harmless activations rather than `harmful_means[last_layer]`: the baseline is taken on harmless prompts, and a mean vector would give one logit vector instead of one per prompt. If the hooks capture the last decoder layer's output before the model's final norm, apply that norm first. Skip the 100-prompt forward pass entirely.

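
The analytical norm update from item 6 can be checked numerically. The sketch below uses plain Python on a small random matrix rather than the pipeline's PyTorch tensors. `frob_sq`, `project_out`, and the matrix sizes are illustrative names and values chosen here, not taken from `abliterate.py`.

```python
import math
import random


def frob_sq(W):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in W for x in row)


def project_out(W, d, scale):
    """Return W' = W - scale * d (d^T W) together with coeff = d^T W."""
    n_rows, n_cols = len(W), len(W[0])
    coeff = [sum(d[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    W_new = [
        [W[i][j] - scale * d[i] * coeff[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return W_new, coeff


random.seed(0)
rows, cols, scale = 6, 5, 0.7  # arbitrary illustrative sizes

W = [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Random unit-norm direction d (the identity below relies on ||d|| = 1).
d = [random.gauss(0.0, 1.0) for _ in range(rows)]
d_norm = math.sqrt(sum(x * x for x in d))
d = [x / d_norm for x in d]

W_new, coeff = project_out(W, d, scale)

# Direct route: traverse every element of W'.
direct = frob_sq(W_new)

# Analytical route: only the cached ||W||^2 and the coeff vector are needed.
analytic = frob_sq(W) - scale * (2.0 - scale) * sum(c * c for c in coeff)

print(math.isclose(direct, analytic, rel_tol=1e-9))
```

In a tensor version, `sum(c * c for c in coeff)` corresponds to the `coeff.pow(2).sum()` call mentioned in item 6, and `coeff` is already available as a by-product of the projection.
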

---

## What's Done Well

| Area | Assessment |
|------|------------|
| **Stage-boundary memory cleanup** | Correct: `_free_gpu_memory()` + explicit dict clearing between stages |
| **Rank-1 projection math** | Efficient: `W @ d` then `d.T * coeff` instead of materializing `I - dd^T` |
| **Quantization dequant/requant** | Robust: handles bitsandbytes NF4, GPTQ, AWQ; fails loudly on unsupported formats |
| **Incremental expert mean** | Smart: Welford running mean in `_transplant_expert_weights()` avoids stacking all expert weights |
| **Router stabilization** | Defensive: `_stabilize_router_weights()` after MoE projection prevents CUDA crashes |
| **Large model mode** | Pragmatic: caps directions, SAE features, refinement passes for 120B+ models |
| **Event emission** | Clean: `_emit()` / `_on_stage()` / `_on_log()` callbacks for UI integration without coupling |

---

## Method Efficiency Comparison

| Method | PROBE Cost | DISTILL Cost | EXCISE Cost | VERIFY Cost | Primary Bottleneck |
|--------|-----------|-------------|-------------|-------------|-------------------|
| **basic** | 1x (1,024 prompts) | 1x (diff-in-means) | 1x (~10 projections) | 1x | PROBE |
| **advanced** | 2x (re-probe on pass 2) | 2x (re-distill) | 2x (2 passes) | 1x | PROBE × 2 |
| **aggressive** | 3x (re-probe on passes 2,3) | 3x (re-distill) | 3x (3 passes, 8 dirs) | 1x | PROBE × 3 |
| **surgical** | 1.5x (+jailbreak prompts) | 2x (SAE training) | 2x (head surgery + EGA) | 1x | SAE on CPU |
| **optimized** | 1.5x (+jailbreak) | 1x | 50x (Bayesian trials) | 1x | Bayesian optimizer |
| **inverted** | 1.5x (+jailbreak) | 1x | 2x (reflection math) | 1x | PROBE |
| **nuclear** | 1.5x (+jailbreak) | 2x (SAE) | 3x (all techniques) | 1x | SAE + PROBE |
| **informed** | 1x | 1.5x (analysis modules) | 1x-3x (dynamic) | 1.5x (Ouroboros check) | Analysis modules |

---

## Prioritized Action Plan

1. **Batch PROBE forward passes**: immediate 7-8x speedup on largest bottleneck
2. **Batch VERIFY generation**: immediate 4x speedup on second bottleneck
3. **Add SAE early stopping + GPU path**: 2-3x speedup on SAE-enabled methods
4. **Unify `_distill` / `_distill_inner`**: quality fix, prevents direction degradation
5. **Optimize Bayesian rollback storage**: memory fix for GPU-constrained users
6. **Analytical norm computation**: eliminates 77B unnecessary FLOPs
7. **DRY Gram-Schmidt**: code quality
8. **Cache KL baseline from PROBE**: minor speedup
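
Action item 1 amounts to chunking the prompt list and moving cleanup out of the inner loop. A minimal, framework-free sketch of that control flow follows. `run_batch` and `free_memory` are hypothetical callbacks standing in for the batched forward pass and `_free_gpu_memory()`, and `probe_in_batches`, `iter_batches`, and `cleanup_every` are names assumed here for illustration.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def probe_in_batches(
    prompts: Sequence[str],
    run_batch: Callable[[List[str]], None],
    free_memory: Callable[[], None],
    batch_size: int = 16,
    cleanup_every: int = 8,
) -> int:
    """Run prompts in batches, calling free_memory only every N batches.

    Returns the number of batches processed.
    """
    n_batches = 0
    for n_batches, batch in enumerate(iter_batches(prompts, batch_size), start=1):
        run_batch(batch)
        if n_batches % cleanup_every == 0:
            free_memory()
    if n_batches % cleanup_every != 0:
        free_memory()  # final cleanup for a trailing partial group
    return n_batches


# Count calls with dummy callbacks for the audit's 1,536-prompt case.
calls = {"forward": 0, "cleanup": 0}


def _fake_forward(batch: List[str]) -> None:
    calls["forward"] += 1


def _fake_cleanup() -> None:
    calls["cleanup"] += 1


prompts = [f"prompt {i}" for i in range(1536)]
probe_in_batches(prompts, _fake_forward, _fake_cleanup)
print(calls)
```

With these defaults the 1,536 prompts become 96 forward calls and 12 cleanup calls, instead of 1,536 of each in the serial loop. The right `batch_size` and `cleanup_every` depend on available GPU memory and would need tuning per model.
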