Instructions to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF",
	filename="Qwen3.6-27B-Omnimerge-v4-F16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Use Docker

docker model run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Ollama:
```
ollama run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
```

Unsloth Studio

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
```

Lemonade

How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-27B-Omnimerge-v4-MTP-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-27B-Omnimerge-v4-MTP-GGUF

GGUF quantizations of ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 with the MTP (Multi-Token Prediction) head retained for self-speculative decoding on llama.cpp mainline (PR #22673, merged 2026-05-16) and later.

Companion to the standard-decode release at ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF. The two repos contain identical merged weights — this one keeps the additional mtp.* tensors that convert_hf_to_gguf.py remaps to blk.{num_hidden_layers}.* per llama.cpp PR #22673 ("llama + spec: MTP Support", merged 2026-05-16), so --spec-type draft-mtp works out of the box. All quants made with imatrix using bartowski's calibration_datav5; imatrix.dat archived alongside the quants for reproducibility/audit.

Available Quantizations

Quantization	Size (GiB)	Notes
F16	50.90	full-precision reference
Q8_0	27.05
Q6_K	20.89	recommended speed/quality balance
Q5_K_M	18.19
Q4_K_M	15.66
IQ4_XS	14.26
IQ3_M	11.89
IQ2_M	9.54	MTP head forced to Q4_K — see note below

IQ2_M MTP-head override. The 7 MTP-head tensors (blk.64.attn_{k,q,v,output}.weight, blk.64.ffn_{down,gate,up}.weight, blk.64.nextn.eh_proj.weight) are overridden to Q4_K instead of the K-mix's default IQ2_S/IQ3_S. Reason: llama-imatrix only activates the standard text-decode path, so the MTP draft head accumulates zero importance entries; llama-quantize then refuses very-low-bit quants (IQ1_*, IQ2_*) on those tensors and bails mid-write (producing a deceptively size-correct but zero-header file). Forcing the MTP block to Q4_K is the cheap workaround — adds ~180 MB versus pure IQ2_S (9.54 GiB vs std-release IQ2_M ~9.32 GiB), keeps the file complete and the MTP path bit-exact. The other 859 tensors retain the IQ2_M K-mix. Recipe (rebuild from F16 in this repo):
llama-quantize --imatrix imatrix.dat \
  --tensor-type blk.64.attn_k.weight=q4_K \
  --tensor-type blk.64.attn_q.weight=q4_K \
  --tensor-type blk.64.attn_output.weight=q4_K \
  --tensor-type blk.64.ffn_down.weight=q4_K \
  --tensor-type blk.64.ffn_gate.weight=q4_K \
  --tensor-type blk.64.ffn_up.weight=q4_K \
  --tensor-type blk.64.nextn.eh_proj.weight=q4_K \
  Qwen3.6-27B-Omnimerge-v4-F16.gguf Qwen3.6-27B-Omnimerge-v4-IQ2_M.gguf IQ2_M
Tiers from IQ3_M and up don't have the strict imatrix requirement; the MTP block falls back gracefully without overrides.

Also published as ollama tags: mannix/omnimerge-v4-mtp.

How to Use — MTP speculative decoding

Stock llama.cpp containing PR #22673 ("llama + spec: MTP Support", merged 2026-05-16) or later. Confirmed working on commit bb28c1f. Older commits without this PR will load the weights but ignore the mtp.* head — you'll get standard decode with no error, just no speedup.

llama-server (recommended)

llama-server -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
    -c 16384 -ngl 99 \
    --parallel 1 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --port 8099

Key flags:

--spec-type draft-mtp — enables MTP self-speculative decoding using the included mtp.* head.
--spec-draft-n-max 3 — how many tokens the MTP head proposes per step. 3 is the sweet spot; higher values increase verification cost without gaining acceptance.
-c 16384 --parallel 1 — tuned for a 24 GB GPU (e.g. RTX 3090) running the Q6_K weights (≈ 21 GB) + draft buffer + KV. Bump -c to 32768+ and --parallel 2 on a 32 GB+ GPU.

llama-cli

llama-cli -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
    --spec-type draft-mtp --spec-draft-n-max 3 \
    -p "Write a Python function that ..." -n 512

Without MTP

This repo also works as a drop-in for the standard release — just omit the --spec-type flag. You'll get identical pass@1 quality at standard decode speed.

Benchmark Results (Q6_K, MTP vs standard)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama-server running this Q6_K against the identical-weights Q6_K from the standard-decode release. Two configs evaluated:

std = standard release Q6_K, --parallel 2 -c 65536 (no spec)
MTP = this release Q6_K, --spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384

Both use --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0. Pass@1 is the lm_eval rescored number after <think>...</think>-block stripping (necessary for this reasoning model — raw lm_eval exec(prompt + completion + tests) SyntaxErrors on the literal < in <think>).

Decode tokens/sec is the aggregate decode throughput as measured by the per-completion print_timing lines on llama-server stderr (sum-of-decode-tokens ÷ sum-of-decode-seconds across the bench).

Benchmark	std Q6_K pass@1	MTP Q6_K pass@1	std tok/s (agg)	MTP tok/s (agg)	MTP speedup
HumanEval (164q, 0-shot)	83.54 % (137/164)	83.54 % (137/164)	29.85	60.22	2.02 ×
MBPP (500q, 3-shot)	73.00 % (365/500)	75.00 % (375/500)	24.33	56.75	2.33 ×
GPQA Diamond (198q, 0-shot CoT)†	78.28 % (155/198)	77.78 % (154/198)	26.24	56.59	2.16 ×

† GPQA Diamond reported as flexible-extract (the canonical metric — the model's chain-of-thought ends in a free-form answer that the regex extractor parses). Companion strict-match (exact final-token match) is 7.58 % std / 9.60 % MTP — both quite low because the model emits CoT verbosely rather than the rigid "Answer: A" template the strict matcher wants; the flex score is the real quality signal. Identical chain config to HE/MBPP (--reasoning-budget 8192, sampler greedy temperature=0, max_length=32768); std runs --parallel 2 -c 65536, MTP runs --spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384. Wall time: std 4 h 55 min, MTP 4 h 35 min.

The HE exact-match (137/164 ↔ 137/164) and GPQA near-parity (155/198 ↔ 154/198 — single-question delta well inside the ±2.94 % stderr on 198 samples) are the headline quality claims: MTP is statistically indistinguishable from std on this model. The +2 pp MBPP delta (10 problems out of 500) is at the edge of the ±2 pp rescore-stderr band and may still be (a) real because MTP's token-emission order under verification differs subtly even under greedy decoding due to tie-breaking, (b) sampling noise, or (c) an artifact of the think-strip rescore parser interacting differently with the two streams — treat it as suggestive only. The throughput win (2.0-2.3 ×) is the operational headline.

Why the speedup is 2× rather than 4×

The MTP head's draft acceptance rate measured on HumanEval was ~81 % (#acc 7678 / #gen 9478). On paper a 4-token draft (--spec-draft-n-max 3 plus the verifier-implicit base token = 4 total) at 81 % acceptance gives a per-slot speedup of ~3-4 × over single-slot non-spec decoding. We observe that exactly: MTP at 60.22 tok/s vs std-1-slot 14.9 tok/s (= std-aggregate 29.85 ÷ 2 slots) is a 4.0 × per-slot win. The 2 × aggregate speedup is because the std baseline runs --parallel 2 (two concurrent slots sharing the GPU), whereas MTP fits at --parallel 1 only on a 24 GB GPU. On a larger GPU where MTP can also run --parallel 2, the aggregate would track the per-slot 4 ×.

For single-request latency (interactive chat, code assistants, agent loops) MTP delivers the full 4 × benefit on this GPU class.

GPQA holds the same ratio (2.16 ×) despite producing much longer reasoning-tail completions — the 81 % HE acceptance rate generalizes to reasoning-heavy CoT outputs more cleanly than the cautious 1.5-2 × estimate in the earlier draft. This is the empirically validated speedup on this GPU class for both code and CoT-reasoning workloads.

Known Limitations

IQ2_M MTP head is Q4_K, not IQ2_S. See "IQ2_M MTP-head override" above. The MTP draft head receives Q4_K (imatrix-free) treatment while the rest of the model is true IQ2_M. Functionally indistinguishable from a "pure" IQ2_M for inference; the file is ~180 MB larger than the std-release IQ2_M (9.54 GiB here vs 9.32 GiB std). For interactive use this means slightly higher draft-head memory footprint, but no measurable change in speedup or quality versus a hypothetical fully-IQ2_M MTP build (which can't currently be produced without a draft-mode imatrix calibration pass).
MBPP delta vs std is not yet a confirmed quality win. See "Benchmark Results" note above — +2 pp could be noise. Quality claim is "indistinguishable from std on HE-164, near-parity on GPQA Diamond-198 (within stderr), suggestive-but-unconfirmed +2 pp on MBPP-500".
GPQA Diamond is the verified long-form reasoning data point (added 2026-05-22 in T89). MTP holds the 2.16 × speedup observed on code benches while staying statistically tied with std (154/198 vs 155/198 = ∆ −0.5 pp, well inside ±2.94 % stderr). The earlier "1.5-2 × per-slot expected on reasoning" estimate was conservative — the 81 % acceptance rate from HumanEval generalizes cleanly to GPQA's 5-15 k decode-token reasoning tails on this specific model. This may not hold on other reasoning models — re-measure if you swap weights.
Tied to a specific llama.cpp commit. All numbers in this card are measured on commit bb28c1f of llama.cpp master (post-PR #22673). Future llama.cpp updates may change the per-token throughput (better KV-attention kernels, etc.); the absolute tok/s numbers should be read as a relative comparison against std on the same commit, not as an absolute prediction for other versions.
24 GB GPU class only. All measurements are on a single RTX 3090. On smaller GPUs (16 GB) the MTP path won't fit at all — drop to a smaller quant or use the standard release. On larger GPUs (32 GB+, e.g. RTX 4090 Pro, L40, A100), the MTP path can run --parallel 2 like std and should track the per-slot 4 × win in aggregate too — but we haven't validated that here.
MTP-only consumer at time of publish: llama.cpp. Other backends are not yet wired up to the mtp.* tensors:
- Ollama has PR #15980 in active development but no stable release. Loading this GGUF in Ollama today gives standard decode (no speedup, no error).
- Llamafile has no MTP support; discussion #632 is open. Same fallback behavior — works at standard decode.
- vLLM / SGLang / TGI: do not load the mtp.* head from this GGUF. Use the source HF safetensors model with the appropriate MTP-aware engine if those backends gain support.
Vision tower compatibility unverified for MTP path. This repo contains the text-only MTP weights. For multimodal inference the standard bartowski/Qwen_Qwen3.6-27B-GGUF mmproj projector should still work, but we haven't validated that MTP draft acceptance holds when the prompt includes vision tokens. If you hit issues with --mmproj + --spec-type draft-mtp together, drop the spec flag.
max_length gotcha when reproducing. lm-eval's local-completions defaults max_length=2048 which truncates MBPP 3-shot prompts and any reasoning-budget-8192 GPQA prompt below zero residual budget → llama-server returns [invalid]. Our chain script sets max_length=32768 explicitly. If you swap in your own eval invocation, set this or the model will appear to score 0 %.

Reproducing the eval

The full chain (download std Q6_K → bench std → bench MTP → score → throughput-parse) is committed in omnimergekit/scripts/pod_v4_q6k_eval_chain.sh. Key gotchas baked in:

lm-eval local-completions defaults max_length=2048 which truncates MBPP 3-shot prompts (and any GPQA prompt) below the prompt size — leaves max_gen_toks budget negative → server returns [invalid] sentinel. The chain script sets max_length=32768 explicitly. Without this, MBPP/GPQA score 0 % despite the model working fine. Confirmed bug on lm-eval 0.4.11.
MTP server requires reduced ctx on 24 GB GPUs — -c 16384 --parallel 1 fits Q6_K (21 GB) + draft KV at ≈ 23.5 GB. Default -c 65536 --parallel 2 OOMs.
Rescore is mandatory — the raw lm_eval pass@1 under-reports by 5-10 pp because exec(prompt + "<think>...</think>" + code) SyntaxErrors. The chain's rescore_strip_think.py recovers the real number.

Original Model Card

Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)

Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.

GPQA Diamond: partial result (192/198 cached, 177 matched, ≈ 84.75% pass@1). See note below — final result blocked by an aiohttp lifecycle bug in lm_eval's local-completions adapter that consistently crashes the eval on the last 6 reasoning-tail questions where responses run 9+ minutes each. HumanEval and MBPP are final.

Quantizations

Three release lines:

GGUF (`llama.cpp` / `ollama` / `text-generation-webui`)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5. imatrix.dat archived alongside the quants for reproducibility/audit.

Also published as ollama tags: mannix/omnimerge-v4.

The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).

MLX 4-bit — text-only (Apple Silicon)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit — text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads via mlx_lm.load. Use this if you don't need vision and want a slightly smaller download.

from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))

MLX 4-bit — Vision-Language (Apple Silicon, multimodal)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit — full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads via mlx_vlm.load. Use this for image + video input.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

prompt = apply_chat_template(processor, config,
    "Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, image=["path/to/image.png"]))

Sources

Source	Weight	Role
Qwen/Qwen3.6-27B	base	base + chat template
rico03/Qwen3.6-27B-rico03	0.40	general capability
ValiantLabs/Qwen3.6-27B-Esper3.1	0.35	code + reasoning
kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor)	0.25	reasoning anchor

Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.

Benchmark Results (Q6_K quantization)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0 except GPQA at 0.6 to match v2's published methodology.

v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)

All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs.

Benchmark	Qwen3.6 base Q6_K (bartowski)	Omnimerge-v2 (Qwen3.5 base)	Omnimerge-v4-MLP (Qwen3.6 base)	Δ vs base	Δ vs v2
HumanEval pass@1 (164q)	84.76% (139/164)	79.27%	84.76% (139/164)	0.00 pp	+5.49 pp
MBPP pass@1 (500q) — raw lm_eval	56.20%	n/a	68.40%	+12.20 pp	n/a
MBPP pass@1 (500q) — corrected*	57.60%	74.60%	73.40%	+15.80 pp	−1.20 pp
GPQA Diamond pass@1 (flex) — see ‡	not measured (∇)	69.19% (full 198q)	≈ 84.75% (partial 177q)	—	≈ +15.5 pp

Key observations:

HumanEval is identical to base (bit-for-bit: 139/164 = 0.847560975...). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas don't move the needle. This is also a strong sanity-check: it confirms our MLP-passthrough surgery did its job — the model's "elementary coding" behavior is byte-identical to the base it inherited MLPs from.
MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
GPQA is the marquee win — ≈ +15.5 pp over v2 (which itself was +16 pp over its source models). The Qwen3.6 base brings stronger reasoning, and the merge preserves and slightly amplifies it.

∇ Skipped a base GPQA run because (a) v2's published GPQA is the canonical reference for "is this merge valuable?" — that's what we benchmark against, and (b) the same aiohttp lifecycle bug that bit our v4-MLP run would have bit a base run too.

* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.

v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.

Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.

‡ GPQA partial result (important caveat): the full lm_eval run completed 192/198 questions before crashing repeatedly on the last 6. Root cause is an aiohttp lifecycle issue in lm_eval.models.api_models.amodel_call: the at-budget reasoning responses (16384 tokens × ~9 minutes wall time) consistently outlast the aiohttp ClientSession and the resulting RuntimeError: Session is closed is unrecoverable within the same process. We restarted lm_eval 5 times across a ~4-hour window; each restart gained ~1 question before crashing on the same long-tail. Final 6 questions were not scored. The 84.75% is computed by scripts/score_gpqa_partial.py which replicates lm_eval's exact multi_choice_regex flexible-extract filter (group_select=−1, ignore_case=True, ignore_punctuation=True) over the 192 cached responses. Of those, 177 prompts matched our process_docs-replicated GPQA prompts (the 15 unmatched are minor unicode-normalization or seed-timing artifacts in the reconstruction; the 6 uncached are the at-budget tail). 150/177 correct → 84.75% partial pass@1. The unmatched 15 + uncached 6 are unlikely to swing the headline number more than ±1 pp; final result will land in the 82-86% band. We also separately patched lm_eval's api_models.py:545 UnboundLocalError bug as a prerequisite (it crashes on transient TimeoutError before outputs is assigned) — see scripts/score_gpqa_partial.py and the inline patch recipe in this repo's commit history for the exact replication.

Why "MLP-passthrough"

When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.

We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.

Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations

Test	Clean Qwen3.6 base	v4 (full merge, broken)	v4-MLP (this model)
`<think>` open rate (mbpp-10 isolation)	40%	80%	0%
Unclosed `</think>`	0/4	88% of opens	0/10
MBPP pass@1 (mbpp-10 isolation)	40%	20%	50%
Empty response (chat-completions)	low	80%	0/10

Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.

The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks — but full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval ~5 pp over Qwen3.5-Omnimerge-v2 while staying tied on MBPP.

Compatibility

Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).

Inference works under:

transformers (BF16) — both use_cache=True and False paths
llama.cpp (GGUF) — recommended args: --reasoning-format deepseek --reasoning-budget 8192
vLLM (untested at time of publish, expected to work)

Scripts

All merge tooling is in the scripts/ directory of this repo:

Script	Purpose
`dare_ties_merge.py`	Main merger. `--method omnimerge_v2` is the published method. Auto-detects Qwen3.6 base via `config.output_gate_type` and auto-applies `--skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj'` (override with `--no-auto-mlp-skip`).
`v4_mlp_passthrough.py`	Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe — see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6.
`inspect_v4_delta.py`	Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region.
`pod_omnimerge_v4_build.sh`	Full reproducible build script (download sources, run merge, convert + quantize Q6_K).
`pod_omnimerge_v4mlp_eval_raw.sh`	Eval orchestrator: mbpp + humaneval via raw `/v1/completions`. Required for reasoning-tag-emitting models — `apply_chat_template` + deepseek extraction strips think blocks and returns empty.
`rescore_mbpp_strip_think.py`	Re-scoring tool that strips `<think>` blocks and markdown fences before `exec(code+tests)`. Recovered 25 of 158 false failures on this model's mbpp run.
`score_gpqa_partial.py`	Partial-cache GPQA scorer. Replicates lm_eval's `multi_choice_regex` flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's `hash_args("generate_until", [prompt, gen_kwargs])` SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail.
`pod_v4mlp_gpqa.sh`	Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology).

Reproducing the merge

python scripts/dare_ties_merge.py \
    --method omnimerge_v2 \
    --base /path/to/Qwen3.6-27B \
    --source /path/to/Qwen3.6-rico03 \
    --source /path/to/Qwen3.6-Esper3.1 \
    --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
    --weights 0.40,0.35,0.25 \
    --density 0.53 \
    --darex-q 0.75 \
    --output ./Qwen3.6-27B-Omnimerge-v4 \
    --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)

Caveats

Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).

Acknowledgements

Qwen team for the Qwen3.6 base
rico03, ValiantLabs, kai-os for the fine-tunes
DARE / TIES / DARE-TIES authors and the arcee-ai/mergekit community

Downloads last month: 8,053

GGUF

Model size

27B params

Architecture

qwen35

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF

Base model

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4

Quantized

(5)

this model