Instructions to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF", filename="Qwen3.6-27B-Omnimerge-v4-F16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Use Docker
docker model run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Ollama:
ollama run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
- Unsloth Studio
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF to start chatting
- Pi
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Docker Model Runner:
docker model run hf.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
- Lemonade
How to use ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.6-27B-Omnimerge-v4-MTP-GGUF-Q4_K_M
List all available models
lemonade list
Qwen3.6-27B-Omnimerge-v4-MTP-GGUF
GGUF quantizations of ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 with the MTP (Multi-Token Prediction) head retained for self-speculative decoding on llama.cpp mainline (PR #22673, merged 2026-05-16) and later.
Companion to the standard-decode release at ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF. The two repos contain identical merged weights — this one keeps the additional mtp.* tensors that convert_hf_to_gguf.py remaps to blk.{num_hidden_layers}.* per llama.cpp PR #22673 ("llama + spec: MTP Support", merged 2026-05-16), so --spec-type draft-mtp works out of the box. All quants made with imatrix using bartowski's calibration_datav5; imatrix.dat archived alongside the quants for reproducibility/audit.
Available Quantizations
| Quantization | Size (GiB) | Notes |
|---|---|---|
| F16 | 50.90 | full-precision reference |
| Q8_0 | 27.05 | |
| Q6_K | 20.89 | recommended speed/quality balance |
| Q5_K_M | 18.19 | |
| Q4_K_M | 15.66 | |
| IQ4_XS | 14.26 | |
| IQ3_M | 11.89 | |
| IQ2_M | 9.54 | MTP head forced to Q4_K — see note below |
IQ2_M MTP-head override. The 7 MTP-head tensors (
blk.64.attn_{k,q,v,output}.weight,blk.64.ffn_{down,gate,up}.weight,blk.64.nextn.eh_proj.weight) are overridden toQ4_Kinstead of the K-mix's default IQ2_S/IQ3_S. Reason:llama-imatrixonly activates the standard text-decode path, so the MTP draft head accumulates zero importance entries;llama-quantizethen refuses very-low-bit quants (IQ1_*, IQ2_*) on those tensors and bails mid-write (producing a deceptively size-correct but zero-header file). Forcing the MTP block to Q4_K is the cheap workaround — adds ~180 MB versus pure IQ2_S (9.54 GiB vs std-release IQ2_M ~9.32 GiB), keeps the file complete and the MTP path bit-exact. The other 859 tensors retain the IQ2_M K-mix. Recipe (rebuild fromF16in this repo):llama-quantize --imatrix imatrix.dat \ --tensor-type blk.64.attn_k.weight=q4_K \ --tensor-type blk.64.attn_q.weight=q4_K \ --tensor-type blk.64.attn_output.weight=q4_K \ --tensor-type blk.64.ffn_down.weight=q4_K \ --tensor-type blk.64.ffn_gate.weight=q4_K \ --tensor-type blk.64.ffn_up.weight=q4_K \ --tensor-type blk.64.nextn.eh_proj.weight=q4_K \ Qwen3.6-27B-Omnimerge-v4-F16.gguf Qwen3.6-27B-Omnimerge-v4-IQ2_M.gguf IQ2_MTiers from IQ3_M and up don't have the strict imatrix requirement; the MTP block falls back gracefully without overrides.
Also published as ollama tags: mannix/omnimerge-v4-mtp.
How to Use — MTP speculative decoding
Stock llama.cpp containing PR #22673 ("llama + spec: MTP Support", merged 2026-05-16) or later. Confirmed working on commit bb28c1f. Older commits without this PR will load the weights but ignore the mtp.* head — you'll get standard decode with no error, just no speedup.
llama-server (recommended)
llama-server -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
-c 16384 -ngl 99 \
--parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--reasoning-format deepseek --reasoning-budget 8192 \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--port 8099
Key flags:
--spec-type draft-mtp— enables MTP self-speculative decoding using the includedmtp.*head.--spec-draft-n-max 3— how many tokens the MTP head proposes per step. 3 is the sweet spot; higher values increase verification cost without gaining acceptance.-c 16384 --parallel 1— tuned for a 24 GB GPU (e.g. RTX 3090) running the Q6_K weights (≈ 21 GB) + draft buffer + KV. Bump-cto 32768+ and--parallel 2on a 32 GB+ GPU.
llama-cli
llama-cli -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
-p "Write a Python function that ..." -n 512
Without MTP
This repo also works as a drop-in for the standard release — just omit the --spec-type flag. You'll get identical pass@1 quality at standard decode speed.
Benchmark Results (Q6_K, MTP vs standard)
All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama-server running this Q6_K against the identical-weights Q6_K from the standard-decode release. Two configs evaluated:
- std = standard release Q6_K,
--parallel 2 -c 65536(no spec) - MTP = this release Q6_K,
--spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384
Both use --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0. Pass@1 is the lm_eval rescored number after <think>...</think>-block stripping (necessary for this reasoning model — raw lm_eval exec(prompt + completion + tests) SyntaxErrors on the literal < in <think>).
Decode tokens/sec is the aggregate decode throughput as measured by the per-completion print_timing lines on llama-server stderr (sum-of-decode-tokens ÷ sum-of-decode-seconds across the bench).
| Benchmark | std Q6_K pass@1 | MTP Q6_K pass@1 | std tok/s (agg) | MTP tok/s (agg) | MTP speedup |
|---|---|---|---|---|---|
| HumanEval (164q, 0-shot) | 83.54 % (137/164) | 83.54 % (137/164) | 29.85 | 60.22 | 2.02 × |
| MBPP (500q, 3-shot) | 73.00 % (365/500) | 75.00 % (375/500) | 24.33 | 56.75 | 2.33 × |
| GPQA Diamond (198q, 0-shot CoT)† | 78.28 % (155/198) | 77.78 % (154/198) | 26.24 | 56.59 | 2.16 × |
† GPQA Diamond reported as flexible-extract (the canonical metric — the model's chain-of-thought ends in a free-form answer that the regex extractor parses). Companion strict-match (exact final-token match) is 7.58 % std / 9.60 % MTP — both quite low because the model emits CoT verbosely rather than the rigid "Answer: A" template the strict matcher wants; the flex score is the real quality signal. Identical chain config to HE/MBPP (--reasoning-budget 8192, sampler greedy temperature=0, max_length=32768); std runs --parallel 2 -c 65536, MTP runs --spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384. Wall time: std 4 h 55 min, MTP 4 h 35 min.
The HE exact-match (137/164 ↔ 137/164) and GPQA near-parity (155/198 ↔ 154/198 — single-question delta well inside the ±2.94 % stderr on 198 samples) are the headline quality claims: MTP is statistically indistinguishable from std on this model. The +2 pp MBPP delta (10 problems out of 500) is at the edge of the ±2 pp rescore-stderr band and may still be (a) real because MTP's token-emission order under verification differs subtly even under greedy decoding due to tie-breaking, (b) sampling noise, or (c) an artifact of the think-strip rescore parser interacting differently with the two streams — treat it as suggestive only. The throughput win (2.0-2.3 ×) is the operational headline.
Why the speedup is 2× rather than 4×
The MTP head's draft acceptance rate measured on HumanEval was ~81 % (#acc 7678 / #gen 9478). On paper a 4-token draft (--spec-draft-n-max 3 plus the verifier-implicit base token = 4 total) at 81 % acceptance gives a per-slot speedup of ~3-4 × over single-slot non-spec decoding. We observe that exactly: MTP at 60.22 tok/s vs std-1-slot 14.9 tok/s (= std-aggregate 29.85 ÷ 2 slots) is a 4.0 × per-slot win. The 2 × aggregate speedup is because the std baseline runs --parallel 2 (two concurrent slots sharing the GPU), whereas MTP fits at --parallel 1 only on a 24 GB GPU. On a larger GPU where MTP can also run --parallel 2, the aggregate would track the per-slot 4 ×.
For single-request latency (interactive chat, code assistants, agent loops) MTP delivers the full 4 × benefit on this GPU class.
GPQA holds the same ratio (2.16 ×) despite producing much longer reasoning-tail completions — the 81 % HE acceptance rate generalizes to reasoning-heavy CoT outputs more cleanly than the cautious 1.5-2 × estimate in the earlier draft. This is the empirically validated speedup on this GPU class for both code and CoT-reasoning workloads.
Known Limitations
IQ2_M MTP head is Q4_K, not IQ2_S. See "IQ2_M MTP-head override" above. The MTP draft head receives Q4_K (imatrix-free) treatment while the rest of the model is true IQ2_M. Functionally indistinguishable from a "pure" IQ2_M for inference; the file is ~180 MB larger than the std-release IQ2_M (9.54 GiB here vs 9.32 GiB std). For interactive use this means slightly higher draft-head memory footprint, but no measurable change in speedup or quality versus a hypothetical fully-IQ2_M MTP build (which can't currently be produced without a draft-mode imatrix calibration pass).
MBPP delta vs std is not yet a confirmed quality win. See "Benchmark Results" note above — +2 pp could be noise. Quality claim is "indistinguishable from std on HE-164, near-parity on GPQA Diamond-198 (within stderr), suggestive-but-unconfirmed +2 pp on MBPP-500".
GPQA Diamond is the verified long-form reasoning data point (added 2026-05-22 in T89). MTP holds the 2.16 × speedup observed on code benches while staying statistically tied with std (154/198 vs 155/198 = ∆ −0.5 pp, well inside ±2.94 % stderr). The earlier "1.5-2 × per-slot expected on reasoning" estimate was conservative — the 81 % acceptance rate from HumanEval generalizes cleanly to GPQA's 5-15 k decode-token reasoning tails on this specific model. This may not hold on other reasoning models — re-measure if you swap weights.
Tied to a specific llama.cpp commit. All numbers in this card are measured on commit
bb28c1fof llama.cpp master (post-PR #22673). Future llama.cpp updates may change the per-token throughput (better KV-attention kernels, etc.); the absolute tok/s numbers should be read as a relative comparison against std on the same commit, not as an absolute prediction for other versions.24 GB GPU class only. All measurements are on a single RTX 3090. On smaller GPUs (16 GB) the MTP path won't fit at all — drop to a smaller quant or use the standard release. On larger GPUs (32 GB+, e.g. RTX 4090 Pro, L40, A100), the MTP path can run
--parallel 2like std and should track the per-slot 4 × win in aggregate too — but we haven't validated that here.MTP-only consumer at time of publish: llama.cpp. Other backends are not yet wired up to the
mtp.*tensors:- Ollama has PR #15980 in active development but no stable release. Loading this GGUF in Ollama today gives standard decode (no speedup, no error).
- Llamafile has no MTP support; discussion #632 is open. Same fallback behavior — works at standard decode.
- vLLM / SGLang / TGI: do not load the
mtp.*head from this GGUF. Use the source HF safetensors model with the appropriate MTP-aware engine if those backends gain support.
Vision tower compatibility unverified for MTP path. This repo contains the text-only MTP weights. For multimodal inference the standard
bartowski/Qwen_Qwen3.6-27B-GGUFmmproj projector should still work, but we haven't validated that MTP draft acceptance holds when the prompt includes vision tokens. If you hit issues with--mmproj+--spec-type draft-mtptogether, drop the spec flag.max_lengthgotcha when reproducing. lm-eval'slocal-completionsdefaultsmax_length=2048which truncates MBPP 3-shot prompts and any reasoning-budget-8192 GPQA prompt below zero residual budget → llama-server returns[invalid]. Our chain script setsmax_length=32768explicitly. If you swap in your own eval invocation, set this or the model will appear to score 0 %.
Reproducing the eval
The full chain (download std Q6_K → bench std → bench MTP → score → throughput-parse) is committed in omnimergekit/scripts/pod_v4_q6k_eval_chain.sh. Key gotchas baked in:
- lm-eval
local-completionsdefaultsmax_length=2048which truncates MBPP 3-shot prompts (and any GPQA prompt) below the prompt size — leavesmax_gen_toksbudget negative → server returns[invalid]sentinel. The chain script setsmax_length=32768explicitly. Without this, MBPP/GPQA score 0 % despite the model working fine. Confirmed bug on lm-eval 0.4.11. - MTP server requires reduced ctx on 24 GB GPUs —
-c 16384 --parallel 1fits Q6_K (21 GB) + draft KV at ≈ 23.5 GB. Default-c 65536 --parallel 2OOMs. - Rescore is mandatory — the raw lm_eval pass@1 under-reports by 5-10 pp because
exec(prompt + "<think>...</think>" + code)SyntaxErrors. The chain'srescore_strip_think.pyrecovers the real number.
Original Model Card
Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)
Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.
GPQA Diamond: partial result (192/198 cached, 177 matched, ≈ 84.75% pass@1). See note below — final result blocked by an aiohttp lifecycle bug in
lm_eval'slocal-completionsadapter that consistently crashes the eval on the last 6 reasoning-tail questions where responses run 9+ minutes each. HumanEval and MBPP are final.
Quantizations
Three release lines:
GGUF (llama.cpp / ollama / text-generation-webui)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF— 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5.imatrix.datarchived alongside the quants for reproducibility/audit.
Also published as ollama tags: mannix/omnimerge-v4.
The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).
MLX 4-bit — text-only (Apple Silicon)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit— text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads viamlx_lm.load. Use this if you don't need vision and want a slightly smaller download.
from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))
MLX 4-bit — Vision-Language (Apple Silicon, multimodal)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit— full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads viamlx_vlm.load. Use this for image + video input.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)
prompt = apply_chat_template(processor, config,
"Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
max_tokens=512, verbose=True, image=["path/to/image.png"]))
Sources
| Source | Weight | Role |
|---|---|---|
| Qwen/Qwen3.6-27B | base | base + chat template |
| rico03/Qwen3.6-27B-rico03 | 0.40 | general capability |
| ValiantLabs/Qwen3.6-27B-Esper3.1 | 0.35 | code + reasoning |
| kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) | 0.25 | reasoning anchor |
Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
Benchmark Results (Q6_K quantization)
All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0 except GPQA at 0.6 to match v2's published methodology.
v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)
All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs.
| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
|---|---|---|---|---|---|
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 84.76% (139/164) | 0.00 pp | +5.49 pp |
| MBPP pass@1 (500q) — raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |
| MBPP pass@1 (500q) — corrected* | 57.60% | 74.60% | 73.40% | +15.80 pp | −1.20 pp |
| GPQA Diamond pass@1 (flex) — see ‡ | not measured (∇) | 69.19% (full 198q) | ≈ 84.75% (partial 177q) | — | ≈ +15.5 pp |
Key observations:
- HumanEval is identical to base (bit-for-bit: 139/164 = 0.847560975...). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas don't move the needle. This is also a strong sanity-check: it confirms our MLP-passthrough surgery did its job — the model's "elementary coding" behavior is byte-identical to the base it inherited MLPs from.
- MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
- GPQA is the marquee win — ≈ +15.5 pp over v2 (which itself was +16 pp over its source models). The Qwen3.6 base brings stronger reasoning, and the merge preserves and slightly amplifies it.
∇ Skipped a base GPQA run because (a) v2's published GPQA is the canonical reference for "is this merge valuable?" — that's what we benchmark against, and (b) the same aiohttp lifecycle bug that bit our v4-MLP run would have bit a base run too.
* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.
- v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
- Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
- v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.
Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
‡ GPQA partial result (important caveat): the full lm_eval run completed 192/198 questions before crashing repeatedly on the last 6. Root cause is an aiohttp lifecycle issue in lm_eval.models.api_models.amodel_call: the at-budget reasoning responses (16384 tokens × ~9 minutes wall time) consistently outlast the aiohttp ClientSession and the resulting RuntimeError: Session is closed is unrecoverable within the same process. We restarted lm_eval 5 times across a ~4-hour window; each restart gained ~1 question before crashing on the same long-tail. Final 6 questions were not scored. The 84.75% is computed by scripts/score_gpqa_partial.py which replicates lm_eval's exact multi_choice_regex flexible-extract filter (group_select=−1, ignore_case=True, ignore_punctuation=True) over the 192 cached responses. Of those, 177 prompts matched our process_docs-replicated GPQA prompts (the 15 unmatched are minor unicode-normalization or seed-timing artifacts in the reconstruction; the 6 uncached are the at-budget tail). 150/177 correct → 84.75% partial pass@1. The unmatched 15 + uncached 6 are unlikely to swing the headline number more than ±1 pp; final result will land in the 82-86% band. We also separately patched lm_eval's api_models.py:545 UnboundLocalError bug as a prerequisite (it crashes on transient TimeoutError before outputs is assigned) — see scripts/score_gpqa_partial.py and the inline patch recipe in this repo's commit history for the exact replication.
Why "MLP-passthrough"
When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.
We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations
| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
|---|---|---|---|
<think> open rate (mbpp-10 isolation) |
40% | 80% | 0% |
Unclosed </think> |
0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | low | 80% | 0/10 |
Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.
The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks — but full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval ~5 pp over Qwen3.5-Omnimerge-v2 while staying tied on MBPP.
Compatibility
Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).
Inference works under:
transformers(BF16) — bothuse_cache=TrueandFalsepathsllama.cpp(GGUF) — recommended args:--reasoning-format deepseek --reasoning-budget 8192- vLLM (untested at time of publish, expected to work)
Scripts
All merge tooling is in the scripts/ directory of this repo:
| Script | Purpose |
|---|---|
dare_ties_merge.py |
Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip). |
v4_mlp_passthrough.py |
Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe — see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6. |
inspect_v4_delta.py |
Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region. |
pod_omnimerge_v4_build.sh |
Full reproducible build script (download sources, run merge, convert + quantize Q6_K). |
pod_omnimerge_v4mlp_eval_raw.sh |
Eval orchestrator: mbpp + humaneval via raw /v1/completions. Required for reasoning-tag-emitting models — apply_chat_template + deepseek extraction strips think blocks and returns empty. |
rescore_mbpp_strip_think.py |
Re-scoring tool that strips <think> blocks and markdown fences before exec(code+tests). Recovered 25 of 158 false failures on this model's mbpp run. |
score_gpqa_partial.py |
Partial-cache GPQA scorer. Replicates lm_eval's multi_choice_regex flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's hash_args("generate_until", [prompt, gen_kwargs]) SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail. |
pod_v4mlp_gpqa.sh |
Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology). |
Reproducing the merge
python scripts/dare_ties_merge.py \
--method omnimerge_v2 \
--base /path/to/Qwen3.6-27B \
--source /path/to/Qwen3.6-rico03 \
--source /path/to/Qwen3.6-Esper3.1 \
--source /path/to/Qwen3.6-Opus-Reasoning-anchor \
--weights 0.40,0.35,0.25 \
--density 0.53 \
--darex-q 0.75 \
--output ./Qwen3.6-27B-Omnimerge-v4 \
--seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)
Caveats
- Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw
/v1/completionsfor code benchmarks; chat-completions +--apply_chat_template+ deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. Seepod_omnimerge_v4mlp_eval_raw.shfor the working config. - MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).
Acknowledgements
- Qwen team for the Qwen3.6 base
- rico03, ValiantLabs, kai-os for the fine-tunes
- DARE / TIES / DARE-TIES authors and the arcee-ai/mergekit community
- Downloads last month
- 8,053
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF
Base model
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4