Qwen3.6-27B-Omnimerge-v4-MTP-GGUF

GGUF quantizations of ManniX-ITA/Qwen3.6-27B-Omnimerge-v4 with the MTP (Multi-Token Prediction) head retained for self-speculative decoding on llama.cpp mainline (PR #22673, merged 2026-05-16) and later.

Companion to the standard-decode release at ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF. The two repos contain identical merged weights — this one keeps the additional mtp.* tensors that convert_hf_to_gguf.py remaps to blk.{num_hidden_layers}.* per llama.cpp PR #22673 ("llama + spec: MTP Support", merged 2026-05-16), so --spec-type draft-mtp works out of the box. All quants made with imatrix using bartowski's calibration_datav5; imatrix.dat archived alongside the quants for reproducibility/audit.

Available Quantizations

| Quantization | Size (GiB) | Notes |
|---|---|---|
| F16 | 50.90 | full-precision reference |
| Q8_0 | 27.05 | |
| Q6_K | 20.89 | recommended speed/quality balance |
| Q5_K_M | 18.19 | |
| Q4_K_M | 15.66 | |
| IQ4_XS | 14.26 | |
| IQ3_M | 11.89 | |
| IQ2_M | 9.54 | MTP head forced to Q4_K (see note below) |

IQ2_M MTP-head override. The 7 MTP-head tensors (blk.64.attn_{k,q,output}.weight, blk.64.ffn_{down,gate,up}.weight, blk.64.nextn.eh_proj.weight) are overridden to Q4_K instead of the K-mix's default IQ2_S/IQ3_S. Reason: llama-imatrix only exercises the standard text-decode path, so the MTP draft head accumulates no importance entries; llama-quantize then refuses very-low-bit quants (IQ1_*, IQ2_*) on those tensors and aborts mid-write, leaving a file with the expected size but a zeroed header. Forcing the MTP block to Q4_K is the cheap workaround: it adds ~180 MB versus pure IQ2_S (9.54 GiB here vs ~9.32 GiB for the std-release IQ2_M) and keeps the file complete and the MTP path bit-exact. The other 859 tensors retain the IQ2_M K-mix. Recipe (rebuild from F16 in this repo):

llama-quantize --imatrix imatrix.dat \
  --tensor-type blk.64.attn_k.weight=q4_K \
  --tensor-type blk.64.attn_q.weight=q4_K \
  --tensor-type blk.64.attn_output.weight=q4_K \
  --tensor-type blk.64.ffn_down.weight=q4_K \
  --tensor-type blk.64.ffn_gate.weight=q4_K \
  --tensor-type blk.64.ffn_up.weight=q4_K \
  --tensor-type blk.64.nextn.eh_proj.weight=q4_K \
  Qwen3.6-27B-Omnimerge-v4-F16.gguf Qwen3.6-27B-Omnimerge-v4-IQ2_M.gguf IQ2_M

Tiers from IQ3_M and up don't have the strict imatrix requirement; the MTP block falls back gracefully without overrides.
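The override recipe above can also be assembled programmatically, which may be handy when scripting several quant tiers. The following is a minimal Python sketch, assuming the seven tensor names listed in the recipe; the helper name build_quantize_cmd is hypothetical, and the MTP block index is passed in as a parameter rather than read from the GGUF.

```python
import shlex

# Tensor suffixes copied from the recipe above. For this model the MTP block
# sits at index 64 (blk.{num_hidden_layers}).
MTP_SUFFIXES = [
    "attn_k.weight", "attn_q.weight", "attn_output.weight",
    "ffn_down.weight", "ffn_gate.weight", "ffn_up.weight",
    "nextn.eh_proj.weight",
]

def build_quantize_cmd(src, dst, quant="IQ2_M", mtp_block=64,
                       override="q4_K", imatrix="imatrix.dat"):
    """Return a llama-quantize argv with one --tensor-type override per MTP tensor."""
    cmd = ["llama-quantize", "--imatrix", imatrix]
    for suffix in MTP_SUFFIXES:
        cmd += ["--tensor-type", f"blk.{mtp_block}.{suffix}={override}"]
    return cmd + [src, dst, quant]

cmd = build_quantize_cmd("Qwen3.6-27B-Omnimerge-v4-F16.gguf",
                         "Qwen3.6-27B-Omnimerge-v4-IQ2_M.gguf")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once llama-quantize is on PATH; printing it first makes it easy to compare against the hand-written recipe.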

Also published as ollama tags: mannix/omnimerge-v4-mtp.

How to Use — MTP speculative decoding

Requires stock llama.cpp containing PR #22673 ("llama + spec: MTP Support", merged 2026-05-16) or later. Confirmed working on commit bb28c1f. Older builds without this PR will load the weights but ignore the mtp.* head: you get standard decode with no error and no speedup.

llama-server (recommended)

llama-server -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
    -c 16384 -ngl 99 \
    --parallel 1 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --port 8099

Key flags:

  • --spec-type draft-mtp — enables MTP self-speculative decoding using the included mtp.* head.
  • --spec-draft-n-max 3 — how many tokens the MTP head proposes per step. 3 is the sweet spot; higher values increase verification cost without gaining acceptance.
  • -c 16384 --parallel 1 — tuned for a 24 GB GPU (e.g. RTX 3090) running the Q6_K weights (≈ 21 GB) + draft buffer + KV. Bump -c to 32768+ and --parallel 2 on a 32 GB+ GPU.
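Speculative decoding happens entirely inside the server, so a client needs no MTP-specific options. As a rough illustration, the sketch below builds a plain request against the OpenAI-compatible /v1/completions endpoint of the llama-server started above. The host, port 8099 and the helper names are assumptions taken from the example command, not part of any published client.

```python
import json
import urllib.request

def build_completion_request(prompt, host="127.0.0.1", port=8099, max_tokens=512):
    """Build (but do not send) a POST for llama-server's /v1/completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, timeout=600):
    """Send the request and return the generated text (needs a running server)."""
    req = build_completion_request(prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling complete("Write a Python function that reverses a string.\n") against a running server returns the raw completion, including any <think> block the model emits.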

llama-cli

llama-cli -m Qwen3.6-27B-Omnimerge-v4-Q6_K.gguf \
    --spec-type draft-mtp --spec-draft-n-max 3 \
    -p "Write a Python function that ..." -n 512

Without MTP

This repo also works as a drop-in for the standard release — just omit the --spec-type flag. You'll get identical pass@1 quality at standard decode speed.

Benchmark Results (Q6_K, MTP vs standard)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama-server running this Q6_K against the identical-weights Q6_K from the standard-decode release. Two configs evaluated:

  • std = standard release Q6_K, --parallel 2 -c 65536 (no spec)
  • MTP = this release Q6_K, --spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384

Both use --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0. Pass@1 is the lm_eval rescored number after <think>...</think>-block stripping (necessary for this reasoning model — raw lm_eval exec(prompt + completion + tests) SyntaxErrors on the literal < in <think>).

Decode tokens/sec is the aggregate decode throughput as measured by the per-completion print_timing lines on llama-server stderr (sum-of-decode-tokens ÷ sum-of-decode-seconds across the bench).
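The aggregate figure is a ratio of sums, not a mean of per-completion rates. A minimal sketch of that calculation follows; the timing-line layout in the regex is an assumption about the server's stderr format, the sample log is made up for illustration, and this is not the chain's own throughput parser.

```python
import re

# Assumed shape of a decode timing line. Anchoring on leading blanks only
# excludes the "prompt eval time" line that precedes it.
EVAL_LINE = re.compile(
    r"^[ \t]*eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens", re.MULTILINE
)

def aggregate_decode_tps(log_text):
    """Sum of decode tokens divided by sum of decode seconds across the log."""
    total_ms = 0.0
    total_tokens = 0
    for ms, tokens in EVAL_LINE.findall(log_text):
        total_ms += float(ms)
        total_tokens += int(tokens)
    if total_ms == 0:
        raise ValueError("no decode timing lines found")
    return total_tokens / (total_ms / 1000.0)

# Two made-up completions, for illustration only.
sample = """\
prompt eval time =     210.00 ms /   120 tokens
       eval time =   10000.00 ms /   600 tokens
prompt eval time =     190.00 ms /   100 tokens
       eval time =    5000.00 ms /   300 tokens
"""
print(f"{aggregate_decode_tps(sample):.2f} tok/s")
```

Summing before dividing weights long completions by their token count, so a handful of short answers cannot skew the result the way an average of per-request rates would.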

| Benchmark | std Q6_K pass@1 | MTP Q6_K pass@1 | std tok/s (agg) | MTP tok/s (agg) | MTP speedup |
|---|---|---|---|---|---|
| HumanEval (164q, 0-shot) | 83.54 % (137/164) | 83.54 % (137/164) | 29.85 | 60.22 | 2.02 × |
| MBPP (500q, 3-shot) | 73.00 % (365/500) | 75.00 % (375/500) | 24.33 | 56.75 | 2.33 × |
| GPQA Diamond (198q, 0-shot CoT)† | 78.28 % (155/198) | 77.78 % (154/198) | 26.24 | 56.59 | 2.16 × |

† GPQA Diamond reported as flexible-extract (the canonical metric — the model's chain-of-thought ends in a free-form answer that the regex extractor parses). Companion strict-match (exact final-token match) is 7.58 % std / 9.60 % MTP — both quite low because the model emits CoT verbosely rather than the rigid "Answer: A" template the strict matcher wants; the flex score is the real quality signal. Identical chain config to HE/MBPP (--reasoning-budget 8192, sampler greedy temperature=0, max_length=32768); std runs --parallel 2 -c 65536, MTP runs --spec-type draft-mtp --spec-draft-n-max 3 --parallel 1 -c 16384. Wall time: std 4 h 55 min, MTP 4 h 35 min.

The HE exact-match (137/164 ↔ 137/164) and GPQA near-parity (155/198 ↔ 154/198 — single-question delta well inside the ±2.94 % stderr on 198 samples) are the headline quality claims: MTP is statistically indistinguishable from std on this model. The +2 pp MBPP delta (10 problems out of 500) is at the edge of the ±2 pp rescore-stderr band and may still be (a) real because MTP's token-emission order under verification differs subtly even under greedy decoding due to tie-breaking, (b) sampling noise, or (c) an artifact of the think-strip rescore parser interacting differently with the two streams — treat it as suggestive only. The throughput win (2.0-2.3 ×) is the operational headline.
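The stderr bands quoted above can be sanity-checked from the pass counts alone. The sketch below assumes the figures are the ordinary standard error of a pass/fail mean with the sample (n − 1) variance; that is an assumption about how the numbers were derived, and the function name is hypothetical.

```python
import math

def pass_at_1_stderr(correct, total):
    """Standard error of a pass/fail mean, using the sample (n - 1) variance."""
    p = correct / total
    return math.sqrt(p * (1.0 - p) / (total - 1))

# (benchmark, std correct, MTP correct, question count) from the table above.
rows = [("HumanEval", 137, 137, 164),
        ("MBPP", 365, 375, 500),
        ("GPQA Diamond", 155, 154, 198)]

for name, std, mtp, n in rows:
    delta_pp = 100.0 * (mtp - std) / n
    se_pp = 100.0 * pass_at_1_stderr(std, n)
    print(f"{name}: delta {delta_pp:+.2f} pp, std stderr +/-{se_pp:.2f} pp")
```

A delta smaller than one stderr is not distinguishable from noise on a single run; a delta close to one stderr, as on MBPP, is the "suggestive only" case.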

Why the speedup is 2× rather than 4×

The MTP head's draft acceptance rate measured on HumanEval was ~81 % (#acc 7678 / #gen 9478). On paper a 4-token draft (--spec-draft-n-max 3 plus the verifier-implicit base token = 4 total) at 81 % acceptance gives a per-slot speedup of ~3-4 × over single-slot non-spec decoding. We observe that exactly: MTP at 60.22 tok/s vs std-1-slot 14.9 tok/s (= std-aggregate 29.85 ÷ 2 slots) is a 4.0 × per-slot win. The 2 × aggregate speedup is because the std baseline runs --parallel 2 (two concurrent slots sharing the GPU), whereas MTP fits at --parallel 1 only on a 24 GB GPU. On a larger GPU where MTP can also run --parallel 2, the aggregate would track the per-slot 4 ×.
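The arithmetic in this paragraph can be reproduced from the quoted counters. The sketch below is only an idealized model: it reads #gen as the number of drafted tokens, assumes every verification pass drafts exactly --spec-draft-n-max tokens, ignores draft-head and verification overhead, and follows the card's convention of taking per-slot std throughput as the aggregate divided by the slot count.

```python
# Figures quoted in this card.
accepted, drafted = 7678, 9478        # HumanEval draft counters (#acc / #gen)
n_max = 3                             # --spec-draft-n-max
mtp_tps = 60.22                       # MTP, --parallel 1
std_agg_tps, std_slots = 29.85, 2     # std, --parallel 2

acceptance = accepted / drafted

# Assumption: each pass emits the verifier's own token plus the accepted
# share of n_max drafted tokens.
tokens_per_pass = 1 + n_max * acceptance

std_per_slot = std_agg_tps / std_slots
per_slot_speedup = mtp_tps / std_per_slot
aggregate_speedup = mtp_tps / std_agg_tps

print(f"acceptance      {acceptance:.3f}")
print(f"tokens per pass {tokens_per_pass:.2f}")
print(f"per-slot        {per_slot_speedup:.2f}x")
print(f"aggregate       {aggregate_speedup:.2f}x")
```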

For single-request latency (interactive chat, code assistants, agent loops) MTP delivers the full 4 × benefit on this GPU class.

GPQA holds the same ratio (2.16 ×) despite producing much longer reasoning-tail completions — the 81 % HE acceptance rate generalizes to reasoning-heavy CoT outputs more cleanly than the cautious 1.5-2 × estimate in the earlier draft. This is the empirically validated speedup on this GPU class for both code and CoT-reasoning workloads.

Known Limitations

  • IQ2_M MTP head is Q4_K, not IQ2_S. See "IQ2_M MTP-head override" above. The MTP draft head receives Q4_K (imatrix-free) treatment while the rest of the model is true IQ2_M. The override costs ~180 MB versus a pure-IQ2_S head, and the file totals 9.54 GiB versus 9.32 GiB for the std-release IQ2_M. For interactive use this means a slightly higher draft-head memory footprint. We expect no meaningful change in speedup or quality versus a hypothetical fully-IQ2_M MTP build, but cannot measure it: such a build can't currently be produced without a draft-mode imatrix calibration pass.

  • MBPP delta vs std is not yet a confirmed quality win. See "Benchmark Results" note above — +2 pp could be noise. Quality claim is "indistinguishable from std on HE-164, near-parity on GPQA Diamond-198 (within stderr), suggestive-but-unconfirmed +2 pp on MBPP-500".

  • GPQA Diamond is the verified long-form reasoning data point (added 2026-05-22 in T89). MTP delivers a 2.16 × speedup there, in line with the 2.02-2.33 × measured on the code benches, while staying statistically tied with std (154/198 vs 155/198 = ∆ −0.5 pp, well inside ±2.94 % stderr). The earlier "1.5-2 × per-slot expected on reasoning" estimate was conservative: the 81 % acceptance rate from HumanEval appears to generalize to GPQA's 5-15 k decode-token reasoning tails on this specific model. This may not hold on other reasoning models, so re-measure if you swap weights.

  • Tied to a specific llama.cpp commit. All numbers in this card are measured on commit bb28c1f of llama.cpp master (post-PR #22673). Future llama.cpp updates may change the per-token throughput (better KV-attention kernels, etc.); the absolute tok/s numbers should be read as a relative comparison against std on the same commit, not as an absolute prediction for other versions.

  • 24 GB GPU class only. All measurements are on a single RTX 3090. On smaller GPUs (16 GB) the Q6_K MTP configuration won't fit at all; drop to a smaller quant or use the standard release. On larger GPUs (32 GB+, e.g. RTX 5090, L40, A100), the MTP path can run --parallel 2 like std and should track the per-slot 4 × win in aggregate too, but we haven't validated that here.

  • Only MTP-capable consumer at time of publish: llama.cpp. Other backends are not yet wired up to the mtp.* tensors:

    • Ollama has PR #15980 in active development but no stable release. Loading this GGUF in Ollama today gives standard decode (no speedup, no error).
    • Llamafile has no MTP support; discussion #632 is open. Same fallback behavior — works at standard decode.
    • vLLM / SGLang / TGI: do not load the mtp.* head from this GGUF. Use the source HF safetensors model with the appropriate MTP-aware engine if those backends gain support.
  • Vision tower compatibility unverified for MTP path. This repo contains the text-only MTP weights. For multimodal inference the standard bartowski/Qwen_Qwen3.6-27B-GGUF mmproj projector should still work, but we haven't validated that MTP draft acceptance holds when the prompt includes vision tokens. If you hit issues with --mmproj + --spec-type draft-mtp together, drop the spec flag.

  • max_length gotcha when reproducing. lm-eval's local-completions defaults to max_length=2048, which truncates MBPP 3-shot prompts and any GPQA prompt run with --reasoning-budget 8192, leaving a negative residual generation budget; llama-server then returns [invalid]. Our chain script sets max_length=32768 explicitly. If you swap in your own eval invocation, set this too, or the model will appear to score 0 %.

Reproducing the eval

The full chain (download std Q6_K → bench std → bench MTP → score → throughput-parse) is committed in omnimergekit/scripts/pod_v4_q6k_eval_chain.sh. Key gotchas baked in:

  • lm-eval local-completions defaults max_length=2048 which truncates MBPP 3-shot prompts (and any GPQA prompt) below the prompt size — leaves max_gen_toks budget negative → server returns [invalid] sentinel. The chain script sets max_length=32768 explicitly. Without this, MBPP/GPQA score 0 % despite the model working fine. Confirmed bug on lm-eval 0.4.11.
  • MTP server requires reduced ctx on 24 GB GPUs: -c 16384 --parallel 1 fits Q6_K (21 GB) + draft KV at ≈ 23.5 GB. The default -c 65536 --parallel 2 OOMs.
  • Rescore is mandatory — the raw lm_eval pass@1 under-reports by 5-10 pp because exec(prompt + "<think>...</think>" + code) SyntaxErrors. The chain's rescore_strip_think.py recovers the real number.
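To make the max_length gotcha concrete, the sketch below does the budget arithmetic and assembles a --model_args string with max_length set explicitly. It is a simplification under stated assumptions: the 8192-token generation budget is illustrative, lm-eval's internal truncation logic is not reproduced, the model label is hypothetical, and a real invocation may need further arguments (for example a tokenizer).

```python
def context_budget(max_length, max_gen_toks):
    """Tokens left for the prompt after reserving the generation budget."""
    return max_length - max_gen_toks

def local_completions_model_args(base_url, model, max_length=32768):
    """Assemble a comma-separated --model_args string for local-completions."""
    args = {"model": model, "base_url": base_url, "max_length": max_length}
    return ",".join(f"{key}={value}" for key, value in args.items())

for max_length in (2048, 32768):
    budget = context_budget(max_length, max_gen_toks=8192)  # illustrative budget
    status = "ok" if budget > 0 else "prompt cannot fit"
    print(f"max_length={max_length}: prompt budget {budget} tokens ({status})")

print(local_completions_model_args(
    "http://127.0.0.1:8099/v1/completions",
    "Qwen3.6-27B-Omnimerge-v4-Q6_K",   # hypothetical model label
))
```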

Original Model Card

Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)

Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.

GPQA Diamond: partial result (192/198 cached, 177 matched, ≈ 84.75% pass@1). See note below — final result blocked by an aiohttp lifecycle bug in lm_eval's local-completions adapter that consistently crashes the eval on the last 6 reasoning-tail questions where responses run 9+ minutes each. HumanEval and MBPP are final.

Quantizations

Three release lines:

GGUF (llama.cpp / ollama / text-generation-webui)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5. imatrix.dat archived alongside the quants for reproducibility/audit.

Also published as ollama tags: mannix/omnimerge-v4.

The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).

MLX 4-bit — text-only (Apple Silicon)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit — text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads via mlx_lm.load. Use this if you don't need vision and want a slightly smaller download.

from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))

MLX 4-bit — Vision-Language (Apple Silicon, multimodal)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit — full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads via mlx_vlm.load. Use this for image + video input.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

prompt = apply_chat_template(processor, config,
    "Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, image=["path/to/image.png"]))

Sources

| Source | Weight | Role |
|---|---|---|
| Qwen/Qwen3.6-27B | base | base + chat template |
| rico03/Qwen3.6-27B-rico03 | 0.40 | general capability |
| ValiantLabs/Qwen3.6-27B-Esper3.1 | 0.35 | code + reasoning |
| kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) | 0.25 | reasoning anchor |

Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
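For readers unfamiliar with the DARE-TIES base that omnimerge_v2 builds on, here is a toy sketch over flat Python lists. It is a generic illustration only: it omits the OBIM-lite, DAREx q and EMR election stages entirely, it is not the code in dare_ties_merge.py, and the function names and toy tensors are hypothetical.

```python
import random

def dare_sparsify(delta, density, rng):
    """DARE: keep each delta entry with probability `density`, rescale survivors."""
    return [d / density if rng.random() < density else 0.0 for d in delta]

def ties_merge(deltas, weights):
    """TIES-style sign election, then a weighted mean of the agreeing entries."""
    merged = []
    for column in zip(*deltas):
        weighted = [w * d for w, d in zip(weights, column)]
        sign = 1.0 if sum(weighted) >= 0 else -1.0
        kept = [(w, wd) for w, wd in zip(weights, weighted) if wd * sign > 0]
        if kept:
            merged.append(sum(wd for _, wd in kept) / sum(w for w, _ in kept))
        else:
            merged.append(0.0)
    return merged

def dare_ties(base, sources, weights, density=0.53, seed=42):
    """Merge flat weight lists: base plus elected, sparsified task vectors."""
    rng = random.Random(seed)
    deltas = [dare_sparsify([s - b for s, b in zip(src, base)], density, rng)
              for src in sources]
    return [b + m for b, m in zip(base, ties_merge(deltas, weights))]

# Toy 4-element "tensors", for illustration only.
base = [0.0, 1.0, -1.0, 0.5]
sources = [[0.2, 1.1, -1.0, 0.4], [0.1, 0.9, -0.8, 0.5], [-0.3, 1.0, -1.2, 0.7]]
print(dare_ties(base, sources, weights=[0.40, 0.35, 0.25]))
```

The density and seed defaults mirror the values quoted above; the rescale by 1/density keeps the expected magnitude of each task vector unchanged after random dropping.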

Benchmark Results (Q6_K quantization)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0 except GPQA at 0.6 to match v2's published methodology.

v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)

All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs.

| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
|---|---|---|---|---|---|
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 84.76% (139/164) | 0.00 pp | +5.49 pp |
| MBPP pass@1 (500q), raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |
| MBPP pass@1 (500q), corrected* | 57.60% | 74.60% | 73.40% | +15.80 pp | −1.20 pp |
| GPQA Diamond pass@1 (flex), see ‡ | not measured (∇) | 69.19% (full 198q) | ≈ 84.75% (partial 177q) | n/a | ≈ +15.5 pp |

Key observations:

  • HumanEval matches base exactly (139/164 = 0.847560975... for both). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas don't move the needle. This is also a useful sanity check: it is consistent with the MLP-passthrough surgery having done its job, since the model's pass count on "elementary coding" is unchanged from the base it inherited its MLPs from.
  • MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
  • GPQA is the marquee win — ≈ +15.5 pp over v2 (which itself was +16 pp over its source models). The Qwen3.6 base brings stronger reasoning, and the merge preserves and slightly amplifies it.

∇ Skipped a base GPQA run because (a) v2's published GPQA is the canonical reference for "is this merge valuable?" — that's what we benchmark against, and (b) the same aiohttp lifecycle bug that bit our v4-MLP run would have bit a base run too.

* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.

  • v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
  • Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
  • v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.

Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
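The correction can be illustrated in a few lines. This is a minimal sketch of the stripping idea only, not the contents of rescore_mbpp_strip_think.py (which also handles markdown fences); the function names are hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)   # closed blocks
THINK_OPEN = re.compile(r"<think>.*\Z", re.DOTALL)           # unclosed, runs to EOF

def strip_think(completion):
    """Remove closed <think> blocks, then any trailing unclosed one."""
    text = THINK_BLOCK.sub("", completion)
    text = THINK_OPEN.sub("", text)
    return text.lstrip("\n")

def compiles(source):
    """True if the source parses as Python (the step exec() would fail on first)."""
    try:
        compile(source, "<completion>", "exec")
        return True
    except SyntaxError:
        return False

raw = "<think>\nadd the two numbers\n</think>\n\ndef add(a, b):\n    return a + b\n"
print(compiles(raw), compiles(strip_think(raw)))   # prints "False True"
```

An unclosed block is dropped through to end of text, so a truncated completion becomes an empty string and still fails the tests, just without a misleading SyntaxError.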

‡ GPQA partial result (important caveat): the full lm_eval run completed 192/198 questions before crashing repeatedly on the last 6. The root cause is an aiohttp lifecycle issue in lm_eval.models.api_models.amodel_call: the at-budget reasoning responses (16384 tokens, ~9 minutes wall time each) consistently outlast the aiohttp ClientSession, and the resulting RuntimeError: Session is closed is unrecoverable within the same process. We restarted lm_eval 5 times across a ~4-hour window; each restart gained ~1 question before crashing on the same long tail, so the final 6 questions were not scored.

The 84.75% is computed by scripts/score_gpqa_partial.py, which replicates lm_eval's multi_choice_regex flexible-extract filter (group_select=−1, ignore_case=True, ignore_punctuation=True) over the 192 cached responses. Of those, 177 matched our process_docs-replicated GPQA prompts; the 15 unmatched are minor unicode-normalization or seed-timing artifacts in the reconstruction, and the 6 uncached are the at-budget tail. 150/177 correct gives 84.75% partial pass@1. We expect the 15 unmatched + 6 uncached questions to move the headline number only slightly, with the final result most likely landing in the 82-86% band, but that remains an estimate until all 198 questions are scored.

As a prerequisite we also patched lm_eval's api_models.py:545 UnboundLocalError bug (it crashes on a transient TimeoutError before outputs is assigned); see scripts/score_gpqa_partial.py and the inline patch recipe in this repo's commit history for the exact replication.
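How far the unscored questions could move the headline is simple arithmetic on the counts quoted above. The sketch below computes the partial score, the hard bounds, and a few intermediate scenarios; it makes no claim about which scenario is likely.

```python
TOTAL = 198       # GPQA Diamond questions
SCORED = 177      # cached responses matched to a prompt
CORRECT = 150     # of those, correct under flexible-extract
unscored = TOTAL - SCORED   # 15 unmatched + 6 uncached

partial = CORRECT / SCORED
worst = CORRECT / TOTAL                 # every unscored question wrong
best = (CORRECT + unscored) / TOTAL     # every unscored question right

def final_score(unscored_correct):
    """Full-set pass@1 if `unscored_correct` of the unscored questions are right."""
    return (CORRECT + unscored_correct) / TOTAL

print(f"partial {partial:.2%}, hard bounds {worst:.2%} .. {best:.2%}")
for k in (12, 15, 18):
    print(f"{k}/{unscored} of the unscored correct -> {final_score(k):.2%}")
```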

Why "MLP-passthrough"

When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model opened a <think> block on 80% of coding prompts and left 88% of those blocks unclosed; pass@1 collapsed to ~20% (mbpp-10 isolation set, table below). Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base, so the policy attractor lived in the MLPs, not in the token-emission logits.
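The rel-L2 figure is the norm of the weight delta divided by the norm of the base tensor. A toy sketch of that measurement over flat lists follows; it is not inspect_v4_delta.py (which streams safetensors shards), and the tensor names and values are made up.

```python
import math

def rel_l2(base, merged):
    """Relative L2 delta ||merged - base|| / ||base|| for one flattened tensor."""
    num = math.sqrt(sum((m - b) ** 2 for m, b in zip(merged, base)))
    den = math.sqrt(sum(b * b for b in base))
    return num / den if den else float("inf")

def rank_deltas(base_sd, merged_sd):
    """Return (name, rel_l2) pairs sorted from largest to smallest delta."""
    scores = [(name, rel_l2(base_sd[name], merged_sd[name])) for name in base_sd]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy tensors with hypothetical names, for illustration only.
base_sd = {"layers.30.mlp.up_proj.weight": [1.0, -2.0, 2.0],
           "lm_head.weight": [0.5, 0.5, 0.5]}
merged_sd = {"layers.30.mlp.up_proj.weight": [1.03, -2.0, 1.96],
             "lm_head.weight": [0.5, 0.5, 0.5]}
for name, score in rank_deltas(base_sd, merged_sd):
    print(f"{name}: {score:.2%}")
```

Sorting by relative rather than absolute delta keeps large-norm tensors from dominating the ranking, which is what lets a ~2% shift in a mid-layer MLP stand out.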

We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
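Conceptually the surgery is a selective copy keyed on tensor name. The sketch below shows the idea on plain dictionaries; it is not v4_mlp_passthrough.py, the tensor names are hypothetical, and strings stand in for real tensors.

```python
MLP_PATTERNS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")

def mlp_passthrough(merged_sd, base_sd, patterns=MLP_PATTERNS):
    """Return a new state dict: MLP projections from base, the rest from the merge."""
    out = {}
    for name, tensor in merged_sd.items():
        if any(pattern in name for pattern in patterns):
            out[name] = base_sd[name]   # restore the base MLP weight verbatim
        else:
            out[name] = tensor          # keep the merged attn / norm / embed weight
    return out

# Toy state dicts with hypothetical tensor names and string stand-ins for tensors.
base_sd = {"layers.0.mlp.gate_proj.weight": "base-gate",
           "layers.0.self_attn.q_proj.weight": "base-q"}
merged_sd = {"layers.0.mlp.gate_proj.weight": "merged-gate",
             "layers.0.self_attn.q_proj.weight": "merged-q"}
print(mlp_passthrough(merged_sd, base_sd))
```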

Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations

| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
|---|---|---|---|
| <think> open rate (mbpp-10 isolation) | 40% | 80% | 0% |
| Unclosed </think> | 0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | low | 80% | 0/10 |

Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.

The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks, but the full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval ~5.5 pp over Qwen3.5-Omnimerge-v2 while staying roughly tied on MBPP (−1.2 pp on the corrected score).

Compatibility

Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).

Inference works under:

  • transformers (BF16) — both use_cache=True and False paths
  • llama.cpp (GGUF) — recommended args: --reasoning-format deepseek --reasoning-budget 8192
  • vLLM (untested at time of publish, expected to work)

Scripts

All merge tooling is in the scripts/ directory of this repo:

Script Purpose
dare_ties_merge.py Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip).
v4_mlp_passthrough.py Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe — see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6.
inspect_v4_delta.py Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region.
pod_omnimerge_v4_build.sh Full reproducible build script (download sources, run merge, convert + quantize Q6_K).
pod_omnimerge_v4mlp_eval_raw.sh Eval orchestrator: mbpp + humaneval via raw /v1/completions. Required for reasoning-tag-emitting models — apply_chat_template + deepseek extraction strips think blocks and returns empty.
rescore_mbpp_strip_think.py Re-scoring tool that strips <think> blocks and markdown fences before exec(code+tests). Recovered 25 false failures out of this model's 158 raw MBPP failures.
score_gpqa_partial.py Partial-cache GPQA scorer. Replicates lm_eval's multi_choice_regex flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's hash_args("generate_until", [prompt, gen_kwargs]) SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail.
pod_v4mlp_gpqa.sh Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology).

Reproducing the merge

python scripts/dare_ties_merge.py \
    --method omnimerge_v2 \
    --base /path/to/Qwen3.6-27B \
    --source /path/to/Qwen3.6-rico03 \
    --source /path/to/Qwen3.6-Esper3.1 \
    --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
    --weights 0.40,0.35,0.25 \
    --density 0.53 \
    --darex-q 0.75 \
    --output ./Qwen3.6-27B-Omnimerge-v4 \
    --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)

Caveats

  • Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
  • MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).

Acknowledgements
