Qwen 3.6 35B-A3B Hi-Fi — MTP runtime variant

Runtime-verified. Same calibration as the Hi-Fi Q4_K_M variant, with the MTP draft head (blk.40) retained for native speculative decoding in llama-server.

🖥️ Datacenter target (A100-80GB). The speedup below is a datacenter-GPU result. On Apple Silicon / consumer GPUs there is no measured speedup — for local runs use the Hi-Fi GGUF.

One-line claim

Qwen 3.6 35B-A3B MTP runtime path verified in llama-server: ~76.0% draft acceptance (95% CI 75.7–76.3%) and ~1.49× decode speedup (95% CI 1.46–1.52×) on a 50-prompt chat-template suite, A100-80GB, k=4. 3-seed lock (seeds 42, 1337, 2024).

This is runtime verification, not a calibrated draft-head claim. The MTP block inherits the Hi-Fi main-model quant policy plus one Q8_0 override on nextn.eh_proj.

Requirements

llama-server with the Qwen 3.5/3.6 NextN MTP loader (llama.cpp main commit 2f6c815dc… or later)
Send requests to the /v1/chat/completions endpoint, not /completion (see Caveats below)

How to run

llama-server \
  -m qwen36-35b-a3b-hi-fi-mtp-runtime.gguf \
  -ngl 999 -c 4096 \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 --spec-draft-n-min 1

Request body:

{
  "messages": [{"role": "user", "content": "Write a Python function ..."}],
  "max_tokens": 256,
  "temperature": 0,
  "speculative": {"type": "draft-mtp", "n_max": 4, "n_min": 1}
}

Inspect acceptance via timings.draft_n (proposed) and timings.draft_n_accepted (accepted).
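
A minimal client sketch for the request above, using only the Python standard library. It assumes the server listens on http://localhost:8080 (adjust `base_url` to your setup) and that the response carries a top-level `timings` object with the `draft_n` and `draft_n_accepted` fields named above. The helper names `build_request`, `acceptance_rate`, and `chat` are illustrative, not part of llama.cpp.

```python
import json
import urllib.request
from typing import Optional


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Request body mirroring the example above (greedy, MTP draft, k=4)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "speculative": {"type": "draft-mtp", "n_max": 4, "n_min": 1},
    }


def acceptance_rate(timings: dict) -> Optional[float]:
    """Accepted / proposed draft tokens, or None when nothing was drafted."""
    proposed = timings.get("draft_n", 0)
    accepted = timings.get("draft_n_accepted", 0)
    if not proposed:
        return None
    return accepted / proposed


def chat(prompt: str, base_url: str = "http://localhost:8080") -> dict:
    """POST to /v1/chat/completions and return the parsed JSON response.

    base_url is an assumption; point it at wherever llama-server is running.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `acceptance_rate(chat("Write a Python function ...").get("timings", {}))` returns the per-request acceptance as a fraction, or `None` if the response reports no drafted tokens.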

Measured numbers (3-seed lock, 2026-05-29)

arm	mean acceptance	acceptance ±2σ	mean tok/s	speedup vs vanilla
overall	75.99 %	±0.30 pp	198.96	1.489× ± 0.030
code	73.69 %	±0.64 pp	193.82	1.451×
math	78.07 %	±0.36 pp	204.23	1.529×
general	76.61 %	±0.48 pp	198.67	1.487×

Vanilla (non-spec) decode baseline: 133.61 tok/s. Hardware: A100-80GB. Context: 4K. n_predict: 64. Greedy (temp=0, top_k=1).

Per-seed: seed 42 → 75.97%/1.472× · seed 1337 → 76.14%/1.502× · seed 2024 → 75.85%/1.493×.
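
The speedup column can be re-derived from the reported throughput, which is a quick consistency check on the table. The sketch below assumes speedup is simply mean tok/s divided by the vanilla baseline, and that the overall figures are plain means over the three seeds; all numbers are copied from the table and the per-seed line above.

```python
from statistics import mean

BASELINE_TOK_S = 133.61  # vanilla (non-spec) decode baseline

# Mean tok/s per arm, copied from the table above.
ARM_TOK_S = {"overall": 198.96, "code": 193.82, "math": 204.23, "general": 198.67}

# (acceptance %, speedup) per seed, copied from the per-seed line above.
PER_SEED = {42: (75.97, 1.472), 1337: (76.14, 1.502), 2024: (75.85, 1.493)}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Decode speedup relative to the non-speculative baseline."""
    return tok_s / baseline


arm_speedups = {arm: round(speedup(v), 3) for arm, v in ARM_TOK_S.items()}
mean_acceptance = round(mean(a for a, _ in PER_SEED.values()), 2)
mean_speedup = round(mean(s for _, s in PER_SEED.values()), 3)

print(arm_speedups, mean_acceptance, mean_speedup)
```

Under those assumptions the computed values match the table: 1.489, 1.451, 1.529, and 1.487 per arm, with a seed-mean acceptance of 75.99 % and a seed-mean speedup of 1.489.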

Draft-window sensitivity

k	acceptance	tok/s	speedup
3	73.0 %	173.8	1.30×
4 (shipped)	76.0 %	199.0	1.49×
6	47.3 %	153.0	1.14×
8	35.7 %	109.8	0.82×

k=4 is the measured optimum among the tested window sizes (3, 4, 6, 8). Higher k collapses per-position acceptance because each deeper draft token is conditioned on the head's own earlier (potentially wrong) predictions, so errors compound. Lower k underuses the draft window.
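
One rough way to read the sweep is to convert acceptance into accepted draft tokens per round. The sketch below assumes every draft round proposes exactly k tokens, which is a simplification (with n_min=1 the server may propose fewer), so treat the result as an upper-bound estimate rather than a measurement. The data is copied from the table above; `accepted_per_round` is an illustrative helper.

```python
# k -> (acceptance %, tok/s), copied from the sensitivity table above.
SWEEP = {3: (73.0, 173.8), 4: (76.0, 199.0), 6: (47.3, 153.0), 8: (35.7, 109.8)}


def accepted_per_round(k: int, acceptance_pct: float) -> float:
    """Estimated accepted draft tokens per round, assuming k tokens proposed."""
    return k * acceptance_pct / 100


# Window size with the highest measured throughput.
best_k = max(SWEEP, key=lambda k: SWEEP[k][1])

for k, (acc, tok_s) in SWEEP.items():
    print(f"k={k}: ~{accepted_per_round(k, acc):.2f} accepted/round, {tok_s} tok/s")
print("best k by tok/s:", best_k)
```

Under that assumption the estimate is about 2.19, 3.04, 2.84, and 2.86 accepted tokens per round for k = 3, 4, 6, and 8. Going past k=4 does not buy more accepted tokens, while every extra proposed token still has to be drafted and verified, which is consistent with throughput falling at k=6 and k=8.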

Caveats

Invocation-sensitive. Sending raw prompts to /completion (no chat template) drops acceptance to ~63% and speedup to 1.30×. Use /v1/chat/completions or apply the Qwen chat template manually.
Datacenter artifact (A100-80GB only). The 1.49× speedup is measured on A100-80GB at 4K context. On consumer hardware (Apple Silicon / M-series, single consumer GPUs) the A3B MoE is compute-bound, so speculative verification yields no measured net speedup; MTP on those targets is a compatibility smoke test, not a speed win. For local / Apple Silicon use, run the Hi-Fi GGUF instead (same calibration, no MTP head, no special runtime). Longer contexts may also shift the speedup.
Requires llama-server. llama-cpp-python (as of 2026-05-29) does not expose the MTP draft path. Bare llama-cli does not run server-side speculative decoding.
Runtime-verified, not calibrated. The MTP block uses the main-model imatrix (which has no activation statistics for blk.40). Whether a calibrated draft head improves acceptance is open research.
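
For the invocation caveat, applying the chat template manually means wrapping the prompt in the ChatML-style markers that Qwen models use before sending it to /completion. The sketch below is a simplified approximation: the template embedded in the GGUF metadata may add further fields (for example a default system prompt or reasoning tags), so /v1/chat/completions remains the safer path. `render_chatml` is an illustrative helper, not a llama.cpp function.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a simplified ChatML layout.

    Approximation only: the model's own template may contain extra fields.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Write a Python function ..."}])
print(prompt)
```

The resulting string would go in the `prompt` field of a /completion request in place of the raw text.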

Provenance

Base model: unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF BF16 revision 5c2410d71524f4f72b023ce8daf7a80528226d5f (MTP-inclusive)
Imatrix: imatrix_codemath_v1_256k.dat (sha 5872a78f610050d2fccdce0c13ae450a472647c9fb297fe0a7ccaf2dfa945460) — calibrated on the code+math corpus, 256K tokens
llama.cpp: main commit 2f6c815dc450106ef877ae32f4472bfd5cf83e47
Artifact sha256: 457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c
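
To confirm a download matches the artifact hash listed above, the file can be hashed in chunks with Python's `hashlib`. This is a minimal sketch; the file name is taken from the run command earlier in this card and is assumed to be in the current directory, and `sha256_of` and `verify` are illustrative helpers.

```python
import hashlib
from pathlib import Path

# Artifact sha256 as listed under Provenance.
EXPECTED_SHA256 = "457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB GGUFs do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's sha256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("qwen36-35b-a3b-hi-fi-mtp-runtime.gguf"))` should return `True` for an intact download.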

All measurements and sensitivity probes are reproducible from the receipts under receipts/.

Hi-Fi (main model): fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF — same calibration, no MTP head.
D1 (Mistral 7B fraQtl sidecars): fraQtl/Mistral-7B-v0.3-fraqtl-sidecars.

License

Inherits the Qwen 3.6 license from the upstream base model.
