Qwen 3.6 35B-A3B Hi-Fi — MTP runtime variant

Runtime-verified. Same calibration as the Hi-Fi Q4_K_M variant, with the MTP draft head (blk.40) retained for native speculative decoding in llama-server.

🖥️ Datacenter target (A100-80GB). The speedup below is a datacenter-GPU result. On Apple Silicon / consumer GPUs there is no measured speedup — for local runs use the Hi-Fi GGUF.

One-line claim

Qwen 3.6 35B-A3B MTP runtime path verified in llama-server: ~76.0% draft acceptance (95% CI 75.7–76.3%) and ~1.49× decode speedup (95% CI 1.46–1.52×) on a 50-prompt chat-template suite, A100-80GB, k=4. 3-seed lock (seeds 42, 1337, 2024).

This is runtime verification, not a calibrated draft-head claim. The MTP block inherits the Hi-Fi main-model quant policy plus one Q8_0 override on nextn.eh_proj.

Requirements

  • llama-server with Qwen 3.5/3.6 NextN MTP loader (llama.cpp main commit 2f6c815dc… or later)
  • Send requests via the /v1/chat/completions endpoint (NOT /completion — see caveat below)

How to run

llama-server \
  -m qwen36-35b-a3b-hi-fi-mtp-runtime.gguf \
  -ngl 999 -c 4096 \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 --spec-draft-n-min 1

Request body:

{
  "messages": [{"role": "user", "content": "Write a Python function ..."}],
  "max_tokens": 256,
  "temperature": 0,
  "speculative": {"type": "draft-mtp", "n_max": 4, "n_min": 1}
}

Inspect acceptance via timings.draft_n (proposed) and timings.draft_n_accepted (accepted).
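
As a rough illustration of the request and the acceptance check, the sketch below builds the chat-completions request with Python's standard library and computes the acceptance ratio from the two timing fields. The `speculative` body and the `timings` field names come from the text above; the base URL `http://localhost:8080` and the helper names `build_chat_request` and `draft_acceptance` are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Optional

# Assumed local llama-server address; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "speculative": {"type": "draft-mtp", "n_max": 4, "n_min": 1},
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def draft_acceptance(response: dict) -> Optional[float]:
    """Return draft_n_accepted / draft_n, or None if nothing was drafted."""
    timings = response.get("timings", {})
    proposed = timings.get("draft_n", 0)
    accepted = timings.get("draft_n_accepted", 0)
    if not proposed:
        return None
    return accepted / proposed
```

Against a running server, passing the request to `urllib.request.urlopen` and feeding the decoded JSON to `draft_acceptance` should give a per-request ratio comparable to the figures reported below.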

Measured numbers (3-seed lock, 2026-05-29)

| arm | mean acceptance | ±2σ | mean tok/s | speedup vs vanilla |
|---|---|---|---|---|
| overall | 75.99% | ±0.30 pp | 198.96 | 1.489× ± 0.030 |
| code | 73.69% | ±0.64 pp | 193.82 | 1.451× |
| math | 78.07% | ±0.36 pp | 204.23 | 1.529× |
| general | 76.61% | ±0.48 pp | 198.67 | 1.487× |

Vanilla (non-spec) decode baseline: 133.61 tok/s. Hardware: A100-80GB. Context: 4K. n_predict: 64. Greedy (temp=0, top_k=1).
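
The speedup column appears to be the ratio of mean MTP throughput to this vanilla baseline. The source does not state the formula, so treat that as an assumption; the short check below recomputes the ratios from the reported tok/s figures. The names `BASELINE_TOK_S`, `MTP_TOK_S`, and `speedup` are illustrative.

```python
# Vanilla (non-speculative) decode baseline from the text, in tok/s.
BASELINE_TOK_S = 133.61

# Mean MTP decode throughput per arm, copied from the measured-numbers table.
MTP_TOK_S = {"overall": 198.96, "code": 193.82, "math": 204.23, "general": 198.67}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Decode speedup expressed as a plain throughput ratio."""
    return tok_s / baseline


# Rounded to three decimals for comparison with the table.
speedups = {arm: round(speedup(v), 3) for arm, v in MTP_TOK_S.items()}
print(speedups)
```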

Per-seed: seed 42 → 75.97%/1.472× · seed 1337 → 76.14%/1.502× · seed 2024 → 75.85%/1.493×.
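
One plausible reading of the ±2σ column is twice the sample standard deviation across the three seeds. That is an assumption, not something the card states. The sketch below applies it to the per-seed figures; the result lands close to the reported ±0.30 pp and ±0.030, with the small gap plausibly due to the per-seed values being rounded.

```python
import statistics

# Per-seed results quoted in the text: seed -> (acceptance %, speedup).
PER_SEED = {42: (75.97, 1.472), 1337: (76.14, 1.502), 2024: (75.85, 1.493)}


def mean_and_two_sigma(values):
    """Return (mean, 2 * sample standard deviation) of the values."""
    return statistics.mean(values), 2 * statistics.stdev(values)


acc_mean, acc_2s = mean_and_two_sigma([a for a, _ in PER_SEED.values()])
spd_mean, spd_2s = mean_and_two_sigma([s for _, s in PER_SEED.values()])
print(f"acceptance {acc_mean:.2f}% ± {acc_2s:.2f} pp")
print(f"speedup {spd_mean:.3f}× ± {spd_2s:.3f}")
```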

Draft-window sensitivity

| k | acceptance | tok/s | speedup |
|---|---|---|---|
| 3 | 73.0% | 173.8 | 1.30× |
| 4 (shipped) | 76.0% | 199.0 | 1.49× |
| 6 | 47.3% | 153.0 | 1.14× |
| 8 | 35.7% | 109.8 | 0.82× |

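k=4 is the measured optimum among the tested windows (3, 4, 6, 8). Higher k collapses acceptance because each deeper draft position is conditioned on the head's own (potentially wrong) earlier predictions, so errors compound; at k=8 decoding is slower than vanilla. Lower k underuses the draft window.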
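
A small sketch of how the sweep can be read programmatically: pick the window with the highest throughput and flag any window that falls below the vanilla baseline. The data is copied from the sensitivity table and the 133.61 tok/s baseline quoted earlier; the helper names are illustrative, not part of any tool.

```python
BASELINE_TOK_S = 133.61  # vanilla decode baseline from the text

# Draft window k -> (acceptance %, tok/s), copied from the sensitivity table.
SWEEP = {3: (73.0, 173.8), 4: (76.0, 199.0), 6: (47.3, 153.0), 8: (35.7, 109.8)}


def best_k(sweep):
    """Pick the draft window with the highest measured throughput."""
    return max(sweep, key=lambda k: sweep[k][1])


def slower_than_vanilla(sweep, baseline=BASELINE_TOK_S):
    """List the k values whose throughput falls below the non-spec baseline."""
    return sorted(k for k, (_, tok_s) in sweep.items() if tok_s < baseline)


print(best_k(SWEEP), slower_than_vanilla(SWEEP))
```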

Caveats

  • Invocation-sensitive. Sending raw prompts to /completion (no chat template) drops acceptance to ~63% and speedup to 1.30×. Use /v1/chat/completions or apply the Qwen chat template manually.
  • Datacenter artifact: A100-80GB only. The 1.49× speedup was measured on A100-80GB at 4K context. On consumer hardware (Apple Silicon M-series, single consumer GPUs) the A3B MoE is compute-bound, so speculative verification yields no measured net speedup; on those targets MTP is a compatibility smoke test, not a speed win. For local or Apple Silicon use, run the Hi-Fi GGUF instead (same calibration, no MTP head, no special runtime). Longer contexts may also shift the speedup.
  • Requires llama-server. llama-cpp-python (as of 2026-05-29) does not expose the MTP draft path. Bare llama-cli does not run server-side speculative decoding.
  • Runtime-verified, not calibrated. The MTP block uses the main-model imatrix, which has no activation statistics for blk.40. Whether a calibrated draft head would improve acceptance is an open question.

Provenance

  • Base model: unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF BF16 revision 5c2410d71524f4f72b023ce8daf7a80528226d5f (MTP-inclusive)
  • Imatrix: imatrix_codemath_v1_256k.dat (sha 5872a78f610050d2fccdce0c13ae450a472647c9fb297fe0a7ccaf2dfa945460) — calibrated on the code+math corpus, 256K tokens
  • llama.cpp: main commit 2f6c815dc450106ef877ae32f4472bfd5cf83e47
  • Artifact sha256: 457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c
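
To check a downloaded file against the artifact digest above, a streaming SHA-256 comparison is enough. The sketch below is a minimal version; the helper names are illustrative, and the filename in the usage note is taken from the run command and assumed to be the local path.

```python
import hashlib
from pathlib import Path

# Artifact digest from the Provenance list.
EXPECTED_SHA256 = "457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c"


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a multi-GB GGUF need not fit in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks until read() returns the empty-bytes sentinel.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("qwen36-35b-a3b-hi-fi-mtp-runtime.gguf")` should return `True` for an intact download.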

All measurements and sensitivity probes are reproducible from the receipts under receipts/.

License

Inherits the Qwen 3.6 license from the upstream base model.
