Qwen 3.6 35B-A3B Hi-Fi — MTP runtime variant
Runtime-verified. Same calibration as the Hi-Fi Q4_K_M variant, with the MTP draft head (blk.40) retained for native speculative decoding in llama-server.
🖥️ Datacenter target (A100-80GB). The speedup below is a datacenter-GPU result. On Apple Silicon / consumer GPUs there is no measured speedup — for local runs use the Hi-Fi GGUF.
One-line claim
Qwen 3.6 35B-A3B MTP runtime path verified in llama-server: ~76.0% draft acceptance (95% CI 75.7–76.3%) and ~1.49× decode speedup (95% CI 1.46–1.52×) on a 50-prompt chat-template suite, A100-80GB, k=4. 3-seed lock (seeds 42, 1337, 2024).
This is runtime verification, not a calibrated draft-head claim. The MTP block inherits the Hi-Fi main-model quant policy plus one Q8_0 override on nextn.eh_proj.
Requirements
- llama-server with Qwen 3.5/3.6 NextN MTP loader (llama.cpp main commit 2f6c815dc… or later)
- Send requests via the /v1/chat/completions endpoint (NOT /completion; see caveat below)
How to run
llama-server \
-m qwen36-35b-a3b-hi-fi-mtp-runtime.gguf \
-ngl 999 -c 4096 \
--spec-type draft-mtp \
--spec-draft-n-max 4 --spec-draft-n-min 1
Request body:
{
"messages": [{"role": "user", "content": "Write a Python function ..."}],
"max_tokens": 256,
"temperature": 0,
"speculative": {"type": "draft-mtp", "n_max": 4, "n_min": 1}
}
Inspect acceptance via timings.draft_n (proposed) and timings.draft_n_accepted (accepted).
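A minimal sketch of the request-and-inspect loop: it builds the request body shown above, posts it to /v1/chat/completions, and derives the acceptance ratio from timings.draft_n and timings.draft_n_accepted. The base URL (http://127.0.0.1:8080) and the helper names build_body, post_chat and acceptance_rate are assumptions for illustration, not part of this release; adjust the host and port to match your llama-server launch.

```python
import json
import urllib.request

# Assumed address: adjust to wherever your llama-server is listening.
BASE_URL = "http://127.0.0.1:8080"


def build_body(prompt: str, n_max: int = 4, n_min: int = 1) -> dict:
    """Request body mirroring the example above (greedy, MTP drafting)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0,
        "speculative": {"type": "draft-mtp", "n_max": n_max, "n_min": n_min},
    }


def post_chat(body: dict, base_url: str = BASE_URL) -> dict:
    """POST to the chat endpoint and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def acceptance_rate(response: dict):
    """Accepted / proposed draft tokens, or None if nothing was drafted."""
    timings = response.get("timings", {})
    proposed = timings.get("draft_n", 0)
    accepted = timings.get("draft_n_accepted", 0)
    return accepted / proposed if proposed else None
```

With a server running, `acceptance_rate(post_chat(build_body("Write a Python function ...")))` returns the per-request ratio. Averaging it over a prompt suite is what the tables below summarize.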
Measured numbers (3-seed lock, 2026-05-29)
| arm | mean acceptance | ±2σ | mean tok/s | speedup vs vanilla |
|---|---|---|---|---|
| overall | 75.99 % | ±0.30 pp | 198.96 | 1.489× ± 0.030 |
| code | 73.69 % | ±0.64 pp | 193.82 | 1.451× |
| math | 78.07 % | ±0.36 pp | 204.23 | 1.529× |
| general | 76.61 % | ±0.48 pp | 198.67 | 1.487× |
Vanilla (non-spec) decode baseline: 133.61 tok/s. Hardware: A100-80GB. Context: 4K. n_predict: 64. Greedy (temp=0, top_k=1).
Per-seed: seed 42 → 75.97%/1.472× · seed 1337 → 76.14%/1.502× · seed 2024 → 75.85%/1.493×.
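The overall row can be cross-checked from the per-seed figures. The sketch below assumes the ± column is twice the sample standard deviation across the three seeds; the card does not state this, so treat it as a plausible reading, not the author's documented method. Because the per-seed inputs are already rounded, the result lands close to the table but may differ in the last digit.

```python
from statistics import mean, stdev

# Per-seed results copied from the line above: (acceptance %, speedup).
PER_SEED = {42: (75.97, 1.472), 1337: (76.14, 1.502), 2024: (75.85, 1.493)}


def summarize(values):
    """Return (mean, 2 * sample standard deviation) for a list of numbers."""
    return mean(values), 2 * stdev(values)


acc_mean, acc_2sigma = summarize([a for a, _ in PER_SEED.values()])
spd_mean, spd_2sigma = summarize([s for _, s in PER_SEED.values()])

print(f"acceptance: {acc_mean:.2f} % ± {acc_2sigma:.2f} pp")
print(f"speedup:    {spd_mean:.3f}x ± {spd_2sigma:.3f}")
```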
Draft-window sensitivity
| k | acceptance | tok/s | speedup |
|---|---|---|---|
| 3 | 73.0 % | 173.8 | 1.30× |
| 4 (shipped) | 76.0 % | 199.0 | 1.49× |
| 6 | 47.3 % | 153.0 | 1.14× |
| 8 | 35.7 % | 109.8 | 0.82× |
k=4 is the best of the four window sizes measured. Higher k collapses per-position acceptance because deeper drafts compound on the head's own (potentially wrong) previous predictions; at k=8 decoding is slower than vanilla (0.82×). Lower k underuses the draft window.
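The speedup column is the ratio of speculative tok/s to the vanilla baseline, so the choice of k can be recomputed from the table. This sketch uses only the rounded tok/s values and the 133.61 tok/s baseline reported in this card, so a recomputed ratio may differ from the table in the last digit (k=6 is one such case).

```python
# Vanilla decode baseline and sensitivity rows copied from the tables above.
BASELINE_TOK_S = 133.61
TOK_S_BY_K = {3: 173.8, 4: 199.0, 6: 153.0, 8: 109.8}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Decode speedup relative to non-speculative decoding."""
    return tok_s / baseline


speedups = {k: speedup(v) for k, v in TOK_S_BY_K.items()}
best_k = max(speedups, key=speedups.get)

for k, s in speedups.items():
    note = " (slower than vanilla)" if s < 1.0 else ""
    print(f"k={k}: {s:.2f}x{note}")
print("best k among those measured:", best_k)
```

Only four window sizes were probed, so this identifies the best measured setting, not a global optimum.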
Caveats
- Invocation-sensitive. Sending raw prompts to /completion (no chat template) drops acceptance to ~63% and speedup to 1.30×. Use /v1/chat/completions or apply the Qwen chat template manually.
- Datacenter artifact: A100-80GB only. The 1.49× speedup is measured on A100-80GB at 4K context. On consumer hardware (Apple Silicon / M-series, single consumer GPUs) the A3B MoE is compute-bound, so speculative verification yields no measured net speedup; MTP on those targets is a compatibility smoke test, not a speed win. For local / Apple Silicon use, run the Hi-Fi GGUF instead (same calibration, no MTP head, no special runtime). Longer contexts may also shift the speedup.
- Requires llama-server. llama-cpp-python (as of 2026-05-29) does not expose the MTP draft path. Bare llama-cli does not run server-side speculative decoding.
- Runtime-verified, not calibrated. The MTP block uses the main-model imatrix (which has no activation statistics for blk.40). Whether a calibrated draft head improves acceptance is open research.
Provenance
- Base model: unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF BF16 revision 5c2410d71524f4f72b023ce8daf7a80528226d5f (MTP-inclusive)
- Imatrix: imatrix_codemath_v1_256k.dat (sha 5872a78f610050d2fccdce0c13ae450a472647c9fb297fe0a7ccaf2dfa945460), calibrated on the code+math corpus, 256K tokens
- llama.cpp: main commit 2f6c815dc450106ef877ae32f4472bfd5cf83e47
- Artifact sha256: 457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c
All measurements + sensitivity probes are reproducible from the receipts under receipts/.
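A downloaded artifact can be checked against the published digest before running it. The sketch below assumes the "Artifact sha256" value refers to the GGUF file named in "How to run"; the helper names sha256_of and verify are illustrative.

```python
import hashlib
from pathlib import Path

# Expected digest copied from the Provenance list above.
EXPECTED_SHA256 = "457114c45cd1b918ce71cfe01dcc7f4f70a0853092c911d9c6760ef16e25443c"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a multi-GB GGUF never sits in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA-256 matches the published digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("qwen36-35b-a3b-hi-fi-mtp-runtime.gguf")` should return True for an intact download, provided the digest does cover that file.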
Related
- Hi-Fi (main model): fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF (same calibration, no MTP head).
- D1 (Mistral 7B fraQtl sidecars): fraQtl/Mistral-7B-v0.3-fraqtl-sidecars.
License
Inherits the Qwen 3.6 license from the upstream base model.