Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

NVFP4 (modelopt W4A4) quant of Huihui-Qwen3.6-27B-abliterated — a Qwen3.5-family hybrid model (linear attention + periodic full attention) with a built-in MTP (multi-token-prediction) head for speculative decoding. Multimodal-capable (Qwen3_5ForConditionalGeneration, vision/video tokens) but served here as a text generation / reasoning + tool-calling model. Fits 4× 16 GB Blackwell (SM120).

  • ~7.2 GiB/GPU weights at TP=4 · 64K–262K context · <think> reasoning · XML tool-calls
  • Single-stream ~81–83 tok/s (TP=4, MTP n=3); peak ~880 tok/s @ 24 concurrent (64K)

TL;DR — run it (no build required)

The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP draft module, so you do not need to build anything.

# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test          # waits for /v1/models
./run.sh bench         # one-shot smoke test

Or the raw docker run (what run.sh/compose.yaml wrap):

docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
    --served-model-name huihui-qwen36-27b-local \
    --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
    --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --chat-template /model/chat_template.jinja \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --host 0.0.0.0 --port 8000

Smoke test:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"huihui-qwen36-27b-local",
  "messages":[{"role":"user","content":"東京の名所を3つ、簡潔に。"}],
  "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .

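The same smoke test can be driven from Python. The sketch below is a minimal client using only the standard library. It assumes the server from the docker run above is listening on localhost:8000. The helper names (build_payload, split_reasoning, chat) are illustrative and not part of this package, and the fallback to a "reasoning" field is an assumption about other vLLM versions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default port from the docker run above
MODEL = "huihui-qwen36-27b-local"       # value passed to --served-model-name


def build_payload(prompt, max_tokens=512):
    """Build a chat-completions body with the sampling values used in the curl test."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.95,
    }


def split_reasoning(response):
    """Return (reasoning, answer) from the first choice of a decoded response."""
    message = response["choices"][0]["message"]
    # reasoning_content is the field named in this README; "reasoning" is an
    # assumed fallback in case another vLLM version names it differently.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    return reasoning, message.get("content") or ""


def chat(prompt, base_url=BASE_URL, timeout=300):
    """POST one chat request and return the decoded JSON (needs a running server)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With the server up, `reasoning, answer = split_reasoning(chat("Name three Tokyo landmarks, briefly."))` should separate the thinking trace from the final answer, provided --reasoning-parser qwen3 is enabled.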
Hardware & requirements

  • 4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
  • Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
  • TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408 — all ÷4.
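The divisibility claim in the last bullet is easy to check by hand. The sketch below only tests whether each quoted dimension divides evenly by a tensor-parallel size. It does not model how vLLM actually shards or replicates heads, so an "uneven" result is a warning sign, not proof that a configuration cannot run.

```python
# Dimensions quoted in the requirements list above.
MODEL_DIMS = {
    "attention_heads": 24,
    "kv_heads": 4,
    "hidden_size": 5120,
    "intermediate_size": 17408,
}


def uneven_dims(tp_size, dims=MODEL_DIMS):
    """Return the names of dimensions that do not divide evenly by tp_size."""
    return [name for name, value in dims.items() if value % tp_size != 0]


for tp in (1, 2, 4, 8):
    bad = uneven_dims(tp)
    status = "clean" if not bad else "uneven: " + ", ".join(bad)
    print(f"TP={tp}: {status}")
```

For TP=4 the list of uneven dimensions is empty, matching the bullet above. At TP=8 only the 4 KV heads fail to divide.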

Bare-metal (no container) also works: pip install vllm (≥0.21 introduced qwen3_5), CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same vllm serve flags.


Flags, and why

| flag | value | why |
|---|---|---|
| --quantization modelopt | required | checkpoint is NVFP4 (hf_quant_config.json); without it weights read as garbage. |
| --trust-remote-code | recommended | qwen3_5 multimodal config. |
| --tensor-parallel-size | 4 | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| --max-model-len | 65536 (≤ 262144) | hybrid attention keeps KV cheap, so long context is affordable. |
| --max-num-seqs | 8 (peak 24 @64K) | concurrent slots. See benchmarks for the throughput curve. |
| --kv-cache-dtype fp8 | recommended | ~2× KV capacity for more concurrency / longer context. |
| --gpu-memory-utilization | 0.85 | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| --reasoning-parser qwen3 | recommended | splits <think>…</think> into reasoning_content; answer in content. |
| --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' | recommended | turns on the MTP draft head. vLLM ≥0.22 auto-maps qwen3_5_mtp → mtp (harmless deprecation warning). SPEC_TOKENS=0 disables it. |
| --enable-auto-tool-choice --tool-call-parser qwen3_xml | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (ENABLE_TOOLS=0). |
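The memory arithmetic behind the --gpu-memory-utilization entry above can be sketched in a few lines. The 15.5 GiB usable-VRAM figure is an assumption, back-derived from the ~13.2 GiB target this README quotes at 0.85. The remainder is a rough upper bound, since activations and runtime buffers also come out of it.

```python
USABLE_GIB = 15.5    # assumption: usable VRAM on a 16 GB card, implied by ~13.2 GiB at 0.85
WEIGHTS_GIB = 7.2    # per-GPU weight footprint at TP=4 quoted in this README


def kv_budget_gib(gpu_memory_utilization, usable_gib=USABLE_GIB, weights_gib=WEIGHTS_GIB):
    """Rough GiB left per GPU after weights, given vLLM's memory target."""
    target = usable_gib * gpu_memory_utilization
    return target - weights_gib


for util in (0.80, 0.85, 0.90):
    print(f"util={util:.2f}: ~{kv_budget_gib(util):.1f} GiB for KV")
```

Under these assumptions 0.85 leaves roughly 6 GiB per GPU, which is the figure the table uses.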

Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20, top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).


Docker package (bundled)

compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up        # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                       # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                      # pure chat (no tool parser)
PORT=8001 ./run.sh up                           # serve on a different host port
./run.sh logs                                   # tail the server logs
./run.sh down                                   # stop the server

Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS, GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS, REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE. The model weights are mounted read-only (. → /model); the image carries only the runtime. shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
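To make the knob-to-flag relationship concrete, here is a hypothetical sketch of how an entrypoint could translate the env knobs into vllm serve arguments. The bundled entrypoint.sh is not reproduced here, so the function name, the defaults (taken from the raw docker run earlier in this README) and the subset of flags covered are all assumptions.

```python
import json
import os


def build_serve_args(env=None):
    """Translate a subset of the documented env knobs into a `vllm serve` argv list."""
    env = os.environ if env is None else env
    args = [
        "vllm", "serve", "/model",
        "--quantization", "modelopt",
        "--tensor-parallel-size", env.get("TP_SIZE", "4"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "65536"),
        "--max-num-seqs", env.get("MAX_NUM_SEQS", "8"),
        "--max-num-batched-tokens", env.get("MAX_NUM_BATCHED_TOKENS", "16384"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.85"),
        "--kv-cache-dtype", env.get("KV_CACHE_DTYPE", "fp8"),
        "--reasoning-parser", env.get("REASONING_PARSER", "qwen3"),
    ]
    spec_tokens = int(env.get("SPEC_TOKENS", "3"))
    if spec_tokens > 0:  # SPEC_TOKENS=0 disables MTP speculative decoding
        config = {"method": "qwen3_5_mtp", "num_speculative_tokens": spec_tokens}
        args += ["--speculative-config", json.dumps(config)]
    if env.get("ENABLE_TOOLS", "1") != "0":  # ENABLE_TOOLS=0 means pure chat
        args += [
            "--enable-auto-tool-choice",
            "--tool-call-parser", env.get("TOOL_CALL_PARSER", "qwen3_xml"),
        ]
    return args
```

For example, `build_serve_args({"MAX_MODEL_LEN": "262144", "SPEC_TOKENS": "0"})` yields a 256K configuration without the --speculative-config flag. PORT and CUDA_VISIBLE_DEVICES are left out because they belong to the container setup, not to the vllm serve command line.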

To build a self-contained image instead: uncomment the build: block in compose.yaml and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).


Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8, --gpu-memory-utilization 0.85.

64K context

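| Req | Aggregate | Per-req (t/s) | Req | Aggregate | Per-req (t/s) |
|---|---|---|---|---|---|
| 1 | 81.0 t/s | 81.0 | 14 | 669.9 t/s | 49.5 |
| 2 | 134.0 t/s | 67.0 | 16 | 720.2 t/s | 46.1 |
| 3 | 205.1 t/s | 71.6 | 18 | 764.7 t/s | 44.2 |
| 4 | 274.5 t/s | 72.5 | 20 | 798.7 t/s | 41.6 |
| 6 | 380.3 t/s | 65.2 | 22 | 835.0 t/s | 39.5 |
| 8 | 454.2 t/s | 58.9 | 24 | 879.5 t/s | 37.2 |
| 10 | 518.9 t/s | 53.7 | 28 | 859.7 t/s | 31.7 |
| 12 | 613.8 t/s | 52.6 | 32 | 736.8 t/s | 32.1 |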

256K context (1→16 req)

83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s (per-req 83 → 45). 256K tracks 64K almost exactly — the hybrid KV (16/64 full + 48/64 linear attention) stays cheap at length.

Takeaways: throughput peaks at ~880 tok/s with 24 concurrent requests (64K) and declines beyond that (859.7 tok/s at 28, 736.8 tok/s at 32). Long context is nearly free: 256K runs 16-way without OOM. For 256K use --max-model-len 262144 --max-num-seqs 8; a single-request 128K run (--max-num-seqs 1) reaches ~83.9 tok/s.
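The takeaways can be re-derived from the 64K numbers. The sketch below copies the aggregate column from the table above and computes two things: where throughput peaks, and how far each concurrency level falls short of perfect linear scaling from the single-request run. The "linear scaling" baseline is an illustrative yardstick of mine, not a metric the benchmark itself reports.

```python
# (concurrent requests, aggregate tok/s) copied from the 64K table above.
BENCH_64K = [
    (1, 81.0), (2, 134.0), (3, 205.1), (4, 274.5), (6, 380.3), (8, 454.2),
    (10, 518.9), (12, 613.8), (14, 669.9), (16, 720.2), (18, 764.7),
    (20, 798.7), (22, 835.0), (24, 879.5), (28, 859.7), (32, 736.8),
]


def peak(rows):
    """Return the (requests, aggregate) pair with the highest aggregate throughput."""
    return max(rows, key=lambda row: row[1])


def scaling_efficiency(rows):
    """Aggregate throughput as a fraction of n times the first (1-request) row."""
    base = rows[0][1]  # assumes the first row is the single-request run
    return {n: agg / (n * base) for n, agg in rows}


best_n, best_agg = peak(BENCH_64K)
print(f"peak: {best_agg} tok/s at {best_n} concurrent")
for n, eff in scaling_efficiency(BENCH_64K).items():
    print(f"{n:>2} req: {eff:.0%} of linear")
```

Efficiency falls steadily as concurrency rises, which is the usual trade between aggregate throughput and per-request speed. Past 24 requests the aggregate figure itself starts to drop.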


Rituals (gotchas)

  1. Kill zombie GPU procs — a failed/cancelled launch leaves workers in VRAM. List them with nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader, then kill -9 <Worker_TP* PIDs>.
  2. First launch is slow — torch.compile + Triton + NVFP4 warmup ≈2 min. Wait for Application startup complete / Uvicorn running on http://0.0.0.0:8000.
  3. gpu-memory-utilization must exceed real usage — clean start ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB, leaving ~6 GiB for KV. A "Free memory < desired…" error points to residual allocation from a previous run (see #1).
  4. Concurrent NCCL init can hang — bringing up two TP servers at once may spin one at NCCL init (GPUs stuck ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
  5. MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. n=3 is a good default here.
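For gotcha 1, picking the leftover worker PIDs out of the nvidia-smi listing can be scripted. The sketch below parses the CSV output of the query shown there and keeps processes whose name contains Worker_TP. The exact process names depend on the vLLM version, so the marker string and the sample names in the usage note are assumptions. The script only lists PIDs and leaves the kill to you.

```python
import subprocess


def worker_pids(csv_text, marker="Worker_TP"):
    """Extract PIDs whose process name contains `marker` from `pid, name` CSV lines."""
    pids = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid_text, _, name = line.partition(",")
        pid_text = pid_text.strip()
        if marker in name and pid_text.isdigit():
            pid = int(pid_text)
            if pid not in pids:  # a process may be listed once per GPU
                pids.append(pid)
    return pids


def list_worker_pids():
    """Run the nvidia-smi query from gotcha 1 and return leftover worker PIDs."""
    output = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return worker_pids(output)
```

Given illustrative output such as `12345, VLLM::Worker_TP0` followed by `999, python`, worker_pids returns only 12345. Review the list before passing anything to kill -9.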

OpenCode provider

// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true, "tool_call": true, "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}

What's inside

  • Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers. lm_head, conv/short-conv, routers and the MTP embedding are kept at higher precision (ignore list in config.json / hf_quant_config.json).
  • MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
  • Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.