Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP
NVFP4 (modelopt W4A4) quant of Huihui-Qwen3.6-27B-abliterated — a Qwen3.5-family
hybrid model (linear attention + periodic full attention) with a built-in MTP
(multi-token-prediction) head for speculative decoding. Multimodal-capable
(Qwen3_5ForConditionalGeneration, vision/video tokens) but served here as a text
generation / reasoning + tool-calling model. Fits 4× 16 GB Blackwell (SM120).
- ~7.2 GiB/GPU weights at TP=4 · 64K–262K context · <think> reasoning · XML tool-calls
- Single-stream ~81–83 tok/s (TP=4, MTP n=3); peak ~880 tok/s @ 24 concurrent (64K)
TL;DR — run it (no build required)
The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP
draft module, so you do not need to build anything.
# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test # waits for /v1/models
./run.sh bench # one-shot smoke test
Or the raw docker run (what run.sh/compose.yaml wrap):
docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
-p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
-e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
--entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
--served-model-name huihui-qwen36-27b-local \
--trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
--max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
--chat-template /model/chat_template.jinja \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--host 0.0.0.0 --port 8000
Smoke test:
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model":"huihui-qwen36-27b-local",
"messages":[{"role":"user","content":"Name three famous sights in Tokyo, briefly."}],
"max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .
Hardware & requirements
- 4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
- Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
- TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408, all divisible by 4.
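The divisibility claim in the list above is easy to check by hand, and the same check helps when considering a different TP size. The sketch below is a simplified illustration: the function name shard_sizes is made up here, and vLLM's own sharding rules are more nuanced than a plain modulo test.

```python
# Dimensions quoted in this README for the checkpoint.
DIMS = {"heads": 24, "kv_heads": 4, "hidden": 5120, "intermediate": 17408}


def shard_sizes(dims, tp_size):
    """Return per-GPU sizes, or raise if any dimension does not divide evenly."""
    uneven = [name for name, size in dims.items() if size % tp_size]
    if uneven:
        raise ValueError(f"TP={tp_size} does not divide: {', '.join(uneven)}")
    return {name: size // tp_size for name, size in dims.items()}


print(shard_sizes(DIMS, 4))
# {'heads': 6, 'kv_heads': 1, 'hidden': 1280, 'intermediate': 4352}
```

With tp_size=8 this simple check rejects the layout because the 4 KV heads no longer divide evenly.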
Bare-metal (no container) also works: pip install vllm (≥0.21 introduced qwen3_5),
CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same vllm serve flags.
Flags, and why
| flag | value | why |
|---|---|---|
| --quantization modelopt | required | checkpoint is NVFP4 (hf_quant_config.json); without it weights read as garbage. |
| --trust-remote-code | recommended | qwen3_5 multimodal config. |
| --tensor-parallel-size | 4 | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| --max-model-len | 65536 (≤ 262144) | hybrid attention keeps KV cheap: long context is affordable. |
| --max-num-seqs | 8 (peak 24 @64K) | concurrent slots. See benchmarks for the throughput curve. |
| --kv-cache-dtype fp8 | recommended | ~2× KV capacity for more concurrency / longer context. |
| --gpu-memory-utilization | 0.85 | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| --reasoning-parser qwen3 | recommended | splits <think>…</think> into reasoning_content; answer in content. |
| --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' | recommended | turns on the MTP draft head. vLLM ≥0.22 auto-maps qwen3_5_mtp → mtp (harmless deprecation warning). SPEC_TOKENS=0 disables it. |
| --enable-auto-tool-choice --tool-call-parser qwen3_xml | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (ENABLE_TOOLS=0). |
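The --gpu-memory-utilization row is simple arithmetic, sketched below. The 15.5 GiB of visible memory per card is an assumption, chosen only because it reproduces the ~13.2 GiB target quoted under Rituals. The helper name kv_budget_gib is illustrative, and the remainder also has to cover activations and other runtime overhead, so treat the result as an upper bound for KV.

```python
def kv_budget_gib(total_gib, gpu_mem_util, weights_gib):
    """GiB left under the vLLM memory target once the weights are loaded."""
    target = total_gib * gpu_mem_util
    if target <= weights_gib:
        raise ValueError("memory target is below the weight footprint")
    return target - weights_gib


# 15.5 GiB visible per card is an assumption, not a measured value.
print(f"{kv_budget_gib(15.5, 0.85, 7.2):.1f} GiB")  # 6.0 GiB
```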
Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20,
top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).
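A minimal Python client sketch for the sampling defaults and the reasoning/answer split, using only the standard library. It assumes the server from the docker run above is listening on localhost:8000; the helper names are illustrative. The reasoning_content field is the one this README describes for --reasoning-parser qwen3. A finish_reason of "length" with an empty answer usually means the thinking phase used up max_tokens.

```python
import json
import urllib.request

MODEL = "huihui-qwen36-27b-local"  # must match --served-model-name


def build_chat_payload(prompt, max_tokens=512):
    """Request body with the sampling defaults from generation_config.json."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.95,
    }


def chat(prompt, base_url="http://localhost:8000/v1", timeout=300):
    """POST one chat completion and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def split_reply(response):
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


def was_truncated(response):
    """True when generation stopped at max_tokens instead of finishing."""
    return response["choices"][0].get("finish_reason") == "length"
```

Typical use is split_reply(chat("your prompt")), retrying with a larger max_tokens when was_truncated returns True.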
Docker package (bundled)
compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the
official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up # pure chat (no tool parser)
PORT=8001 ./run.sh up # serve on a different host port
./run.sh logs                                    # tail
./run.sh down                                    # stop
Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS,
GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS,
REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE.
The model weights are mounted read-only (. → /model); the image carries only the runtime.
shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
To build a self-contained image instead: uncomment the build: block in compose.yaml
and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).
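To make the env-knob mapping concrete, here is a hypothetical sketch of how such variables could translate into vllm serve arguments. It is not the bundled entrypoint.sh: the function name, the subset of flags covered, and the defaults (taken from the raw docker run command above) are assumptions for illustration.

```python
import json


def build_serve_args(env):
    """Translate README env knobs into a vllm serve argument list (sketch)."""
    args = [
        "vllm", "serve", "/model",
        "--tensor-parallel-size", env.get("TP_SIZE", "4"),
        "--quantization", "modelopt",
        "--max-model-len", env.get("MAX_MODEL_LEN", "65536"),
        "--max-num-seqs", env.get("MAX_NUM_SEQS", "8"),
        "--max-num-batched-tokens", env.get("MAX_NUM_BATCHED_TOKENS", "16384"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.85"),
        "--kv-cache-dtype", env.get("KV_CACHE_DTYPE", "fp8"),
        "--reasoning-parser", env.get("REASONING_PARSER", "qwen3"),
    ]
    spec_tokens = int(env.get("SPEC_TOKENS", "3"))
    if spec_tokens > 0:  # SPEC_TOKENS=0 disables MTP speculative decoding
        spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": spec_tokens}
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    if env.get("ENABLE_TOOLS", "1") != "0":  # ENABLE_TOOLS=0 means pure chat
        args += ["--enable-auto-tool-choice",
                 "--tool-call-parser", env.get("TOOL_CALL_PARSER", "qwen3_xml")]
    return args
```

Passing os.environ works because it supports .get like a dict; passing a plain dict makes the mapping easy to inspect without starting a server.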
Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)
Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8,
--gpu-memory-utilization 0.85.
64K context
| Req | Aggregate | Per-req (t/s) | Req | Aggregate | Per-req (t/s) |
|---|---|---|---|---|---|
| 1 | 81.0 t/s | 81.0 | 14 | 669.9 t/s | 49.5 |
| 2 | 134.0 t/s | 67.0 | 16 | 720.2 t/s | 46.1 |
| 3 | 205.1 t/s | 71.6 | 18 | 764.7 t/s | 44.2 |
| 4 | 274.5 t/s | 72.5 | 20 | 798.7 t/s | 41.6 |
| 6 | 380.3 t/s | 65.2 | 22 | 835.0 t/s | 39.5 |
| 8 | 454.2 t/s | 58.9 | 24 | 879.5 t/s | 37.2 |
| 10 | 518.9 t/s | 53.7 | 28 | 859.7 t/s | 31.7 |
| 12 | 613.8 t/s | 52.6 | 32 | 736.8 t/s | 32.1 |
256K context (1→16 req)
83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s
(per-req 83 → 45). 256K tracks 64K almost exactly — the hybrid KV (16/64 full +
48/64 linear attention) stays cheap at length.
Takeaways: peak throughput is ~880 tok/s @ 24 concurrent (64K); it dips to ~860 at 28
and falls to ~737 at 32. Long context is nearly free: 256K runs 16-way without OOM.
For 256K use --max-model-len 262144 --max-num-seqs 8. A 128K single-request run
(--max-model-len 131072 --max-num-seqs 1) reaches ~83.9 tok/s.
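The peak and the shape of the curve can be read straight off the 64K table. The sketch below copies the aggregate column and compares it with perfect linear scaling of the single-request rate. The "scaling efficiency" metric is added here for illustration and is not part of the original benchmark.

```python
# (concurrent requests, aggregate tok/s) copied from the 64K table above.
BENCH_64K = [
    (1, 81.0), (2, 134.0), (3, 205.1), (4, 274.5), (6, 380.3), (8, 454.2),
    (10, 518.9), (12, 613.8), (14, 669.9), (16, 720.2), (18, 764.7),
    (20, 798.7), (22, 835.0), (24, 879.5), (28, 859.7), (32, 736.8),
]


def peak(points):
    """Return the (concurrency, aggregate tok/s) pair with the highest throughput."""
    return max(points, key=lambda point: point[1])


def scaling_efficiency(points):
    """Aggregate throughput divided by n times the single-request rate."""
    single = dict(points)[1]
    return {n: agg / (n * single) for n, agg in points}


print(peak(BENCH_64K))                                # (24, 879.5)
print(round(scaling_efficiency(BENCH_64K)[24], 2))    # 0.45
```

At the peak, 24 concurrent requests deliver about 45% of what 24 independent single streams would, which is the price paid for roughly 11× the aggregate throughput of one stream.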
Rituals (gotchas)
- Kill zombie GPU procs: a failed/cancelled launch leaves workers in VRAM: nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader → kill -9 <Worker_TP* PIDs>.
- First launch is slow: torch.compile + Triton + NVFP4 warmup ≈2 min. Wait for "Application startup complete" / "Uvicorn running on http://0.0.0.0:8000".
- gpu-memory-utilization must exceed real usage: clean start ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB, leaving ~6 GiB for KV. "Free memory < desired…" = residual allocation from a previous run (see the first item).
- Concurrent NCCL init can hang: bringing up two TP servers at once may spin one at NCCL init (GPUs stuck ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
- MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. n=3 is a good default here.
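The MTP trade-off in the last item can be put in numbers with a standard back-of-the-envelope model. Each verification step emits one token from the target model plus the longest accepted prefix of draft tokens, so the expected tokens per step is 1 plus the sum of the prefix acceptance probabilities. The acceptance rates below are made-up illustrative values, not measurements of this checkpoint.

```python
def expected_tokens_per_step(acceptance_rates):
    """Expected tokens emitted per verification step.

    acceptance_rates[k] is the probability that draft token k+1 is accepted,
    given that all earlier draft tokens in the step were accepted. The target
    model always contributes one token, so the result is at least 1.0.
    """
    expected = 1.0
    prefix_prob = 1.0
    for rate in acceptance_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("acceptance rates must lie in [0, 1]")
        prefix_prob *= rate       # probability the whole prefix was accepted
        expected += prefix_prob
    return expected


# Hypothetical rates that fall with draft depth (illustrative only).
print(round(expected_tokens_per_step([0.8, 0.6, 0.4]), 3))       # 2.472
print(round(expected_tokens_per_step([0.8, 0.6, 0.4, 0.2]), 3))  # 2.51
```

Under these assumed rates a fourth draft token adds under 0.04 tokens per step while still costing a draft pass, which is the kind of diminishing return that makes a moderate n attractive.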
OpenCode provider
// ~/.config/opencode/opencode.jsonc
{
"provider": {
"local-vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local vLLM",
"options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
"models": {
"huihui-qwen36-27b-local": {
"name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
"reasoning": true, "tool_call": true, "temperature": true,
"limit": { "context": 65536, "output": 8192 }
}
}
}
},
"model": "local-vllm/huihui-qwen36-27b-local",
"small_model": "local-vllm/huihui-qwen36-27b-local"
}
What's inside
- Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers; lm_head, conv/short-conv, routers and the MTP embedding kept higher precision (ignore list in config.json / hf_quant_config.json).
- MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
- Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.