Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

NVFP4 (modelopt W4A4) quant of Huihui-Qwen3.6-27B-abliterated, a Qwen3.5-family hybrid model (linear attention + periodic full attention) with a built-in MTP (multi-token-prediction) head for speculative decoding. The checkpoint is multimodal-capable (Qwen3_5ForConditionalGeneration, vision/video tokens) but is served here as a text-generation, reasoning, and tool-calling model. Fits on 4× 16 GB Blackwell GPUs (SM120).

~7.2 GiB/GPU weights at TP=4 · 64K–262K context · <think> reasoning · XML tool-calls
Single-stream ~81–83 tok/s (TP=4, MTP n=3); peak ~880 tok/s @ 24 concurrent (64K)

TL;DR — run it (no build required)

The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP draft module, so you do not need to build anything.

# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test          # waits for /v1/models
./run.sh bench         # one-shot smoke test

Or the raw docker run (what run.sh/compose.yaml wrap):

docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
    --served-model-name huihui-qwen36-27b-local \
    --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
    --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --chat-template /model/chat_template.jinja \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --host 0.0.0.0 --port 8000

Smoke test:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"huihui-qwen36-27b-local",
  "messages":[{"role":"user","content":"Name three famous sights in Tokyo, briefly."}],
  "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .
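
If the server has only just been started, the request above can fail to connect, because the first launch spends about two minutes in warmup. A minimal readiness poll is sketched below, assuming the default port 8000 and that curl is installed. The function name wait_for_vllm, the 5-second interval, and the 300-second default timeout are illustrative choices, not part of the bundled scripts; ./run.sh test already covers this when you use the wrapper.

```bash
# Poll the OpenAI-compatible /v1/models endpoint until it answers
# or the timeout (in seconds) expires. Hypothetical helper, not part of run.sh.
wait_for_vllm() {
  url="${1:-http://localhost:8000/v1/models}"
  timeout_s="${2:-300}"
  waited=0
  until curl -sf -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "vLLM not ready after ${timeout_s}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "ready after ~${waited}s"
}

# Usage (not executed here): wait_for_vllm && ./run.sh bench
```

Passing a URL and timeout as arguments keeps the helper usable when PORT is overridden, for example wait_for_vllm http://localhost:8001/v1/models 600.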

Hardware & requirements

4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408 — all ÷4.
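
The divisibility claim is easy to re-check if you want to try a different TP size. The sketch below uses only the four dimensions quoted above; the function name check_tp_divisible is illustrative, and passing this check is a necessary condition for even sharding, not a guarantee that vLLM will accept a given TP size.

```bash
# Check that every sharded dimension divides evenly by the tensor-parallel size.
# Usage: check_tp_divisible TP DIM [DIM...]
check_tp_divisible() {
  tp="$1"; shift
  for dim in "$@"; do
    if [ $((dim % tp)) -ne 0 ]; then
      echo "dim $dim is not divisible by TP=$tp"
      return 1
    fi
  done
  echo "all dims divisible by TP=$tp"
}

# heads, KV heads, hidden size, intermediate size (values from this README)
check_tp_divisible 4 24 4 5120 17408
```

With TP=8 the same call stops at the KV-head count (4), which is one reason TP=4 is the natural split for this checkpoint.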

Bare-metal (no container) also works: pip install vllm (≥0.21 introduced qwen3_5), CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same vllm serve flags.

Flags, and why

| Flag | Value | Why |
| --- | --- | --- |
| `--quantization modelopt` | required | checkpoint is NVFP4 (`hf_quant_config.json`); without it the weights are read as garbage. |
| `--trust-remote-code` | recommended | `qwen3_5` multimodal config. |
| `--tensor-parallel-size` | `4` | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| `--max-model-len` | `65536` (≤ `262144`) | hybrid attention keeps KV cheap, so long context is affordable. |
| `--max-num-seqs` | `8` (peak `24` @64K) | concurrent slots. See benchmarks for the throughput curve. |
| `--kv-cache-dtype fp8` | recommended | ~2× KV capacity for more concurrency / longer context. |
| `--gpu-memory-utilization` | `0.85` | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| `--reasoning-parser qwen3` | recommended | splits `<think>…</think>` into `reasoning_content`; answer in `content`. |
| `--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'` | recommended | turns on the MTP draft head. vLLM ≥0.22 auto-maps `qwen3_5_mtp → mtp` (harmless deprecation warning). `SPEC_TOKENS=0` disables it. |
| `--enable-auto-tool-choice --tool-call-parser qwen3_xml` | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (`ENABLE_TOOLS=0`). |
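
The ~6 GiB figure in the --gpu-memory-utilization row follows from simple arithmetic. The sketch below reproduces it, assuming a 16 GB card exposes roughly 15.5 GiB to CUDA (inferred from the ~13.2 GiB target at 0.85 quoted in the gotchas, not measured here). The function name is illustrative, and the result is an upper bound, since activations and CUDA graphs also come out of the same budget.

```bash
# kv_budget_gib TOTAL_GIB UTIL WEIGHTS_GIB
# Prints the GiB left after weights once vLLM's utilization cap is applied.
kv_budget_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

# Assumed 15.5 GiB usable, 0.85 utilization, 7.2 GiB of weights per GPU.
kv_budget_gib 15.5 0.85 7.2
```

Re-running it with your own usable-memory figure (for example from nvidia-smi) shows how much headroom a higher utilization value would actually buy.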

Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20, top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).

Docker package (bundled)

compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up        # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                       # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                      # pure chat (no tool parser)
PORT=8001 ./run.sh up                           # serve on a different host port
./run.sh logs                                   # tail the logs
./run.sh down                                   # stop

Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS, GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS, REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE. The model weights are mounted read-only (. → /model); the image carries only the runtime. shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
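
For readers who want to see how env knobs of this kind typically turn into vllm serve flags, here is a minimal sketch. It is not a copy of the bundled entrypoint.sh, which may differ; the function name build_vllm_args is hypothetical, and the defaults are taken from the docker run command earlier in this README.

```bash
# Hypothetical env-to-flags mapping; the real entrypoint.sh may be structured differently.
build_vllm_args() {
  set -- serve /model \
    --quantization modelopt \
    --tensor-parallel-size "${TP_SIZE:-4}" \
    --max-model-len "${MAX_MODEL_LEN:-65536}" \
    --max-num-seqs "${MAX_NUM_SEQS:-8}" \
    --max-num-batched-tokens "${MAX_NUM_BATCHED_TOKENS:-16384}" \
    --gpu-memory-utilization "${GPU_MEM_UTIL:-0.85}" \
    --kv-cache-dtype "${KV_CACHE_DTYPE:-fp8}" \
    --reasoning-parser "${REASONING_PARSER:-qwen3}"

  # SPEC_TOKENS=0 omits the speculative config, which disables MTP.
  if [ "${SPEC_TOKENS:-3}" -gt 0 ]; then
    set -- "$@" --speculative-config \
      "{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":${SPEC_TOKENS:-3}}"
  fi

  # ENABLE_TOOLS=0 omits the tool-call parser flags for pure chat.
  if [ "${ENABLE_TOOLS:-1}" != "0" ]; then
    set -- "$@" --enable-auto-tool-choice \
      --tool-call-parser "${TOOL_CALL_PARSER:-qwen3_xml}"
  fi

  # One argument per line, so the result is easy to inspect or diff.
  printf '%s\n' "$@"
}

build_vllm_args
```

Printing the arguments instead of executing them makes it easy to compare, say, MAX_MODEL_LEN=262144 build_vllm_args against the defaults before starting a container.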

To build a self-contained image instead: uncomment the build: block in compose.yaml and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).

Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8, --gpu-memory-utilization 0.85.

64K context

| Req | Aggregate (tok/s) | Per-req (tok/s) | Req | Aggregate (tok/s) | Per-req (tok/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 81.0 | 81.0 | 14 | 669.9 | 49.5 |
| 2 | 134.0 | 67.0 | 16 | 720.2 | 46.1 |
| 3 | 205.1 | 71.6 | 18 | 764.7 | 44.2 |
| 4 | 274.5 | 72.5 | 20 | 798.7 | 41.6 |
| 6 | 380.3 | 65.2 | 22 | 835.0 | 39.5 |
| 8 | 454.2 | 58.9 | 24 | 879.5 | 37.2 |
| 10 | 518.9 | 53.7 | 28 | 859.7 | 31.7 |
| 12 | 613.8 | 52.6 | 32 | 736.8 | 32.1 |

256K context (1→16 req)

Aggregate throughput, ten points from 1 to 16 concurrent requests: 83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s (per-req 83 → 45). 256K tracks 64K almost exactly: the hybrid KV (16/64 full + 48/64 linear attention) stays cheap at length.
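
To get a feel for why only the full-attention layers matter for KV growth, the sketch below estimates per-GPU KV size for one sequence. It takes the 16 full-attention layers and 4 KV heads (1 per GPU at TP=4) from this README and 1 byte per element from the fp8 KV cache. The head dimension of 256 is an assumption that is not stated anywhere in this document, so check config.json before relying on the numbers. The fixed-size linear-attention state is ignored, and the function name is illustrative.

```bash
# kv_mib_per_seq TOKENS FULL_LAYERS KV_HEADS_PER_GPU HEAD_DIM BYTES_PER_ELEM
# Prints the per-GPU KV cache size in MiB for one sequence of TOKENS tokens.
kv_mib_per_seq() {
  tokens="$1"; full_layers="$2"; kv_heads_per_gpu="$3"; head_dim="$4"; bytes_per_elem="$5"
  # Factor 2 covers the separate K and V tensors.
  bytes=$((2 * full_layers * kv_heads_per_gpu * head_dim * bytes_per_elem * tokens))
  echo $((bytes / 1024 / 1024))
}

# head_dim=256 is an ASSUMPTION; all other inputs come from this README.
kv_mib_per_seq 65536 16 1 256 1     # one full 64K sequence
kv_mib_per_seq 262144 16 1 256 1    # one full 256K sequence
kv_mib_per_seq 262144 64 1 256 1    # hypothetical: all 64 layers full attention
```

Under these assumptions, a sequence that actually fills 256K would need about 2 GiB of KV per GPU, versus about 8 GiB if all 64 layers used full attention.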

Takeaways: throughput peaks at ~880 tok/s with 24 concurrent requests (64K), dips slightly at 28 (~860 tok/s), and falls off at 32 (~737 tok/s). Long context is nearly free: 256K runs 16-way without OOM. For 256K, use --max-model-len 262144 --max-num-seqs 8. A single 128K request (--max-model-len 131072 --max-num-seqs 1) runs at ~83.9 tok/s.
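
The peak and the shape of the curve can be recomputed from the 64K numbers. The sketch below copies the aggregate column from the table above and derives a scaling efficiency, defined here (as an illustrative metric, not one the benchmark itself reports) as aggregate throughput divided by concurrency times the single-stream rate. The function names are illustrative.

```bash
# Concurrency and aggregate tok/s, copied from the 64K benchmark table.
bench_64k() {
  cat <<'EOF'
1 81.0
2 134.0
3 205.1
4 274.5
6 380.3
8 454.2
10 518.9
12 613.8
14 669.9
16 720.2
18 764.7
20 798.7
22 835.0
24 879.5
28 859.7
32 736.8
EOF
}

# Print the concurrency with the highest aggregate throughput.
peak_concurrency() {
  awk 'BEGIN { best = -1 } $2 > best { best = $2; n = $1 } END { print n, best }'
}

# Efficiency = aggregate / (concurrency * single-stream rate), as a percentage.
scaling_efficiency() {
  awk 'NR == 1 { base = $2 } { printf "%d %.0f%%\n", $1, 100 * $2 / ($1 * base) }'
}

bench_64k | peak_concurrency
bench_64k | scaling_efficiency
```

At the 24-request peak the efficiency works out to about 45%, meaning each added stream still contributes throughput but at under half the single-stream rate.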

Rituals (gotchas)

1. Kill zombie GPU procs: a failed or cancelled launch leaves workers in VRAM. Find them with nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader, then kill -9 <Worker_TP* PIDs>.
2. First launch is slow: torch.compile + Triton + NVFP4 warmup takes ≈2 min. Wait for Application startup complete / Uvicorn running on http://0.0.0.0:8000.
3. gpu-memory-utilization must exceed real usage: a clean start uses ≈7.2 GiB/GPU; with 0.85, vLLM targets ~13.2 GiB, leaving ~6 GiB for KV. A Free memory < desired… message means a residual allocation from a previous run (see #1).
4. Concurrent NCCL init can hang: bringing up two TP servers at once may leave one spinning at NCCL init (GPUs stuck at ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
5. MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. n=3 is a good default here.
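
The trade-off between acceptance and draft depth noted above can be made concrete with the standard speculative-decoding estimate: if each draft token is accepted independently with probability p, a step with n draft tokens yields 1 + p + p² + … + pⁿ tokens on average. The sketch below evaluates that sum. The acceptance rate of 0.7 is a hypothetical input, not a measurement from this model, and the constant-p assumption is optimistic here, since reusing one MTP layer tends to lower acceptance at deeper positions.

```bash
# expected_tokens_per_step P N
# Expected tokens emitted per decoding step with N draft tokens, assuming each
# draft token is accepted independently with probability P (a simplification).
expected_tokens_per_step() {
  awk -v p="$1" -v n="$2" 'BEGIN {
    e = 0; term = 1
    for (i = 0; i <= n; i++) { e += term; term *= p }   # 1 + p + ... + p^n
    printf "%.2f\n", e
  }'
}

# p=0.7 is a HYPOTHETICAL acceptance rate chosen for illustration only.
for n in 1 2 3 4 5; do
  printf 'n=%s -> %s tokens/step\n' "$n" "$(expected_tokens_per_step 0.7 "$n")"
done
```

The gain from each extra draft token shrinks geometrically, which is consistent with a small n such as 3 being a sensible default; measure acceptance on your own workload before raising it.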

OpenCode provider

// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true, "tool_call": true, "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}

What's inside

Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers; lm_head, conv/short-conv, routers and the MTP embedding kept higher precision (ignore list in config.json / hf_quant_config.json).
MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.
