Qwen3.6-27B-MTP-pi-reasoning · ROCmFP4 (STRIX)

A ROCmFP4 4-bit quant of bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF, built for AMD Strix Halo (Ryzen AI MAX+, gfx1151) with multi-token-prediction (MTP) self-speculative decoding. Quantized from the BF16 source with the Q4_0_ROCMFP4_STRIX quality preset.

⚠️ This is not a stock GGUF. The q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types only load in the charlie12345/rocmfp4-llama fork of llama.cpp (branch mtp-rocmfp4-strix). It will not load in upstream llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Hugging Face's auto-parser may mislabel the file as "F16"; it is really a ~4.4 bpw 4-bit model.

What is ROCmFP4?

ROCmFP4 is an AMD-focused 4-bit GGUF weight format from the fork above. It pairs a Codebook10 4-bit value table with finite unsigned E4M3 half-scales, in two layouts:

  • q4_0_rocmfp4: dual-scale (~4.5 bpw), used on precision-sensitive tensors (attention projections here).
  • q4_0_rocmfp4_fast: single-scale (~4.25 bpw), used on the bulk of the network for speed.

The _STRIX preset is a tensor-aware mix. It protects what matters for coherence while keeping the body small and fast on the AMD ROCm/HIP and Vulkan paths.

This build

Format ROCmFP4 4-bit (Q4_0_ROCMFP4_STRIX)
Effective precision 4.38 bpw
File size ~14.0 GiB (14,986,109,152 bytes)
Architecture qwen35 hybrid attention + SSM, 65 blocks (64 plus 1 MTP)
Context up to 262,144 tokens
MTP nextn draft head carried through (self-speculative)
imatrix none (plain preset quant)
Vision not included, language weights only (see note below)

Tensor recipe (as quantized):

Tensor group Type
token_embd q6_K
attention K/V projections q4_0_rocmfp4 (dual-scale)
FFN, output (lm-head), MTP eh_proj, rest q4_0_rocmfp4_fast (single-scale)
norms, SSM params (ssm_*), MTP norms f32

Quick start (llama.cpp fork)

Build the fork first (see its README; Strix uses scripts/build-strix-rocmfp4-mtp.sh), then:

HSA_OVERRIDE_GFX_VERSION=11.5.1 \
GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
./build-strix-rocmfp4/bin/llama-server \
  -m Qwen3.6-27B-MTP-pi-reasoning-ROCmFP4-STRIX.gguf \
  -dev ROCm0 -ngl 999 \
  -c 262144 -b 512 -ub 512 -fa on \
  -ctk q4_0 -ctv q4_0 \
  --spec-type draft-mtp \
  --spec-draft-type-k q4_0 --spec-draft-type-v q4_0 \
  --spec-draft-n-max 4 --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --jinja

Drop the --spec-* flags to run without MTP, or the --reasoning/--jinja flags for plain completion.

Local performance

Measured on a Framework Desktop / Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 128 GB unified memory, ROCm backend, fork build 4795079b0. Numbers are local and depend on driver, context, and prompt.

Raw throughput, llama-bench (-ngl 999 -fa 1 -r 3), no speculative decoding:

test t/s
prefill (pp512) 389.3 ± 2.4
decode (tg128) 13.8 ± 0.03

llama-bench cannot exercise MTP, so the decode figure above is raw single-token decode. MTP self-speculative decoding (the intended way to run this model) roughly doubles it, with the gain depending heavily on how predictable the output is. Measured on llama-server over 256-token generations:

Workload Decode tok/s Draft acceptance
Code generation ~35.5 ~74% (190/257)
Short reasoning ~34.9 ~72% (88/123)
Free-form reasoning ~27.1 ~51% (170/335)

So expect roughly 27 to 36 tok/s with MTP (about 2x to 2.6x over raw decode), trending higher on code and structured output.

Reproduce

# source: the BF16 GGUF from bytkim
./build-strix-rocmfp4/bin/llama-quantize \
  Qwen3.6-27B-MTP-pi-reasoning-bf16.gguf \
  Qwen3.6-27B-MTP-pi-reasoning-ROCmFP4-STRIX.gguf \
  Q4_0_ROCMFP4_STRIX

Vision

The source model is vision-capable via the Qwen3.6 mmproj-F16.gguf sidecar, but this repo ships language weights only. To enable images, pair this GGUF with the matching mmproj from the source repo and pass --mmproj to the fork.

Lineage & credits

License

Apache 2.0, inherited from the upstream Qwen3.6-27B base model. You may use, modify, and redistribute this quant and its derivatives subject to that license.

Downloads last month
-
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for PatrickScully/Qwen3.6-27B-MTP-pi-reasoning-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(1)
this model