Instructions to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF", dtype="auto")

llama-cpp-python

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF",
	filename="MiniMax-M3-uncensored-heretic-aggressive-Q2_K.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			# Text-only prompt: these GGUFs do not currently support image input.
			"content": "Explain what a GGUF file is in one sentence."
		}
	]
)

Local Apps

llama.cpp

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Use Docker

docker model run hf.co/llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

vLLM

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "Explain what a GGUF file is in one sentence."
			}
		]
	}'

SGLang

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "Explain what a GGUF file is in one sentence."
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "Explain what a GGUF file is in one sentence."
			}
		]
	}'

Ollama
How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Ollama:
```
ollama run hf.co/llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
```

Unsloth Studio

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF to start chatting

Pi

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Docker Model Runner:
```
docker model run hf.co/llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M
```

Lemonade

How to use llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF-Q4_K_M

List all available models

lemonade list

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

🔒 This is a premium gated paid-access model

Access is granted manually after purchase through Ko-fi.

➡️ Purchase access on Ko-fi

After purchasing, include your Hugging Face username in the Ko-fi purchase message, then click “Agree and send request to access repo” on this Hugging Face page. I will verify the username and manually approve access.

Please allow up to 24 hours for manual approval.

92% fewer refusals (8/100 Uncensored vs 98/100 Original) while preserving model quality (0.0258 KL divergence).

❤️ Support My Work

Creating these models takes significant time, work, and compute. If you find them useful, consider supporting me:

| Platform | Link | What you get |
| --- | --- | --- |
| 🎉 Patreon | Monthly support | Priority model requests |
| ☕ Ko-fi | One-time tip | My eternal gratitude |

Your help will motivate me and will go toward improving my workflow and covering fees for storage and compute. It may even help with uncensoring bigger models on rented cloud GPUs.

Read before purchase/download:

These GGUF files require a runtime/backend with MiniMax-M3 GGUF architecture support. They are not guaranteed to work in LM Studio, Ollama, KoboldCpp, Jan, or standard/mainline llama.cpp builds unless those runtimes support the minimax-m3 GGUF architecture.

These GGUFs are text/chat-focused.

Vision/multimodal support is not currently available in these GGUFs.

Sparse attention is not currently supported and may fall back to dense attention.

LM Studio / mainline llama.cpp may fail with unknown model architecture: minimax-m3.

Important GGUF compatibility notice

MiniMax-M3 GGUF support is currently a work in progress in llama.cpp.

These GGUF files are provided as the best currently available conversion based on the present upstream MiniMax-M3 GGUF work. However, the current GGUF implementation has known limitations:

Text generation / plain chat is the main supported use case.
Vision / multimodal support is not currently available in these GGUFs.
Sparse attention is not currently supported and may fall back to dense attention, which can affect speed and memory use.
Tool calling may or may not work depending on your runtime, chat template, parser, and exact setup.
Compatibility may vary across llama.cpp, Ollama, KoboldCpp, and other GGUF runtimes.

I plan to redo these GGUF files once upstream MiniMax-M3 GGUF support improves.

Please purchase only if you understand these current limitations. This product is for access to the available GGUF files as-is, not a guarantee that every MiniMax-M3 feature is supported in GGUF today.

GGUF quantization of llmfan46/MiniMax-M3-uncensored-heretic-aggressive.

This is a decensored version of MiniMaxAI/MiniMax-M3, made using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method.

Abliteration parameters

| Parameter | Value |
| --- | --- |
| start_layer_index | 14 |
| end_layer_index | 51 |
| preserve_good_behavior_weight | 0.0847 |
| steer_bad_behavior_weight | 0.0002 |
| overcorrect_relative_weight | 1.1741 |
| neighbor_count | 15 |

Targeted components

attn.o_proj

Performance

| Metric | This model | Original model (MiniMaxAI/MiniMax-M3) |
| --- | --- | --- |
| KL divergence | 0.0258 | 0 (by definition) |
| Refusals | ✅ 8/100 | ❌ 98/100 |

Fewer refusals indicate fewer content restrictions, while a lower KL divergence indicates behavior closer to the original model's baseline. A higher refusal count means more rejections, objections, pushback, lecturing, censorship, softening, and deflection.
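The headline figures can be checked with a little arithmetic. The sketch below recomputes the relative drop in refusals from the table and shows how a KL divergence between two next-token distributions is computed in general. The toy distributions are made-up values for illustration, and the direction KL(original || modified) is an assumption; the exact procedure Heretic uses to measure it is not described here.

```python
import math

def relative_reduction(before: int, after: int) -> float:
    """Fractional drop from `before` to `after` (here: refusal counts)."""
    return (before - after) / before

def kl_divergence(p: list, q: list) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Refusal counts from the table: 98/100 for the original, 8/100 for this model.
print(f"{relative_reduction(98, 8):.1%}")  # 91.8%, which rounds to the quoted 92%

# Toy next-token distributions (illustrative values only, not measured data).
original = [0.70, 0.20, 0.10]
modified = [0.65, 0.23, 0.12]
print(kl_divergence(original, modified))  # small positive number
print(kl_divergence(original, original))  # 0.0: a model never diverges from itself
```

The last line is why the table lists the original model's KL divergence as "0 (by definition)".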

Quantizations

| Filename | Quant | Description |
| --- | --- | --- |
| MiniMax-M3-uncensored-heretic-aggressive-Q5_K_M.gguf | Q5_K_M | Good balance |
| MiniMax-M3-uncensored-heretic-aggressive-Q5_K_S.gguf | Q5_K_S | Smaller Q5 |
| MiniMax-M3-uncensored-heretic-aggressive-Q4_K_M.gguf | Q4_K_M | Good for limited VRAM |
| MiniMax-M3-uncensored-heretic-aggressive-Q4_K_S.gguf | Q4_K_S | Smaller Q4 |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_L.gguf | Q3_K_L | Low VRAM, decent quality |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_M.gguf | Q3_K_M | Low VRAM, smaller |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_S.gguf | Q3_K_S | Very low VRAM |
| MiniMax-M3-uncensored-heretic-aggressive-Q2_K.gguf | Q2_K | Very, very low VRAM; only use if you have no other options |
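The filenames in the table follow a single pattern, so a quant tag is enough to derive the file to download. The helper below is a minimal sketch of that mapping; the function names are hypothetical and exist only for this illustration.

```python
# Quant tags in table order, from highest to lowest precision.
QUANTS = ["Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q2_K"]
PREFIX = "MiniMax-M3-uncensored-heretic-aggressive"

def gguf_filename(quant: str) -> str:
    """Return the GGUF filename for a quant tag listed in the table."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"{PREFIX}-{quant}.gguf"

def smaller_alternatives(quant: str) -> list:
    """Quants further down the table, to try if `quant` does not fit in memory."""
    return QUANTS[QUANTS.index(quant) + 1:]

print(gguf_filename("Q4_K_M"))
print(smaller_alternatives("Q3_K_M"))
```

The returned name can be passed as the `filename` argument when loading a file with llama-cpp-python's `Llama.from_pretrained`.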

Usage

These files currently work with a llama.cpp build that includes PR #24523; a screenshot is provided as proof.

Compatibility notice

These GGUF files require a runtime/backend with MiniMax-M3 GGUF architecture support.

Tested working on my setup:

MiniMax-M3-compatible llama.cpp build / build-minimax
llama-server
llama-ui
Tool calling with SearXNG/web search

Expected / likely compatible:

Unsloth Studio, if using a current version with MiniMax-M3 support

Not currently supported / not guaranteed:

LM Studio
Older or mainline llama.cpp builds without MiniMax-M3 support
Ollama, KoboldCpp, Jan, or other third-party frontends unless their bundled backend supports the minimax-m3 GGUF architecture

If you see an error such as:

unknown model architecture: 'minimax-m3'

failed to load model

This does not mean the GGUF file is broken. It only means that your runtime/backend does not support MiniMax-M3 GGUF yet.
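When wrapping a runtime in a script, it can help to tell this error apart from other load failures. The sketch below scans captured log text for the message quoted above; the regular expression is based only on that wording, and the function name is hypothetical.

```python
import re
from typing import Optional

# Matches the loader error quoted above and captures the architecture name.
ARCH_ERROR = re.compile(r"unknown model architecture: '?([\w-]+)'?")

def unsupported_architecture(log_text: str) -> Optional[str]:
    """Return the architecture name if the log reports it as unknown, else None."""
    match = ARCH_ERROR.search(log_text)
    return match.group(1) if match else None

log = "unknown model architecture: 'minimax-m3'\nfailed to load model"
arch = unsupported_architecture(log)
if arch:
    print(f"This runtime lacks support for {arch!r}; the GGUF file itself is likely fine.")
```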

Right now, MiniMax-M3 GGUFs should be compatible with a llama.cpp build that includes PR #24523, and with Unsloth Studio.

To reduce the chance of unforeseen issues caused by outdated versions, use the latest releases of the following:

transformers (very important): 5.12.0 or 5.12.1; it won't work otherwise
CUDA (very important): 13.0, 13.1, 13.2, or 13.3; do not use anything lower
PyTorch (very important): torch 2.12.0+cu132 or 2.12.1+cu132, and torchvision 0.27.0+cu132 or 0.27.1+cu132
Triton: 3.6.0 or 3.7.0
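The Python package constraints above can be checked before loading anything. This is a minimal sketch: the accepted versions are copied from the text, the helper names are hypothetical, and the CUDA toolkit version is not covered because it is not a Python package.

```python
from importlib.metadata import PackageNotFoundError, version

# Accepted versions as listed above (local build tags such as +cu132 removed).
REQUIRED = {
    "transformers": {"5.12.0", "5.12.1"},
    "torch": {"2.12.0", "2.12.1"},
    "torchvision": {"0.27.0", "0.27.1"},
    "triton": {"3.6.0", "3.7.0"},
}

def base_version(ver: str) -> str:
    """Strip a local build tag such as '+cu132' from a version string."""
    return ver.split("+", 1)[0]

def check_versions(installed: dict) -> dict:
    """Map each required package to True if its installed version is accepted."""
    return {
        name: base_version(installed.get(name, "")) in accepted
        for name, accepted in REQUIRED.items()
    }

def installed_versions(names) -> dict:
    """Look up installed versions, skipping packages that are not present."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass
    return found

print(check_versions(installed_versions(REQUIRED)))
```

A package that is missing is reported as `False`, the same as one with a version outside the accepted set.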

Note:

The sections below describe the original Safetensors format of the MiniMax-M3 model. The current GGUF conversion offered here is text/chat-focused and does not currently provide vision/multimodal support or MiniMax Sparse Attention support.

MiniMax-M3 is a native multimodal model with 1M context. It has ~428B parameters and ~23B activated parameters.

Highlights:

Native Multimodality: M3 undergoes mixed-modality training from the very first step, enabling deeper semantic fusion across text, image, and video.
Context Scaling via Sparse Attention: M3 introduces MiniMax Sparse Attention (MSA) to improve long context efficiency. M3 delivers 9× prefill and 15× decode speedups compared to M2 at 1M context, reducing per-token compute to 1/20.
Coding & Cowork Capability: M3 achieves frontier-level performance across long-horizon agentic benchmarks, excelling in both coding and cowork.

MiniMax Sparse Attention (MSA)

M3 is powered by MiniMax Sparse Attention (MSA), a high-performance sparse attention operator designed for million-token contexts. Compared with GQA, MSA dramatically reduces the attention compute and memory footprint while preserving model quality.

GQA vs MSA Efficiency Comparison

📄 Read the technical report: arXiv:2606.13392 · Hugging Face Papers

How to Use

M3 supports three reasoning modes through the thinking parameter:

enabled — Reasoning is always enabled.
adaptive — M3 automatically determines when additional reasoning is beneficial.
disabled — Reasoning is disabled to minimize latency and maximize throughput.
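The three modes above can be validated before a request is sent. The sketch below is illustrative only: it assumes the serving stack reads a top-level `thinking` field in the request body, which is not confirmed here, so check your runtime's documentation for where the setting actually belongs. The helper name is hypothetical.

```python
THINKING_MODES = ("enabled", "adaptive", "disabled")

def with_thinking(payload: dict, mode: str = "adaptive") -> dict:
    """Return a copy of `payload` with a validated thinking mode attached.

    ASSUMPTION: the server expects a top-level "thinking" key. This placement
    is a guess made for illustration, not a documented API.
    """
    if mode not in THINKING_MODES:
        raise ValueError(f"thinking must be one of {THINKING_MODES}, got {mode!r}")
    return {**payload, "thinking": mode}

request = {"messages": [{"role": "user", "content": "Plan a three-step refactor."}]}
print(with_thinking(request, "disabled"))
```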

Local Deployment

Download the model:

hf download MiniMaxAI/MiniMax-M3 --local-dir MiniMax-M3

We recommend the following inference frameworks (listed alphabetically) to serve the model:

Inference Parameters

We recommend the following parameters for best performance: temperature=1.0, top_p=0.95, top_k=40.
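As a sketch of how these settings might be applied, the code below builds an OpenAI-style chat request carrying the recommended values and defines a helper to send it to a local OpenAI-compatible server. The base URL `http://localhost:8080/v1` and the assumption that the server accepts `top_k` as an extra field (it is not part of the standard OpenAI schema) should both be checked against your runtime.

```python
import json
import urllib.request

# Recommended sampling settings from the paragraph above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}

def build_chat_request(prompt: str, model: str = "local") -> dict:
    """Build an OpenAI-style chat completion body with the recommended sampling."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }

def send(body: dict, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST the body to an OpenAI-compatible server and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

body = build_chat_request("Summarize what GGUF is in one sentence.")
print(json.dumps(body, indent=2))
# With a server running, the reply text would be read like this:
# reply = send(body)
# print(reply["choices"][0]["message"]["content"])
```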

GGUF

Model size: 426B params
Architecture: minimax-m3
Hardware compatibility: 2-bit, 3-bit, 4-bit, 5-bit

Model tree for llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF

Base model: MiniMaxAI/MiniMax-M3
Quantized (26): this model

Collection including llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF

MiniMax-M3 Uncensored Heretic (4 items)

Paper for llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF

MiniMax Sparse Attention (arXiv:2606.13392)