This is a premium gated paid-access model
Access is granted manually after purchase through Ko-fi.
Purchase access on Ko-fi
After purchasing, include your Hugging Face username in the Ko-fi purchase message, then click "Agree and send request to access repo" on this Hugging Face page. I will verify the username and manually approve access.
Please allow up to 24 hours for manual approval.
92% fewer refusals (8/100 Uncensored vs 98/100 Original) while preserving model quality (0.0258 KL divergence).
Support My Work
Creating these models takes significant time, work and compute. If you find them useful consider supporting me:
| Platform | Link | What you get |
|---|---|---|
| Patreon | Monthly support | Priority model requests |
| Ko-fi | One-time tip | My eternal gratitude |
Your help motivates me and goes toward improving my workflow and covering fees for storage and compute. It may even help with uncensoring bigger models on rented cloud GPUs.
Read before purchase/download:
These GGUF files require a runtime/backend with MiniMax-M3 GGUF architecture support. They are not guaranteed to work in LM Studio, Ollama, KoboldCpp, Jan, or standard/mainline llama.cpp builds unless those runtimes support the minimax-m3 GGUF architecture.
These GGUFs are text/chat-focused.
Vision/multimodal support is not currently available in these GGUFs.
Sparse attention is not currently supported and may fall back to dense attention.
LM Studio / mainline llama.cpp may fail with unknown model architecture: minimax-m3.
Important GGUF compatibility notice
MiniMax-M3 GGUF support is currently a work in progress in llama.cpp.
These GGUF files are provided as the best currently available conversion based on the present upstream MiniMax-M3 GGUF work. However, the current GGUF implementation has known limitations:
- Text generation / plain chat is the main supported use case.
- Vision / multimodal support is not currently available in these GGUFs.
- Sparse attention is not currently supported and may fall back to dense attention, which can affect speed and memory use.
- Tool calling may or may not work depending on your runtime, chat template, parser, and exact setup.
- Compatibility may vary across llama.cpp, Ollama, KoboldCpp, and other GGUF runtimes.
When upstream MiniMax-M3 GGUF support improves, I plan to redo the GGUF files.
Please purchase only if you understand these current limitations. This product is for access to the available GGUF files as-is, not a guarantee that every MiniMax-M3 feature is supported in GGUF today.
GGUF quantization of llmfan46/MiniMax-M3-uncensored-heretic-aggressive.
This is a decensored version of MiniMaxAI/MiniMax-M3, made using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method.
Abliteration parameters
| Parameter | Value |
|---|---|
| start_layer_index | 14 |
| end_layer_index | 51 |
| preserve_good_behavior_weight | 0.0847 |
| steer_bad_behavior_weight | 0.0002 |
| overcorrect_relative_weight | 1.1741 |
| neighbor_count | 15 |
Targeted components
- attn.o_proj
Performance
| Metric | This model | Original model (MiniMaxAI/MiniMax-M3) |
|---|---|---|
| KL divergence | 0.0258 | 0 (by definition) |
| Refusals | 8/100 | 98/100 |
Lower refusals indicate fewer content restrictions, while lower KL divergence indicates behavior closer to the original model's baseline. Higher refusals cause more rejections, objections, pushbacks, lecturing, censorship, softening and deflections.
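The headline "92% fewer refusals" figure can be reproduced from the two refusal counts in the table. The sketch below does that arithmetic and also illustrates why the original model's KL divergence is 0 by definition. The `kl_divergence` helper and the toy distributions are hypothetical illustrations; the source does not say exactly how Heretic computes its KL figure.

```python
import math


def relative_reduction(baseline: int, current: int) -> float:
    """Return the percentage drop from baseline to current."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Hypothetical helper for illustration only; terms with p_i == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original_refusals = 98    # out of 100 prompts, from the table
uncensored_refusals = 8   # out of 100 prompts, from the table

reduction = relative_reduction(original_refusals, uncensored_refusals)
print(f"{reduction:.1f}% fewer refusals")  # prints "91.8% fewer refusals"

# Toy next-token distribution (made up): a model compared with itself gives 0.
toy = [0.7, 0.2, 0.1]
print(kl_divergence(toy, toy))  # prints 0.0
```

The 91.8% result rounds to the 92% quoted at the top of this card.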
Quantizations
| Filename | Quant | Description |
|---|---|---|
| MiniMax-M3-uncensored-heretic-aggressive-Q5_K_M.gguf | Q5_K_M | Good balance |
| MiniMax-M3-uncensored-heretic-aggressive-Q5_K_S.gguf | Q5_K_S | Smaller Q5 |
| MiniMax-M3-uncensored-heretic-aggressive-Q4_K_M.gguf | Q4_K_M | Good for limited VRAM |
| MiniMax-M3-uncensored-heretic-aggressive-Q4_K_S.gguf | Q4_K_S | Smaller Q4 |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_L.gguf | Q3_K_L | Low VRAM, decent quality |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_M.gguf | Q3_K_M | Low VRAM, smaller |
| MiniMax-M3-uncensored-heretic-aggressive-Q3_K_S.gguf | Q3_K_S | Very Low VRAM |
| MiniMax-M3-uncensored-heretic-aggressive-Q2_K.gguf | Q2_K | Very Very Low VRAM, only use if you have no other options |
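As a minimal sketch, the filenames in the table can be built from the quant name and fetched with `huggingface_hub.hf_hub_download`. This assumes each quant is published as the single file listed above and that your Hugging Face account has already been approved for this gated repository; the helper names `gguf_filename` and `download_quant` are hypothetical.

```python
REPO_ID = "llmfan46/MiniMax-M3-uncensored-heretic-aggressive-compressed-quants-pack-GGUF"

# Quant names exactly as listed in the table above.
QUANTS = ["Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q2_K"]


def gguf_filename(quant: str) -> str:
    """Build the GGUF filename for a quant listed in the table."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"MiniMax-M3-uncensored-heretic-aggressive-{quant}.gguf"


def download_quant(quant: str, token=None) -> str:
    """Download one quant and return its local path.

    Requires `pip install huggingface_hub` and approved access to the gated repo.
    """
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant), token=token)


print(gguf_filename("Q4_K_M"))
```

Calling `download_quant("Q4_K_M")` would then place the file in the local Hugging Face cache. If no `token` is passed, the library falls back to the token saved by `hf auth login`, when one exists.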
Usage
Right now this works with a llama.cpp build that includes PR #24523; see proof (DOWNLOAD SCREENSHOT PROOF HERE):
Compatibility notice
These GGUF files require a runtime/backend with MiniMax-M3 GGUF architecture support.
Tested working on my setup:
- MiniMax-M3-compatible llama.cpp build / `build-minimax`
- `llama-server`
- `llama-ui`
- Tool calling with SearXNG/web search
Expected / likely compatible:
- Unsloth Studio, if using a current version with MiniMax-M3 support
Not currently supported / not guaranteed:
- LM Studio
- Older or mainline llama.cpp builds without MiniMax-M3 support
- Ollama, KoboldCpp, Jan, or other third-party frontends unless their bundled backend supports the `minimax-m3` GGUF architecture
If you see an error such as:
unknown model architecture: 'minimax-m3'
or
failed to load model
This does not mean the GGUF file is broken. It means that your runtime/backend does not support MiniMax-M3 GGUF yet.
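If you launch a runtime from a script, a small check on its log output can tell an unsupported backend apart from other problems. This is only a sketch: it matches the two messages quoted above as plain substrings, and the function name and return labels are hypothetical. Real runtimes may word their errors differently.

```python
def diagnose_load_error(log_text: str) -> str:
    """Classify runtime log output using the error messages quoted in this card."""
    lowered = log_text.lower()
    if "unknown model architecture" in lowered and "minimax-m3" in lowered:
        # The backend does not know the minimax-m3 architecture; the file is fine.
        return "unsupported-architecture"
    if "failed to load model" in lowered:
        # Generic failure; often the same root cause, but not guaranteed.
        return "load-failed"
    return "unknown"


sample = "error: unknown model architecture: 'minimax-m3'"
print(diagnose_load_error(sample))  # prints "unsupported-architecture"
```

An `unsupported-architecture` result points to switching to a MiniMax-M3-compatible build rather than re-downloading the GGUF.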
Right now MiniMax-M3 GGUFs should be compatible with a MiniMax-M3-compatible llama.cpp build that includes PR #24523, and with Unsloth Studio.
To reduce the chance of unforeseen issues caused by outdated versions, be sure to use the latest versions of the following:
- transformers: 5.12.0 or 5.12.1 (very important; it won't work with any other version)
- CUDA: 13.0, 13.1, 13.2, or 13.3 (very important; do not use anything lower)
- PyTorch: torch 2.12.0+cu132 or 2.12.1+cu132, and torchvision 0.27.0+cu132 or 0.27.1+cu132 (very important)
- Triton: 3.6.0 or 3.7.0
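The Python package versions above can be compared against what is installed with a short script. This is a sketch under assumptions: the `RECOMMENDED` table copies the versions named in this card, the helper names are hypothetical, and it relies on `importlib.metadata` reporting the same version strings (including the `+cu132` suffix) that pip shows. The CUDA toolkit version is not a Python package and has to be checked separately.

```python
from importlib import metadata

# Versions copied from the recommendations in this card.
RECOMMENDED = {
    "transformers": {"5.12.0", "5.12.1"},
    "torch": {"2.12.0+cu132", "2.12.1+cu132"},
    "torchvision": {"0.27.0+cu132", "0.27.1+cu132"},
    "triton": {"3.6.0", "3.7.0"},
}


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed):
    """Return the packages whose installed version is not a recommended one."""
    return {
        name: installed.get(name)
        for name, allowed in RECOMMENDED.items()
        if installed.get(name) not in allowed
    }


if __name__ == "__main__":
    print(mismatches(installed_versions(RECOMMENDED)))
```

An empty dictionary means every listed package matches; otherwise each entry shows the version found, or `None` if the package is missing.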
Note:
The sections below describe the original Safetensors format of the MiniMax-M3 model. The current GGUF conversion offered here is text/chat-focused and does not currently provide vision/multimodal support or MiniMax Sparse Attention support.
MiniMax-M3 is a native multimodal model with 1M context. It has ~428B parameters and ~23B activated parameters.
Highlights:
- Native Multimodality: M3 undergoes mixed-modality training from the very first step, enabling deeper semantic fusion across text, image, and video.
- Context Scaling via Sparse Attention: M3 introduces MiniMax Sparse Attention (MSA) to improve long context efficiency. M3 delivers 9× prefill and 15× decode speedups compared to M2 at 1M context, reducing per-token compute to 1/20.
- Coding & Cowork Capability: M3 achieves frontier-level performance across long-horizon agentic benchmarks, excelling in both coding and cowork.
MiniMax Sparse Attention (MSA)
M3 is powered by MiniMax Sparse Attention (MSA), a high-performance sparse attention operator designed for million-token contexts. Compared with GQA, MSA dramatically reduces the attention compute and memory footprint while preserving model quality.
Read the technical report: arXiv:2606.13392 · Hugging Face Papers
How to Use
M3 supports three reasoning modes through the thinking parameter:
- enabled: Reasoning is always enabled.
- adaptive: M3 automatically determines when additional reasoning is beneficial.
- disabled: Reasoning is disabled to minimize latency and maximize throughput.
Local Deployment
Download the model:
hf download MiniMaxAI/MiniMax-M3 --local-dir MiniMax-M3
We recommend the following inference frameworks (listed alphabetically) to serve the model:
SGLang - see SGLang cookbook.
vLLM - see vLLM recipes.
Transformers - see Transformers docs.
Inference Parameters
We recommend the following parameters for best performance: temperature=1.0, top_p=0.95, top_k=40.
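As a rough sketch, these sampling parameters can be placed in an OpenAI-style chat request sent to a local server such as `llama-server`. Several things here are assumptions rather than facts from this card: the base URL `http://localhost:8080/v1`, the placeholder model name, the helper names, and the server accepting `top_k` (which is not part of the OpenAI schema). The `thinking` parameter is left out because the card does not say how to pass it.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, model: str = "minimax-m3") -> dict:
    """Build a chat request using the recommended sampling parameters.

    The default model name is a placeholder; many local servers ignore it.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # assumption: accepted as an extension by the server
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST the payload to an OpenAI-compatible /chat/completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload("Summarize what a GGUF file is in one sentence.")
print(json.dumps(payload, indent=2))
```

With a compatible server running, `send_chat(payload)` would return the parsed JSON response. Keeping payload construction separate from the network call makes the parameters easy to inspect before anything is sent.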
Contact Us
Contact us at model@minimax.io.
