Instructions to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ssweens/DeepSeek-V4-Flash-GGUF-YMMV",
	filename="deepseek-ai__DeepSeek-V4-Flash-IQ1_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
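
The call above returns an OpenAI-style completion dictionary rather than a plain string. A minimal sketch of pulling the reply text out of it, assuming the usual `choices[0].message.content` layout; the helper name `reply_text` and the sample dictionary are illustrative and not part of llama-cpp-python:

```python
def reply_text(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    # content can be None (for example on a pure tool-call reply), so fall back to "".
    return message.get("content") or ""


# Hand-written sample in the same shape, so the helper can be tried without loading the model.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(reply_text(sample))
```

With the real model, passing the result of `llm.create_chat_completion(...)` to `reply_text` should yield the same kind of string.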

llama.cpp

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
# Run inference directly in the terminal:
llama cli -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
# Run inference directly in the terminal:
llama cli -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
# Run inference directly in the terminal:
./llama-cli -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Use Docker

docker model run hf.co/ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

vLLM

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ssweens/DeepSeek-V4-Flash-GGUF-YMMV"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ssweens/DeepSeek-V4-Flash-GGUF-YMMV",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
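
The same request can be made from Python with only the standard library. This is a sketch under the assumptions of the curl call above (server on `localhost:8000`, model name as served); the function names `build_chat_request` and `ask` are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: same host and port as the curl example
MODEL = "ssweens/DeepSeek-V4-Flash-GGUF-YMMV"


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With a server running: print(ask("What is the capital of France?"))
```

Pointing `BASE_URL` at `http://localhost:8080/v1` should work the same way against a local llama.cpp server, since both expose an OpenAI-compatible API.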

Ollama
How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Ollama:
```
ollama run hf.co/ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
```

Unsloth Studio

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ssweens/DeepSeek-V4-Flash-GGUF-YMMV to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ssweens/DeepSeek-V4-Flash-GGUF-YMMV to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ssweens/DeepSeek-V4-Flash-GGUF-YMMV to start chatting

Pi

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M"
        }
      ]
    }
  }
}
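
If you switch between quants often, the JSON above can be generated rather than edited by hand. A small sketch, assuming the file contains only this one provider block; the helper names are hypothetical, and `write_config` overwrites whatever is already at the target path:

```python
import json
from pathlib import Path


def pi_provider_config(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return the provider block shown above for a given model id."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


def write_config(path: Path, config: dict) -> None:
    """Write the config as indented JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


config = pi_provider_config("ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M")
print(json.dumps(config, indent=2))
# To install it: write_config(Path.home() / ".pi" / "agent" / "models.json", config)
```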

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Run Hermes

hermes

Docker Model Runner
How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Docker Model Runner:
```
docker model run hf.co/ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M
```

Lemonade

How to use ssweens/DeepSeek-V4-Flash-GGUF-YMMV with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ssweens/DeepSeek-V4-Flash-GGUF-YMMV:IQ1_M

Run and chat with the model

lemonade run user.DeepSeek-V4-Flash-GGUF-YMMV-IQ1_M

List all available models

lemonade list

🌤️ **Update: 2026/05/09 - Core functionality is good; now dogfooding, testing, and refining.**
The focus is now on real, long-running sessions. Uploaded a fresh GGUF. See the punch list below for progress. Bug reports with agent diagnoses are welcome.

🧪 Experimental llama.cpp fork and GGUFs for DeepSeek-V4-Flash

A stopgap for experimenting with DeepSeek-V4-Flash locally on CUDA and ROCm while the tooling ecosystem catches up. Expect rough edges. The priority is validating text and coding coherence.

GGUF files for deepseek-ai/DeepSeek-V4-Flash.

⚠️ You need the custom fork

These GGUFs require a DeepSeek-V4-capable fork of llama.cpp. Vanilla llama.cpp doesn't support this architecture yet.

llama.cpp fork: ssweens/llama.cpp-deepseek-v4
Backends: Tested on CPU, CUDA, ROCm and Vulkan.
Compatibility: Also compatible with antirez's GGUFs (antirez/deepseek-v4-gguf)

Example:

llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja --threads 3 -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
-m /mnt/supmodels/gguf/deepseek-ai__DeepSeek-V4-Flash/deepseek-ai__DeepSeek-V4-Flash-Q4_K_M.gguf -c 65536 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0 \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--reasoning on \
--reasoning-budget 1024 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now."
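
In the example, `-ts 4,4,3` gives relative proportions for splitting the model across the three listed devices, not absolute sizes. A small sketch of that arithmetic, assuming the split is simply proportional to the numbers given; the helper name `split_fractions` is illustrative:

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Turn a '-ts' style string such as '4,4,3' into per-device fractions."""
    weights = [float(part) for part in tensor_split.split(",")]
    total = sum(weights)
    if total <= 0:
        raise ValueError("tensor split weights must sum to a positive number")
    return [w / total for w in weights]


devices = ["CUDA0", "CUDA1", "CUDA2"]
for device, fraction in zip(devices, split_fractions("4,4,3")):
    print(f"{device}: {fraction:.1%} of the offloaded model")
```

Under that assumption the first two devices each take 4/11 of the model and the third takes 3/11, which is a reasonable starting point when the third card has less memory.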

Performance

Basic Coding coherence (humaneval_instruct, n=30, thinking "high" capped at 1024)

| Model | pass@1 |
| --- | --- |
| IQ1_M | 1.000±0.000 |
| IQ2_XXS | 1.000±0.000 |
| IQ2_XXS (Antirez) | 1.000±0.000 |
| BF16ish | 1.000±0.000 |

Deeper Coding (LiveCodeBench sample, n=30, thinking "high" capped at 1024)

| Model | Bench | pass@1 | Easy | Medium | Hard |
| --- | --- | --- | --- | --- | --- |
| IQ2_XXS | `v5` | 76.7% | 100% | 60% | 70% |
| IQ2_XXS | `v6 only` | 73.3% | 100% | 70% | 50% |
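
With n=30, each problem is worth about 3.3 percentage points, so these scores move in coarse steps. A minimal sketch of how a pass@1 figure and its standard error can be computed from per-problem results. It uses the plain binomial formula, which is an assumption and may differ slightly from what the evaluation harness reports, and the pass/fail lists are made up to match the totals rather than taken from the actual runs:

```python
import math


def pass_at_1(results: list[bool]) -> tuple[float, float]:
    """Return (pass rate, binomial standard error) for one attempt per problem."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    p = sum(results) / n
    return p, math.sqrt(p * (1.0 - p) / n)


# Illustrative outcomes: 30 of 30 passing, and 23 of 30 passing.
all_pass = [True] * 30
some_fail = [True] * 23 + [False] * 7

for label, results in (("30/30", all_pass), ("23/30", some_fail)):
    rate, stderr = pass_at_1(results)
    print(f"{label}: pass@1 = {rate:.3f} ± {stderr:.3f}")
```

A perfect score has zero spread, which is consistent with the ±0.000 in the HumanEval table, and 23 of 30 corresponds to the 76.7% reported for `v5`.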

Speed (llama-benchy, defaults, thinking "high" capped at 1024)

Note: models up to IQ2_XS were run on CUDA, pipeline-parallel across a mix of consumer RTX GPUs.
IQ2_antirez and larger were run on CUDA+ROCm, pipeline-parallel with a Strix Halo in the mix.

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp2048 | 358.44 ± 2.05 | | 5714.56 ± 32.61 | 5713.91 ± 32.61 | 5714.56 ± 32.61 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 30.62 ± 0.28 | 31.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp32768 | 249.63 ± 0.79 | | 131244.94 ± 413.90 | 131244.26 ± 413.90 | 131244.94 ± 413.90 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 24.54 ± 0.23 | 25.00 ± 0.00 | | | |
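
The est_ppt column appears to track prompt length divided by prompt-processing throughput. A rough sketch of that relationship, useful for estimating wait times at other prompt sizes; the function names are illustrative, and the exact formula llama-benchy uses is an assumption:

```python
def prefill_ms(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimated prompt-processing time in milliseconds."""
    return prompt_tokens / tokens_per_second * 1000.0


def generation_ms(new_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to generate new_tokens at a steady decode rate."""
    return new_tokens / tokens_per_second * 1000.0


# Throughput figures copied from the IQ2_XXS rows above.
print(f"pp2048:  {prefill_ms(2048, 358.44):.0f} ms")
print(f"pp32768: {prefill_ms(32768, 249.63):.0f} ms")
# Hypothetical 1024-token reply at the short-context decode rate.
print(f"1024 new tokens: {generation_ms(1024, 30.62) / 1000:.1f} s")
```

Both prefill estimates land within a fraction of a percent of the reported est_ppt values, so the simple ratio is a usable first approximation for this setup.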

ROCm Only

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | pp2048 | 66.96 ± 1.38 | | 30640.40 ± 637.11 | 30597.24 ± 637.11 | 30640.40 ± 637.11 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | tg32 | 8.72 ± 0.09 | 9.00 ± 0.00 | | | |

Punch List

| Status | Feature |
| --- | --- |
| X | Simple chat |
| X | Basic quants |
| X | iMatrix quants |
| X | Chat template |
| X | Tool calling |
| X | Decent context |
| X | Pipeline parallelism |
| ? | Tensor parallelism |
| X | Prompt caching |
| X | CPU |
| X | CUDA |
| X | ROCm |
| ? | Vulkan |
| X | Cross-platform GPU |
| X | antirez/ds4 compat |
| X | Prefill optimization |
| ?? | DSv4 Pro compat |
| N/A | MTP support (believe this to be for training purposes only) |

Original model

deepseek-ai/DeepSeek-V4-Flash

Thanks

antirez — llama.cpp fork for Metal and CUDA in llama.cpp-deepseek-v4-flash and DS4
ml-explore/mlx-lm #1192 — MLX DSV4
DeepSeek — open inference code and the technical report
nisparks et al - some early implementation efforts and discussion
llama.cpp — the project that makes local LLM inference possible

Downloads last month: 2,307

GGUF

Model size: 284B params
Architecture: deepseek4
Hardware compatibility: 1-bit, 2-bit, 3-bit, 16-bit
