🌤️ **Update: 2026/05/09 - Core functionality good - dogfooding/testing/refining.**
Focused on real, long-running sessions now. Uploaded fresh GGUF. See punch list below for progress. Bugs w/ agent diagnoses welcome.
🧪 Experimental llama.cpp fork and GGUFs for DeepSeek-V4-Flash
A stopgap for experimenting with DeepSeek-V4-Flash locally on CUDA and ROCm while the tooling ecosystem catches up. Expect rough edges. The priority is validating text and coding coherence.
GGUF files for deepseek-ai/DeepSeek-V4-Flash.
⚠️ You need the custom fork
These GGUFs require a DeepSeek-V4-capable fork of llama.cpp. Vanilla llama.cpp doesn't support this architecture yet.
- llama.cpp fork: ssweens/llama.cpp-deepseek-v4
- Backends: Tested on CPU, CUDA, ROCm and Vulkan.
- Compatibility: Also compatible with antirez's GGUFs at antirez/deepseek-v4-gguf
Example:
llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja --threads 3 -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
  -m /mnt/supmodels/gguf/deepseek-ai__DeepSeek-V4-Flash/deepseek-ai__DeepSeek-V4-Flash-Q4_K_M.gguf -c 65536 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0 \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  --reasoning on \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."
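Once the server is running, it can be queried over the OpenAI-compatible HTTP API that upstream llama-server exposes. The following is a minimal sketch, not taken from the fork's documentation: it assumes the fork keeps upstream's `/v1/chat/completions` route and default port 8080, and `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the fork keeps upstream llama-server's default address and route.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt, max_tokens=2048, base_url=BASE_URL):
    """Build (but do not send) a chat-completions request for one user prompt."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server from the command above to be running):
# print(ask("Write a Python function that reverses a string."))
```

The request is built separately from the network call so the payload can be inspected or logged before anything is sent. The generous timeout is a guess for long reasoning runs, not a measured value.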
Performance
Basic Coding coherence (humaneval_instruct, n=30, thinking "high" capped at 1024)
| Model | pass@1 |
|---|---|
| IQ1_M | 1.000±0.000 |
| IQ2_XXS | 1.000±0.000 |
| IQ2_XXS (Antirez) | 1.000±0.000 |
| BF16ish | 1.000±0.000 |
Deeper Coding (LiveCodeBench sample, n=30, thinking "high" capped at 1024)
| Model | Bench | pass@1 | Easy | Medium | Hard |
|---|---|---|---|---|---|
| IQ2_XXS | v5 | 76.7% | 100% | 60% | 70% |
| IQ2_XXS | v6 only | 73.3% | 100% | 70% | 50% |
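The LiveCodeBench pass@1 figures can be cross-checked against the per-difficulty columns. This is a sketch of that arithmetic only: it assumes the 30 sampled problems split evenly into 10 per difficulty tier, which the tables do not state, and `pass_at_1` is a hypothetical helper. The binomial standard error is one common choice and may not be the formula the benchmark harness uses.

```python
import math


def pass_at_1(passed, total):
    """Return (pass rate, binomial standard error) for one attempt per problem."""
    rate = passed / total
    stderr = math.sqrt(rate * (1.0 - rate) / total)
    return rate, stderr


# Assumption: 10 problems per difficulty tier (not stated in the tables above).
PER_TIER = 10
v5_passed = round(PER_TIER * (1.00 + 0.60 + 0.70))  # Easy + Medium + Hard
v6_passed = round(PER_TIER * (1.00 + 0.70 + 0.50))

for name, passed in [("v5", v5_passed), ("v6 only", v6_passed)]:
    rate, stderr = pass_at_1(passed, 30)
    print(f"{name}: {passed}/30 = {rate:.1%} (stderr {stderr:.3f})")
```

Under that assumption the tiers sum to 23/30 and 22/30, which matches the reported 76.7% and 73.3%. A perfect 30/30 run gives a standard error of zero, consistent with the 1.000±0.000 entries in the HumanEval table.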
Speed (llama-benchy, defaults, thinking "high" capped at 1024)
Note: models up to IQ2_XS were run on CUDA only, pipeline-parallel across a mix of consumer RTX cards.
IQ2_antirez and bigger were run on CUDA+ROCm, pipeline-parallel with a Strix Halo in the mix.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp2048 | 358.44 ± 2.05 | | 5714.56 ± 32.61 | 5713.91 ± 32.61 | 5714.56 ± 32.61 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 30.62 ± 0.28 | 31.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp32768 | 249.63 ± 0.79 | | 131244.94 ± 413.90 | 131244.26 ± 413.90 | 131244.94 ± 413.90 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 24.54 ± 0.23 | 25.00 ± 0.00 | | | |
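The prompt-processing rows are internally consistent: the millisecond figures are roughly the prompt length divided by the measured throughput. A small sketch of that arithmetic follows. `prefill_ms` is a hypothetical helper, and the exact definition llama-benchy uses for each latency column is not stated here.

```python
def prefill_ms(prompt_tokens, tokens_per_second):
    """Estimated time to process a prompt, in milliseconds."""
    return prompt_tokens / tokens_per_second * 1000.0


# (prompt tokens, reported pp t/s, reported prefill latency in ms) from the table above
rows = [
    (2048, 358.44, 5713.91),
    (32768, 249.63, 131244.26),
]

for tokens, tps, reported in rows:
    estimate = prefill_ms(tokens, tps)
    print(f"pp{tokens}: {estimate:.0f} ms estimated vs {reported:.0f} ms reported")
```

The same relation gives a quick way to budget long prompts: at the pp32768 throughput, a full 32K-token prompt takes a little over two minutes before the first generated token.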
ROCm Only
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | pp2048 | 66.96 ± 1.38 | | 30640.40 ± 637.11 | 30597.24 ± 637.11 | 30640.40 ± 637.11 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | tg32 | 8.72 ± 0.09 | 9.00 ± 0.00 | | | |
Punch List
| Status | Feature |
|---|---|
| X | Simple chat |
| X | Basic quants |
| X | iMatrix quants |
| X | Chat template |
| X | Tool calling |
| X | Decent context |
| X | Pipeline parallelism |
| ? | Tensor parallelism |
| X | Prompt caching |
| X | CPU |
| X | CUDA |
| X | ROCm |
| ? | Vulkan |
| X | Cross-platform GPU |
| X | antirez/ds4 compat |
| X | Prefill optimization |
| ?? | DSv4 Pro compat |
| N/A | MTP support (believe this to be for training purposes only) |
Original model
deepseek-ai/DeepSeek-V4-Flash
Thanks
- antirez — llama.cpp fork for Metal and CUDA in llama.cpp-deepseek-v4-flash and DS4
- ml-explore/mlx-lm #1192 — MLX DSV4
- DeepSeek — open inference code and the technical report
- nisparks et al. - some early implementation efforts and discussion
- llama.cpp — the project that makes local LLM inference possible