Libraries

How to use yigagilbert/stepaudio2-mini-luganda-english-s2st with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="yigagilbert/stepaudio2-mini-luganda-english-s2st", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("yigagilbert/stepaudio2-mini-luganda-english-s2st", trust_remote_code=True, dtype="auto")


vLLM

How to use yigagilbert/stepaudio2-mini-luganda-english-s2st with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yigagilbert/stepaudio2-mini-luganda-english-s2st"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yigagilbert/stepaudio2-mini-luganda-english-s2st",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
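
The same request can be built from Python with only the standard library. This is a minimal sketch: the helper name `build_chat_request` is hypothetical, the URL and payload mirror the curl call above, and whether this speech model gives a useful answer to a text-only chat prompt is not confirmed by this page.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "yigagilbert/stepaudio2-mini-luganda-english-s2st",
    "What is the capital of France?",
)

# Sending requires a running server, so it is left commented out here:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["choices"][0]["message"]["content"])
```

For the SGLang server, the same helper would be pointed at port 30000 instead of 8000.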

SGLang

How to use yigagilbert/stepaudio2-mini-luganda-english-s2st with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "yigagilbert/stepaudio2-mini-luganda-english-s2st" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yigagilbert/stepaudio2-mini-luganda-english-s2st",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "yigagilbert/stepaudio2-mini-luganda-english-s2st" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yigagilbert/stepaudio2-mini-luganda-english-s2st",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use yigagilbert/stepaudio2-mini-luganda-english-s2st with Docker Model Runner:
```
docker model run hf.co/yigagilbert/stepaudio2-mini-luganda-english-s2st
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Step-Audio 2 Mini Luganda-to-English S2ST

This repository contains a fully merged model that translates Luganda speech input into English speech output. It was created by merging the LoRA adapter yigagilbert/stepaudio2-mini-luganda-english-s2st-lora into stepfun-ai/Step-Audio-2-mini.

The separate adapter-only repository should remain available for users who prefer PEFT loading or want the smaller adapter artifact. This repository is intended for simpler deployment and inference where loading a single model repo is preferable.
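
To illustrate what merging a LoRA adapter means, here is a minimal sketch on a toy linear layer, assuming the standard LoRA formulation in which the merged weight is W + (alpha / rank) · B · A. All shapes, values, and function names are hypothetical, and this is not the code in scripts/push_full_model_to_hub.py.

```python
# Toy LoRA merge in pure Python. Values and names are illustrative only.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return the folded weight W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]      # shape (rank, in_features) = (1, 3)
B = [[2.0], [4.0]]          # shape (out_features, rank) = (2, 1)
ALPHA, RANK = 2.0, 1

x = [[1.0], [2.0], [3.0]]   # input as a column vector

# Adapter-loaded forward pass: W x + scale * B (A x)
scale = ALPHA / RANK
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged forward pass: (W + scale * B A) x
merged = matmul(merge_lora(W, A, B, ALPHA, RANK), x)

print(unmerged, merged)
```

Both forward passes give the same output on this toy example, which is why a merged repository can be loaded as a single model without the adapter machinery.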

Intended Use

Research and development for Luganda-to-English speech translation. Validate outputs with native speakers before production use.

Source Model and Adapter

Base model: stepfun-ai/Step-Audio-2-mini
LoRA adapter: yigagilbert/stepaudio2-mini-luganda-english-s2st-lora
Merge script: scripts/push_full_model_to_hub.py

Evaluation Summary

The table below summarizes the held-out validation evaluation used during development. All text metrics were computed on 200 validation examples. Speech metrics were computed on the aligned 197-example subset for which generated audio existed.
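
The aligned subset can be pictured as the set of example IDs for which every compared system produced audio. The sketch below assumes that reading; the source does not state how the alignment was actually done, and the IDs and the helper name `aligned_subset` are hypothetical.

```python
def aligned_subset(audio_ids_by_system):
    """Return the sorted example IDs for which every system produced audio."""
    id_sets = [set(ids) for ids in audio_ids_by_system.values()]
    return sorted(set.intersection(*id_sets))


# Hypothetical IDs: the fine-tuned system failed to produce audio for "ex3".
generated = {
    "fine_tuned": ["ex1", "ex2", "ex4"],
    "cascade": ["ex1", "ex2", "ex3", "ex4"],
}

subset = aligned_subset(generated)
print(subset)
```

Scoring only this intersection keeps the speech metrics comparable across systems, at the cost of a slightly smaller sample than the text metrics use.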

This full model was created by merging the LoRA adapter into the base model. The metrics below were generated with the adapter-loaded fine-tuned model before the merge. Because the merged model contains the same adapted weights, it is expected to match these results apart from small numerical or runtime differences. If exact reproducibility is required, re-run the evaluation directly on this repository before a strict release.

Text and Semantic Metrics

| System | Loading form | Count | BLEU (↑) | chrF (↑) | WER (↓) | COMET (↑) | BLASER ref (↑) | BLASER QE (↑) |
|---|---|---|---|---|---|---|---|---|
| Base Step-Audio-2-mini | Base only | 200 | 0.012 | 5.152 | 10.702 | 0.386 | 1.713 | 2.164 |
| Fine-tuned Step-Audio | Base + LoRA adapter | 200 | 32.530 | 54.535 | 0.574 | 0.717 | 3.762 | 3.723 |
| This full merged model | Same adapted weights, merged | 200 | 32.530* | 54.535* | 0.574* | 0.717* | 3.762* | 3.723* |
| Cascade baseline | ASR + MT + TTS | 200 | 36.778 | 57.971 | 0.521 | 0.737 | 3.839 | 3.776 |

* The full merged row reflects the adapter evaluation because this repository is produced by folding the evaluated LoRA adapter into the same base model.
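
The base model's WER of 10.702 may look surprising, but WER is not bounded by 1: every extra word in the output counts as an insertion error. The sketch below assumes the standard word-level edit-distance definition of WER. The text normalization used in the actual evaluation is not stated in this card, so the function is illustrative only.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat", "the cat sat"))
print(wer("hello", "hello hello hello hello"))
```

The second call has a one-word reference and three inserted words, so its WER is well above 1. A model that produces long, off-target output can therefore score far above 1.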

Speech Metrics

| System | Count | chrF (↑) | SpeechBERT P (↑) | SpeechBERT R (↑) | SpeechBERT F1 (↑) | MCD (↓) |
|---|---|---|---|---|---|---|
| Fine-tuned Step-Audio | 197 | 54.582 | 0.644 | 0.648 | 0.645 | 629.718 |
| This full merged model | 197 | 54.582* | 0.644* | 0.648* | 0.645* | 629.718* |
| Cascade baseline | 197 | 57.893 | 0.603 | 0.622 | 0.612 | 613.212 |

The base model without fine-tuning emitted no valid speech-token sequences in this 200-sample run, so SpeechBERTScore and MCD were not computed for it.

Interpretation

The fine-tuned Step-Audio model substantially improves over the base model (BLEU 32.530 vs. 0.012) and becomes a viable end-to-end Luganda-to-English speech translation system. The cascade remains stronger on text and semantic metrics (for example, BLEU 36.778 vs. 32.530), while the fine-tuned Step-Audio system is simpler to deploy and scored higher on the WavLM-based SpeechBERTScore F1 proxy (0.645 vs. 0.612).
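
The comparison above can be checked mechanically from the reported numbers. The sketch below copies a selection of values from the two tables and picks the better system per metric, taking each metric's direction into account. The helper name `winner` and the choice of metrics are illustrative.

```python
# name: (higher_is_better, fine_tuned, cascade), values copied from the tables above
METRICS = {
    "BLEU": (True, 32.530, 36.778),
    "chrF": (True, 54.535, 57.971),
    "WER": (False, 0.574, 0.521),
    "COMET": (True, 0.717, 0.737),
    "BLASER ref": (True, 3.762, 3.839),
    "BLASER QE": (True, 3.723, 3.776),
    "SpeechBERT F1": (True, 0.645, 0.612),
    "MCD": (False, 629.718, 613.212),
}


def winner(higher_is_better, fine_tuned, cascade):
    """Return which system scores better on one metric, respecting its direction."""
    if fine_tuned == cascade:
        return "tie"
    fine_tuned_better = (
        fine_tuned > cascade if higher_is_better else fine_tuned < cascade
    )
    return "fine-tuned" if fine_tuned_better else "cascade"


results = {name: winner(*values) for name, values in METRICS.items()}
for name, best in results.items():
    print(f"{name}: {best}")
```

Among the metrics included here, the fine-tuned model leads only on SpeechBERT F1, which matches the interpretation above.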

Notes

If this repository includes token2wav/, those assets are provided to support waveform synthesis from generated audio tokens. Some inference clients may still use the official Step-Audio2 runtime code for token-to-waveform conversion.

License

The training code and adapter metadata are Apache-2.0. Because this merged repository contains base-model weights, users must also comply with the base model license and any dataset licensing constraints.

Model size: 8B params (Safetensors)
Tensor type: BF16

Evaluation results

All values are self-reported on the validation set of the Luganda-English Cleaned v1 Split.

| Metric | Value |
|---|---|
| BLEU | 32.530 |
| chrF | 54.535 |
| WER on generated English text | 0.574 |
| COMET | 0.717 |
| BLASER 2.0 ref | 3.762 |
| BLASER 2.0 QE | 3.723 |