Step-Audio 2 Mini Luganda-to-English S2ST
This repository contains the fully merged model for Luganda-to-English speech-to-speech
translation (S2ST): Luganda speech in, English speech out. It was created by merging the LoRA adapter
yigagilbert/stepaudio2-mini-luganda-english-s2st-lora into the base model stepfun-ai/Step-Audio-2-mini.
The separate adapter-only repository should remain available for users who prefer PEFT loading or want the smaller adapter artifact. This repository is intended for simpler deployment and inference where loading a single model repo is preferable.
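The two loading routes described above can be sketched as follows. This is a minimal sketch, not the author's merge script: it assumes the base model and this repository both load through `AutoModelForCausalLM` with `trust_remote_code=True`, and that the adapter repository is in standard PEFT format. The helper names `load_merged` and `load_base_plus_adapter` are hypothetical.

```python
# Repository IDs are taken from this card; the helper names are hypothetical.
BASE_ID = "stepfun-ai/Step-Audio-2-mini"
ADAPTER_ID = "yigagilbert/stepaudio2-mini-luganda-english-s2st-lora"
MERGED_ID = "yigagilbert/stepaudio2-mini-luganda-english-s2st"


def load_merged():
    """Load this single merged repository (no PEFT dependency needed)."""
    # Imported lazily so the module can be inspected without heavy dependencies.
    from transformers import AutoModelForCausalLM

    # Assumption: the custom model code in the repo is reachable this way.
    return AutoModelForCausalLM.from_pretrained(MERGED_ID, trust_remote_code=True)


def load_base_plus_adapter(merge=False):
    """Load the base model, attach the LoRA adapter with PEFT, optionally merge."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(BASE_ID, trust_remote_code=True)
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    # merge_and_unload() folds the LoRA deltas into the base weights and
    # returns a plain model without the PEFT wrapper.
    return model.merge_and_unload() if merge else model
```

Neither function downloads anything until it is called. The adapter route keeps the small adapter artifact separate, while the merged route needs only one repository.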
Intended Use
Research and development for Luganda-to-English speech translation. Validate outputs with native speakers before production use.
Source Model and Adapter
- Base model: stepfun-ai/Step-Audio-2-mini
- LoRA adapter: yigagilbert/stepaudio2-mini-luganda-english-s2st-lora
- Merge script: scripts/push_full_model_to_hub.py
Evaluation Summary
The table below summarizes the held-out validation evaluation used during development. All text metrics were computed on 200 validation examples. Speech metrics were computed on the aligned 197-example subset for which generated audio existed.
This full model was created by merging the LoRA adapter into the base model. The metrics below were generated with the adapter-loaded fine-tuned model before the merge. The merged model contains the same adapted weights and is expected to match these results, aside from small numerical or runtime differences. If exact reproducibility is required, re-run the evaluation directly on this repository before a strict release.
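Why the merged model is expected to match the adapter-loaded one can be illustrated on a single linear layer. The sketch below uses the standard LoRA formulation, where the merged weight is `W + (alpha / rank) * B @ A`; the matrix sizes, `alpha`, and the random values are toy assumptions, not the real model's configuration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4.0  # toy sizes, not the real model's

W = rand_matrix(d_out, d_in, rng)  # frozen base weight
B = rand_matrix(d_out, rank, rng)  # LoRA "up" projection
A = rand_matrix(rank, d_in, rng)   # LoRA "down" projection
x = rand_matrix(d_in, 1, rng)      # one input column vector

scaling = alpha / rank

# Adapter-loaded forward pass: base path plus the scaled low-rank path.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged forward pass: fold the low-rank update into the weight once.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_adapter, y_merged))
print(f"max |y_adapter - y_merged| = {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, because the operations are mathematically identical but evaluated in a different order. The same effect, amplified by reduced-precision dtypes and sampling settings, is why a re-run on the merged weights may not reproduce every digit.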
Text and Semantic Metrics
| System | Loading Form | Count | BLEU ↑ | chrF ↑ | WER ↓ | COMET ↑ | BLASER ref ↑ | BLASER QE ↑ |
|---|---|---|---|---|---|---|---|---|
| Base Step-Audio-2-mini | Base only | 200 | 0.012 | 5.152 | 10.702 | 0.386 | 1.713 | 2.164 |
| Fine-tuned Step-Audio | Base + LoRA adapter | 200 | 32.530 | 54.535 | 0.574 | 0.717 | 3.762 | 3.723 |
| This full merged model | Same adapted weights, merged | 200 | 32.530* | 54.535* | 0.574* | 0.717* | 3.762* | 3.723* |
| Cascade baseline | ASR + MT + TTS | 200 | 36.778 | 57.971 | 0.521 | 0.737 | 3.839 | 3.776 |
* The full merged row reflects the adapter evaluation because this repository is
produced by folding the evaluated LoRA adapter into the same base model.
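The base model's WER of 10.702 may look surprising, since error rates are often assumed to top out at 1. WER is the word-level edit distance divided by the reference length, so a hypothesis much longer than the reference can push it well above 1. The sketch below is a minimal textbook implementation; the text normalization used for the numbers in this card is not specified here, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# A perfect hypothesis scores 0; a repetitive, over-long one exceeds 1.
exact = word_error_rate("good morning", "good morning")
looping = word_error_rate("good morning", "good morning good morning good morning")
print(exact, looping)
```

A value above 1 therefore signals many inserted words relative to the reference, which is consistent with a model that has not learned the task and produces long, unrelated output.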
Speech Metrics
| System | Count | chrF ↑ | SpeechBERT P ↑ | SpeechBERT R ↑ | SpeechBERT F1 ↑ | MCD ↓ |
|---|---|---|---|---|---|---|
| Fine-tuned Step-Audio | 197 | 54.582 | 0.644 | 0.648 | 0.645 | 629.718 |
| This full merged model | 197 | 54.582* | 0.644* | 0.648* | 0.645* | 629.718* |
| Cascade baseline | 197 | 57.893 | 0.603 | 0.622 | 0.612 | 613.212 |
The unfine-tuned base model emitted no valid speech-token sequences in this 200-sample run, so SpeechBERTScore and MCD were not computed for it.
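To give a sense of what the SpeechBERT P, R, and F1 columns measure, the sketch below shows BERTScore-style greedy matching over frame-level feature vectors. It is an illustration under assumptions: the real metric uses features from a self-supervised speech model, whereas the vectors here are toy 2-D values, and how the evaluation in this card aggregated scores across utterances is not stated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_style(generated, reference):
    """Greedy-matching precision, recall, and F1 over two frame sequences."""
    # Precision: each generated frame is matched to its closest reference frame.
    precision = sum(max(cosine(g, r) for r in reference) for g in generated) / len(generated)
    # Recall: each reference frame is matched to its closest generated frame.
    recall = sum(max(cosine(r, g) for g in generated) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy frames (hypothetical): the generated audio misses one reference frame.
generated_frames = [[1.0, 0.0], [0.0, 1.0]]
reference_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = bertscore_style(generated_frames, reference_frames)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Both sequences must be non-empty for this computation, which is why the metric cannot be reported for a system that emits no valid speech-token sequences.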
Interpretation
The fine-tuned Step-Audio model substantially improves over the base model (BLEU 0.012 to 32.530, WER 10.702 to 0.574) and becomes a viable end-to-end Luganda-to-English speech translation system. The cascade remains stronger on every text and semantic metric (for example, BLEU 36.778 vs. 32.530) and has a lower MCD, while the fine-tuned Step-Audio system is simpler to deploy and scored higher on the WavLM-based SpeechBERTScore F1 proxy (0.645 vs. 0.612).
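The comparison can be checked mechanically from the tables. The sketch below copies a subset of the reported values and takes metric direction into account; note that the text metrics come from 200 examples and the speech metrics from the 197-example subset.

```python
# Values copied from the tables in this card.
finetuned = {"BLEU": 32.530, "chrF": 54.535, "WER": 0.574, "COMET": 0.717,
             "SpeechBERT F1": 0.645, "MCD": 629.718}
cascade = {"BLEU": 36.778, "chrF": 57.971, "WER": 0.521, "COMET": 0.737,
           "SpeechBERT F1": 0.612, "MCD": 613.212}
lower_is_better = {"WER", "MCD"}


def better_system(metric):
    """Return which system wins on a metric, respecting its direction."""
    ft, ca = finetuned[metric], cascade[metric]
    if metric in lower_is_better:
        return "fine-tuned" if ft < ca else "cascade"
    return "fine-tuned" if ft > ca else "cascade"


for metric in finetuned:
    diff = finetuned[metric] - cascade[metric]
    print(f"{metric:14s} fine-tuned - cascade = {diff:+8.3f}  better: {better_system(metric)}")
```

Of the metrics included here, SpeechBERT F1 is the only one on which the fine-tuned model leads.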
Notes
If this repository includes token2wav/, those assets are provided to support waveform
synthesis from generated audio tokens. Some inference clients may still use the official
Step-Audio2 runtime code for token-to-waveform conversion.
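Because the presence of token2wav/ is conditional, a client may want to check for it before choosing a synthesis path. A minimal sketch, assuming the repository has already been downloaded to a local directory; the function name and the example path are hypothetical.

```python
from pathlib import Path


def has_token2wav(repo_dir) -> bool:
    """Return True if a local copy of the repository contains a token2wav/ folder."""
    return (Path(repo_dir) / "token2wav").is_dir()


# Hypothetical local path; replace with wherever the repository was downloaded.
local_repo = Path("./stepaudio2-mini-luganda-english-s2st")
if has_token2wav(local_repo):
    print("token2wav assets found in the local repository copy")
else:
    print("no token2wav/ folder; fall back to the official runtime's assets")
```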
License
The training code and adapter metadata are Apache-2.0. Because this merged repository contains base-model weights, users must also comply with the base model license and any dataset licensing constraints.
Evaluation Results
Self-reported on the Luganda-English Cleaned v1 Split validation set:
- BLEU: 32.530
- chrF: 54.535
- WER on generated English text: 0.574
- COMET: 0.717
- BLASER 2.0 ref: 3.762
- BLASER 2.0 QE: 3.723