Step-Audio 2 Mini Luganda-to-English S2ST

This repository contains a full merged model for Luganda speech input to English speech translation output. It was created by merging the LoRA adapter yigagilbert/stepaudio2-mini-luganda-english-s2st-lora into stepfun-ai/Step-Audio-2-mini.

The separate adapter-only repository should remain available for users who prefer PEFT loading or want the smaller adapter artifact. This repository is intended for simpler deployment and inference where loading a single model repo is preferable.

Intended Use

Research and development for Luganda-to-English speech translation. Validate outputs with native speakers before production use.

Source Model and Adapter

  • Base model: stepfun-ai/Step-Audio-2-mini
  • LoRA adapter: yigagilbert/stepaudio2-mini-luganda-english-s2st-lora
  • Merge script: scripts/push_full_model_to_hub.py

Evaluation Summary

The table below summarizes the held-out validation evaluation used during development. All text metrics were computed on 200 validation examples. Speech metrics were computed on the aligned 197-example subset for which generated audio existed.
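The split between 200 text examples and 197 speech examples can be pictured as a simple filter over the validation records. The sketch below is illustrative only: the record layout and the field names `hyp_text` and `audio_path` are assumptions, not the format used by the evaluation code. It shows why a metric such as chrF can differ slightly between the two tables, since the speech table would be scored on the smaller aligned subset.

```python
# Hypothetical validation records; "audio_path" is None when no audio was generated.
examples = [
    {"id": "a", "hyp_text": "hello", "audio_path": "out/a.wav"},
    {"id": "b", "hyp_text": "thanks", "audio_path": None},
    {"id": "c", "hyp_text": "good night", "audio_path": "out/c.wav"},
]


def aligned_subset(records):
    """Keep only the records for which generated audio exists."""
    return [r for r in records if r["audio_path"] is not None]


# Text metrics would use every record; speech metrics only the aligned subset.
text_eval = examples
speech_eval = aligned_subset(examples)
print(len(text_eval), len(speech_eval))
```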

The metrics below were generated with the adapter-loaded fine-tuned model, before the adapter was merged into the base model. Because the merged model contains the same adapted weights, it is expected to match these results apart from small numerical or runtime differences. If exact reproducibility is required, re-run the evaluation directly on this repository before a release.
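The reason a merged model is expected to reproduce the adapter-loaded results follows from the standard LoRA formulation, in which the adapted layer computes W x + (alpha / r) B A x and merging folds the update into a single weight W + (alpha / r) B A. The sketch below checks that identity on tiny hand-written matrices. It is not the merge script from this repository; the matrices, `alpha`, `r`, and the helper functions are illustrative assumptions.

```python
# Minimal sketch of the LoRA merge identity using plain Python lists.
# All values and helper names here are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Base weight W (2x2) and a rank-1 LoRA update B (2x1) @ A (1x2).
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0]]
alpha, r = 2.0, 1          # LoRA scaling factor is alpha / r
x = [[1.0], [1.0]]         # input as a column vector

# Adapter-loaded forward pass: W x + (alpha / r) * B (A x)
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged forward pass: (W + (alpha / r) * B A) x
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged_out = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(adapter_out, merged_out))
print(max_diff)
```

In exact arithmetic the two paths are identical. With BF16 weights the merged sum is rounded once when it is stored, which is one source of the small numerical differences mentioned above.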

Text and Semantic Metrics

| System | Loading form | Count | BLEU (↑) | chrF (↑) | WER (↓) | COMET (↑) | BLASER ref (↑) | BLASER QE (↑) |
|---|---|---|---|---|---|---|---|---|
| Base Step-Audio-2-mini | Base only | 200 | 0.012 | 5.152 | 10.702 | 0.386 | 1.713 | 2.164 |
| Fine-tuned Step-Audio | Base + LoRA adapter | 200 | 32.530 | 54.535 | 0.574 | 0.717 | 3.762 | 3.723 |
| This full merged model | Same adapted weights, merged | 200 | 32.530* | 54.535* | 0.574* | 0.717* | 3.762* | 3.723* |
| Cascade baseline | ASR + MT + TTS | 200 | 36.778 | 57.971 | 0.521 | 0.737 | 3.839 | 3.776 |

* The full merged row reflects the adapter evaluation because this repository is produced by folding the evaluated LoRA adapter into the same base model.
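A WER above 1.0, such as the base model's 10.702, can look like an error at first glance. Under the standard word-level definition (edit distance divided by the number of reference words), it is possible whenever the hypothesis is much longer than the reference, because every extra word counts as an insertion. The sketch below illustrates that definition only. This card does not state how its WER was computed, so the function and the example strings are assumptions rather than the evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings: a perfect match, then a long repetitive hypothesis.
perfect = word_error_rate("good morning", "good morning")
runaway = word_error_rate("good morning", "the the the the the the")
print(perfect, runaway)
```

The second call needs two substitutions and four insertions against a two-word reference, so the ratio exceeds 1.0.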

Speech Metrics

| System | Count | chrF (↑) | SpeechBERT P (↑) | SpeechBERT R (↑) | SpeechBERT F1 (↑) | MCD (↓) |
|---|---|---|---|---|---|---|
| Fine-tuned Step-Audio | 197 | 54.582 | 0.644 | 0.648 | 0.645 | 629.718 |
| This full merged model | 197 | 54.582* | 0.644* | 0.648* | 0.645* | 629.718* |
| Cascade baseline | 197 | 57.893 | 0.603 | 0.622 | 0.612 | 613.212 |

Without fine-tuning, the base model emitted no valid speech-token sequences in this 200-sample run, so SpeechBERTScore and MCD were not computed for it.

Interpretation

The fine-tuned Step-Audio model substantially improves over the base model (BLEU rises from 0.012 to 32.530 and WER falls from 10.702 to 0.574) and becomes a viable end-to-end Luganda-to-English speech translation system. The cascade remains stronger on every text and semantic metric (for example, BLEU 36.778 versus 32.530) and has a lower MCD, while the fine-tuned Step-Audio system is simpler to deploy and scored higher on the WavLM-based SpeechBERTScore proxy (F1 0.645 versus 0.612).
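The comparison can be checked mechanically from the reported numbers. The short sketch below copies the fine-tuned and cascade values from the two tables, records whether higher or lower is better for each metric, and reports which system wins. The `METRICS` layout and the `winner` helper are illustrative and not part of the evaluation code.

```python
# (fine-tuned, cascade, higher_is_better), values copied from the tables in this card.
METRICS = {
    "BLEU": (32.530, 36.778, True),
    "chrF": (54.535, 57.971, True),
    "WER": (0.574, 0.521, False),
    "COMET": (0.717, 0.737, True),
    "BLASER ref": (3.762, 3.839, True),
    "BLASER QE": (3.723, 3.776, True),
    "SpeechBERT F1": (0.645, 0.612, True),
    "MCD": (629.718, 613.212, False),
}


def winner(fine_tuned, cascade, higher_is_better):
    """Return which system scores better on a metric, respecting its direction."""
    if fine_tuned == cascade:
        return "tie"
    fine_tuned_wins = (fine_tuned > cascade) == higher_is_better
    return "fine-tuned" if fine_tuned_wins else "cascade"


results = {name: winner(*values) for name, values in METRICS.items()}
for name, who in results.items():
    print(f"{name}: {who}")
```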

Notes

If this repository includes token2wav/, those assets are provided to support waveform synthesis from generated audio tokens. Some inference clients may still use the official Step-Audio2 runtime code for token-to-waveform conversion.

License

The training code and adapter metadata are Apache-2.0. Because this merged repository contains base-model weights, users must also comply with the base model license and any dataset licensing constraints.
