# Hmong TTS - Orpheus-3B (Fine-Tuned) | LocalVoice.org
Hmong Text-To-Speech (TTS) model fine-tuned from Orpheus-3B-TTS, optimized with Unsloth + the SNAC codec. Built by LocalVoice.org to support Hmong language technology.

Special thanks to ThaiSC & the HPC Ignite Program for HPC resources.
## Model Highlights

- Base model: Orpheus-3B TTS
- Codec: SNAC 24 kHz (hubertsiuzdak/snac_24khz)
- Language: Hmong (Hmoob / Hmong Daw)
- Fine-tuned using Unsloth PEFT LoRA
- Supports emotion tags: `<giggle>`, `<laugh>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, `<gasp>` (see the example below)
- Optional multi-speaker prompt prefix
- Real-time inference on a single GPU
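
For example, prompts are plain text with emotion tags written inline, and the multi-speaker checkpoint presumably takes an Orpheus-style `voice_id: ` prefix (an assumption based on the inference code below):

```python
# Single-speaker prompt: plain text with an inline emotion tag
prompt = "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas."

# Assumed multi-speaker form: an Orpheus-style "voice_id: " prefix selects the voice
prompt_multi = "3: kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas."
```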
## Quick Inference Example
```python
from unsloth import FastLanguageModel
import torch
from snac import SNAC

# === Load language model (4-bit optional) ===
model_path = "Pakorn2112/Orpheus-3B-TTS-hmong/model-single-speaker"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,          # auto-detect precision
    load_in_4bit = False,  # set True for 4-bit inference
)

# === Load SNAC codec ===
snac_path = "hubertsiuzdak/snac_24khz"
snac_model = SNAC.from_pretrained(snac_path).to("cuda")

# === Optional voice ID (multi-speaker) ===
chosen_voice = 3  # set to None for the single-speaker model

# === Emotion tags supported ===
# <giggle> <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>
prompts = [
    "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas.",
]

# Enable fast inference
FastLanguageModel.for_inference(model)

# Move the codec to CPU for now to free GPU memory; move it back before decoding
snac_model = snac_model.to("cpu")
```
## Full Token Generation + Decoding (SNAC)
This step generates SNAC audio tokens and reconstructs the waveform: the prompts are formatted into token IDs, the model generates, the 128xxx-range audio tokens are extracted and reshaped, and SNAC decodes them. For the full code, see `inference.py` in this repository (the same example as provided in the training logs).
> Note: output tokens must be split into 7-token frames and redistributed across SNAC's quantized layers before decoding.
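
A minimal end-to-end sketch, assuming the public Orpheus-3B reference decoding format (the special token IDs 128259/128009/128260/128257/128258, the 128266 audio-token offset, and the sampling parameters below come from that reference, not from this repository; `inference.py` remains the authoritative version):

```python
import torch

# --- Format prompts (assumed Orpheus convention: optional "voice_id: " prefix) ---
prompts_ = [f"{chosen_voice}: {p}" if chosen_voice is not None else p for p in prompts]
start_token = torch.tensor([[128259]], dtype=torch.int64)          # start of human turn
end_tokens  = torch.tensor([[128009, 128260]], dtype=torch.int64)  # end of text / start of audio

batch = []
for p in prompts_:
    ids = tokenizer(p, return_tensors="pt").input_ids
    batch.append(torch.cat([start_token, ids, end_tokens], dim=1))

# Single-prompt case shown here; pad to a common length for real batches
input_ids = batch[0].to("cuda")
attention_mask = torch.ones_like(input_ids)

generated = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=128258,   # end-of-speech token in the Orpheus reference
)

# --- Keep only audio tokens: crop after the last 128257 marker, drop 128258 ---
row = generated[0]
idx = (row == 128257).nonzero(as_tuple=True)[0]
if len(idx) > 0:
    row = row[idx[-1] + 1:]
row = row[row != 128258]
row = row[: (len(row) // 7) * 7]     # trim to whole 7-token frames
codes = (row - 128266).tolist()      # remove the audio-token offset

# --- Redistribute each 7-token frame onto SNAC's 3 quantizer layers ---
layer_1, layer_2, layer_3 = [], [], []
for i in range(len(codes) // 7):
    f = codes[7 * i : 7 * i + 7]
    layer_1.append(f[0])
    layer_2.append(f[1] - 4096)
    layer_3.append(f[2] - 2 * 4096)
    layer_3.append(f[3] - 3 * 4096)
    layer_2.append(f[4] - 4 * 4096)
    layer_3.append(f[5] - 5 * 4096)
    layer_3.append(f[6] - 6 * 4096)

snac_model = snac_model.to("cuda")   # move the codec back for decoding
with torch.inference_mode():
    audio = snac_model.decode([
        torch.tensor(layer_1).unsqueeze(0).to("cuda"),
        torch.tensor(layer_2).unsqueeze(0).to("cuda"),
        torch.tensor(layer_3).unsqueeze(0).to("cuda"),
    ])

my_samples = [audio]                 # matches the playback loop below
```

Each 7-token frame maps onto SNAC's multi-scale quantizers at 24 kHz: one coarse code, two medium codes, and four fine codes.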
## Example Usage With Audio Output (IPython)
```python
from IPython.display import display, Audio

# Play each decoded waveform (my_samples holds the outputs of the decoding step above)
for i in range(len(prompts)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().cpu().numpy(), rate=24000))
```
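
Outside a notebook you can write the same arrays to disk instead; a minimal sketch using soundfile (the output file names are placeholders):

```python
import soundfile as sf

# Save each decoded waveform as a 24 kHz WAV file
for i, samples in enumerate(my_samples):
    sf.write(f"output_{i}.wav", samples.detach().squeeze().cpu().numpy(), 24000)
```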
## Recommended Dataset Format (metadata.json)
```json
[
  {
    "audio": "wavs/001.wav",
    "text": "koj nyob li cas?",
    "speaker": "spk_f1"
  },
  {
    "audio": "wavs/002.wav",
    "text": "kuv nyob zoo ua tsaug.",
    "speaker": "spk_f1"
  }
]
```
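
As a sketch, this metadata can be loaded with the Hugging Face `datasets` library (the file name and the 24 kHz decode rate are assumptions matching the format above):

```python
import json
from datasets import Dataset, Audio

with open("metadata.json") as f:
    records = json.load(f)

ds = Dataset.from_list(records)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # decode WAVs at 24 kHz
print(ds[0]["text"], ds[0]["speaker"])
```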
## Tips for Best Quality

- Use 24 kHz mono WAV recordings
- Trim silence and remove heavy noise (see the preprocessing sketch below)
- Keep clips 1–8 seconds long per utterance
- Use a clear, natural speaking tone
- Add optional emotion tokens for expressive voices
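
A minimal preprocessing sketch for the first two tips, using librosa and soundfile (the library choice and the 30 dB trim threshold are assumptions, not part of this repository):

```python
import librosa
import soundfile as sf

def preprocess(src_path: str, dst_path: str) -> None:
    # Resample to 24 kHz mono to match the SNAC codec
    y, sr = librosa.load(src_path, sr=24000, mono=True)
    # Trim leading/trailing silence (threshold is a starting point; tune per dataset)
    y, _ = librosa.effects.trim(y, top_db=30)
    sf.write(dst_path, y, sr)

preprocess("raw/001.wav", "wavs/001.wav")
```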
## License

apache-2.0

This model is released publicly for research and educational use. Commercial applications may require dataset rights and additional review.
## Credits

- Hmong TTS model: LocalVoice.org
- HPC support: ThaiSC supercomputer (LANTA), HPC Ignite Program
- SNAC codec: hubertsiuzdak (24 kHz codec)
- Fine-tuning framework: Unsloth

Thank you for supporting Hmong language technology!
## Model Lineage

- Base model: meta-llama/Llama-3.2-3B-Instruct
- Fine-tuned: canopylabs/orpheus-3b-0.1-pretrained
- Fine-tuned: canopylabs/orpheus-3b-0.1-ft
- Fine-tuned: unsloth/orpheus-3b-0.1-ft