
🗣️ Hmong TTS — Orpheus-3B (Fine-Tuned) | LocalVoice.org


license: apache-2.0

A Hmong Text-To-Speech (TTS) model fine-tuned from Orpheus-3B-TTS with Unsloth, paired with the SNAC codec. Built by LocalVoice.org to support Hmong language technology.

🙏 Special thanks to ThaiSC & the HPC Ignite Program for HPC resources.


🌟 Model Highlights

  • ⚙️ Base Model: Orpheus-3B TTS
  • 🔉 Codec: SNAC 24kHz (hubertsiuzdak/snac_24khz)
  • 🌍 Language: Hmong (Hmoob / Hmong Daw)
  • 🧠 Fine-tuned with Unsloth (PEFT LoRA)
  • 🎙️ Supports emotion tags: <giggle>, <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
  • 🎭 Optional multi-speaker prompt prefix
  • ⚡ Real-time inference on a single GPU

🧪 Quick Inference Example

from unsloth import FastLanguageModel
import torch
from snac import SNAC

# === Load Language Model (4-bit optional) ===
model_path = "Pakorn2112/Orpheus-3B-TTS-hmong/model-single-speaker"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,            # Auto-detect precision
    load_in_4bit = False,    # Set True for 4-bit inference
)

# === Load SNAC codec ===
snac_path = "hubertsiuzdak/snac_24khz"
snac_model = SNAC.from_pretrained(snac_path).to("cuda")

# === Optional Voice ID (multi-speaker checkpoints) ===
chosen_voice = 3   # Set to None for single-speaker checkpoints

# === Emotion tags supported ===
# <giggle> <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>

prompts = [
    # Roughly: "My name is Paj Ntaub, <giggle> what is your name?"
    "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas.",
]
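
# If a multi-speaker checkpoint is loaded, prepend the chosen voice ID to each
# prompt. The "<voice>: <text>" prefix format is an assumption based on the
# common Orpheus recipe; check inference.py for this checkpoint's exact scheme.
prompts = [f"{chosen_voice}: {p}" if chosen_voice is not None else p for p in prompts]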

# Enable fast inference
FastLanguageModel.for_inference(model)

# Move the codec to CPU to free GPU memory for generation (decoding runs on CPU)
snac_model = snac_model.to("cpu")

🎧 Full Token Generation + Decoding (SNAC)

This step generates SNAC tokens and reconstructs the audio. For the full code, see inference.py in this repository.

# (Token formatting, generation & decoding)
# Extract 128xxx audio tokens → reshape → decode via SNAC
# Full example in the repository (same as provided in the training logs)
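
As a minimal sketch of that pipeline: the snippet below wraps the prompt in marker tokens, generates, and crops the output to the audio-token span. Every specific token ID here (the 128259/128009/128260 markers, 128257 start-of-audio, 128258 end-of-speech, and the 128266 audio offset) is an assumption carried over from the upstream Orpheus-3B notebooks, not confirmed for this checkpoint; inference.py in this repository is authoritative.

import torch

# Wrap the text prompt in start/end marker tokens (IDs assumed from the
# upstream Orpheus-3B recipe; verify against inference.py)
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
input_ids = torch.cat(
    [torch.tensor([[128259]]), input_ids, torch.tensor([[128009, 128260]])],
    dim=1,
).to("cuda")

generated = model.generate(
    input_ids=input_ids,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=128258,   # assumed end-of-speech token
)

# Keep only tokens after the last start-of-audio marker, drop the EOS,
# and subtract the audio-token offset so codes start at 0
row = generated[0]
start = (row == 128257).nonzero(as_tuple=True)[0][-1].item() + 1
codes = [t.item() - 128266 for t in row[start:] if t.item() != 128258]
codes = codes[: (len(codes) // 7) * 7]   # trim to whole 7-token frames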

🛑 Note: Output tokens must be split into 7-token tuples spanning SNAC's quantized layers before decoding.
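
A sketch of that split, assuming the 1 + 2 + 4 codes-per-frame layout of SNAC's three quantizer layers and the per-position 4096 offsets used in the upstream Orpheus recipe:

def redistribute_codes(code_list, snac_model):
    # Each 7-token frame carries 1 code for layer 1, 2 codes for layer 2,
    # and 4 codes for layer 3; each position is offset by a multiple of 4096
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        frame = code_list[7 * i : 7 * i + 7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 4096)
        layer_3.append(frame[2] - 2 * 4096)
        layer_3.append(frame[3] - 3 * 4096)
        layer_2.append(frame[4] - 4 * 4096)
        layer_3.append(frame[5] - 5 * 4096)
        layer_3.append(frame[6] - 6 * 4096)
    layers = [torch.tensor(l).unsqueeze(0) for l in (layer_1, layer_2, layer_3)]
    return snac_model.decode(layers)   # (1, 1, num_samples) waveform at 24 kHz

# One decoded waveform per prompt, consumed by the playback loop below
my_samples = [redistribute_codes(codes, snac_model)]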


🎙️ Example Usage With Audio Output (IPython)

from IPython.display import display, Audio

# Play each decoded waveform; `my_samples` holds the outputs of the
# decoding step above, one waveform per prompt
for i in range(len(prompts)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().cpu().numpy(), rate=24000))

📌 Recommended Dataset Format (metadata.json)

[
  {
    "audio": "wavs/001.wav",
    "text": "koj nyob li cas?",
    "speaker": "spk_f1"
  },
  {
    "audio": "wavs/002.wav",
    "text": "kuv nyob zoo ua tsaug.",
    "speaker": "spk_f1"
  }
]
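
As a quick sanity check, a minimal sketch that loads metadata.json and pairs each clip with its text (the "speaker: text" pairing mirrors the multi-speaker prompt prefix assumed above):

import json

with open("metadata.json", encoding="utf-8") as f:
    entries = json.load(f)

for e in entries:
    text = f"{e['speaker']}: {e['text']}"   # assumed training-text format
    print(e["audio"], "->", text)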

💡 Tips for Best Quality

  • Use 24kHz mono WAV recordings (see the conversion sketch below)
  • Trim silence and remove heavy background noise
  • Keep clips 1-8 seconds long per utterance
  • Use a clear, natural speaking tone
  • Add optional emotion tokens for expressive voices
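
A minimal preprocessing sketch using torchaudio to downmix and resample a clip to 24kHz mono (file paths are illustrative):

import torchaudio

wav, sr = torchaudio.load("raw/001.wav")              # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24000)  # resample to 24 kHz
torchaudio.save("wavs/001.wav", wav, 24000)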

📄 License

apache-2.0

This model is released publicly for research and educational use. Commercial applications may require additional dataset rights and review.


🤝 Credits

  • Hmong TTS Model: LocalVoice.org
  • HPC Support: ThaiSC Supercomputer (LANTA) — HPC Ignite Program
  • SNAC Codec Team: hubertsiuzdak (24kHz codec)
  • Fine-Tuning Framework: Unsloth

🎉 Thank you for supporting Hmong language technology! 🖤💚💙
