
🗣️ Hmong TTS — Orpheus-3B (Fine-Tuned) | LocalVoice.org


license: apache-2.0

A Hmong Text-To-Speech (TTS) model fine-tuned from Orpheus-3B-TTS with Unsloth, paired with the SNAC codec. Built by LocalVoice.org to support Hmong language technology.

🙏 Special thanks to ThaiSC & the HPC Ignite Program for HPC resources.


🌟 Model Highlights

  • ⚙️ Base Model: Orpheus-3B TTS
  • 🔉 Codec: SNAC 24kHz (hubertsiuzdak/snac_24khz)
  • 🌍 Language: Hmong (Hmoob / Hmong Daw)
  • 🧠 Fine-tuned with Unsloth (PEFT LoRA)
  • 🎙️ Supports emotion tags: <giggle>, <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
  • 🎭 Optional multi-speaker prompt prefix
  • ⚡ Real-time inference on a single GPU

🧪 Quick Inference Example

from unsloth import FastLanguageModel
import torch
from snac import SNAC

# === Load Language Model (4-bit optional) ===
model_path = "Pakorn2112/Orpheus-3B-TTS-hmong/model-single-speaker"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,            # Auto-detect precision
    load_in_4bit = False,    # Set True for 4-bit inference
)

# === Load SNAC codec ===
snac_path = "hubertsiuzdak/snac_24khz"
snac_model = SNAC.from_pretrained(snac_path).to("cuda")

# === Optional Voice ID (multi-speaker checkpoints) ===
chosen_voice = 3   # Set to None for single-speaker checkpoints

# === Emotion tags supported ===
# <giggle> <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>

prompts = [
    # Roughly: "My name is Paj Ntaub, <giggle> what is your name?"
    "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas.",
]
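
# If a multi-speaker checkpoint is loaded, prepend the chosen voice ID to each
# prompt. The "<voice>: <text>" prefix format is an assumption based on the
# common Orpheus recipe; check inference.py for this checkpoint's exact scheme.
prompts = [f"{chosen_voice}: {p}" if chosen_voice is not None else p for p in prompts]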

# Enable fast inference
FastLanguageModel.for_inference(model)

# Move the codec to CPU to free GPU memory for generation (decoding runs on CPU)
snac_model = snac_model.to("cpu")

🎧 Full Token Generation + Decoding (SNAC)

This step generates SNAC tokens and reconstructs the audio. For the full code, see inference.py in this repository.

# (Token formatting, generation & decoding)
# Extract 128xxx audio tokens → reshape → decode via SNAC
# Full example in the repository (same as provided in the training logs)
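
As a minimal sketch of that pipeline: the snippet below wraps the prompt in marker tokens, generates, and crops the output to the audio-token span. Every specific token ID here (the 128259/128009/128260 markers, 128257 start-of-audio, 128258 end-of-speech, and the 128266 audio offset) is an assumption carried over from the upstream Orpheus-3B notebooks, not confirmed for this checkpoint; inference.py in this repository is authoritative.

import torch

# Wrap the text prompt in start/end marker tokens (IDs assumed from the
# upstream Orpheus-3B recipe; verify against inference.py)
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
input_ids = torch.cat(
    [torch.tensor([[128259]]), input_ids, torch.tensor([[128009, 128260]])],
    dim=1,
).to("cuda")

generated = model.generate(
    input_ids=input_ids,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=128258,   # assumed end-of-speech token
)

# Keep only tokens after the last start-of-audio marker, drop the EOS,
# and subtract the audio-token offset so codes start at 0
row = generated[0]
start = (row == 128257).nonzero(as_tuple=True)[0][-1].item() + 1
codes = [t.item() - 128266 for t in row[start:] if t.item() != 128258]
codes = codes[: (len(codes) // 7) * 7]   # trim to whole 7-token frames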

🛑 Note: Output tokens must be split into 7-token tuples spanning SNAC's quantized layers before decoding.
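
A sketch of that split, assuming the 1 + 2 + 4 codes-per-frame layout of SNAC's three quantizer layers and the per-position 4096 offsets used in the upstream Orpheus recipe:

def redistribute_codes(code_list, snac_model):
    # Each 7-token frame carries 1 code for layer 1, 2 codes for layer 2,
    # and 4 codes for layer 3; each position is offset by a multiple of 4096
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        frame = code_list[7 * i : 7 * i + 7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 4096)
        layer_3.append(frame[2] - 2 * 4096)
        layer_3.append(frame[3] - 3 * 4096)
        layer_2.append(frame[4] - 4 * 4096)
        layer_3.append(frame[5] - 5 * 4096)
        layer_3.append(frame[6] - 6 * 4096)
    layers = [torch.tensor(l).unsqueeze(0) for l in (layer_1, layer_2, layer_3)]
    return snac_model.decode(layers)   # (1, 1, num_samples) waveform at 24 kHz

# One decoded waveform per prompt, consumed by the playback loop below
my_samples = [redistribute_codes(codes, snac_model)]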


🎙️ Example Usage With Audio Output (IPython)

from IPython.display import display, Audio

# Play each decoded waveform; `my_samples` holds the outputs of the
# decoding step above, one waveform per prompt
for i in range(len(prompts)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().cpu().numpy(), rate=24000))

📌 Recommended Dataset Format (metadata.json)

[
  {
    "audio": "wavs/001.wav",
    "text": "koj nyob li cas?",
    "speaker": "spk_f1"
  },
  {
    "audio": "wavs/002.wav",
    "text": "kuv nyob zoo ua tsaug.",
    "speaker": "spk_f1"
  }
]
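
As a quick sanity check, a minimal sketch that loads metadata.json and pairs each clip with its text (the "speaker: text" pairing mirrors the multi-speaker prompt prefix assumed above):

import json

with open("metadata.json", encoding="utf-8") as f:
    entries = json.load(f)

for e in entries:
    text = f"{e['speaker']}: {e['text']}"   # assumed training-text format
    print(e["audio"], "->", text)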

💡 Tips for Best Quality

  • Use 24kHz mono WAV recordings (see the conversion sketch below)
  • Trim silence and remove heavy background noise
  • Keep clips 1-8 seconds long per utterance
  • Use a clear, natural speaking tone
  • Add optional emotion tokens for expressive voices
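
A minimal preprocessing sketch using torchaudio to downmix and resample a clip to 24kHz mono (file paths are illustrative):

import torchaudio

wav, sr = torchaudio.load("raw/001.wav")              # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24000)  # resample to 24 kHz
torchaudio.save("wavs/001.wav", wav, 24000)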

📄 License

apache-2.0

This model is released publicly for research and educational use. Commercial applications may require additional dataset rights and review.


🤝 Credits

  • Hmong TTS Model: LocalVoice.org
  • HPC Support: ThaiSC Supercomputer (LANTA) — HPC Ignite Program
  • SNAC Codec Team: hubertsiuzdak (24kHz codec)
  • Fine-Tuning Framework: Unsloth

🎉 Thank you for supporting Hmong language technology! 🖤💚💙
