You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

YAML Metadata Warning: empty or missing yaml metadata in repo card

Check out the documentation for more information.

mbert-tibetan-continual-wylie-final

This repository is public.

Overview

This is a BERT model continually trained from bert-base-multilingual-cased on Tibetan data represented in Wylie transliteration.

It was trained as part of the Intelexsus project on a mixed Tibetan corpus that includes:

  • Tibetan text converted from the original Tibetan script (Unicode) into Wylie transliteration
  • Data that was originally authored in Wylie transliteration

The goal is to improve Tibetan representations for downstream tasks where Tibetan content is available or normalized in Wylie.

Model Details

  • Base model: bert-base-multilingual-cased
  • Language/Script: Tibetan via Wylie transliteration (bo)
  • Training objective: Masked Language Modeling (MLM)
  • Architecture: 12-layer, 768-hidden, 12-heads
  • Tokenizer: WordPiece tokenizer compatible with mBERT

How to Use

You can use this model directly with the transformers library for the fill-mask task.

from transformers import pipeline

model_name = "Intellexus/mbert-tibetan-continual-wylie-final"
unmasker = pipeline("fill-mask", model=model_name)

# Example sentence in Wylie transliteration (demonstrative only)
result = unmasker("bod yig la [MASK] yod do")
print(result)

You can also load the model and tokenizer directly for more control:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/mbert-tibetan-continual-wylie-final"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# You can now use the model for your own fine-tuning and inference tasks.

Training Data

The continual training used a Tibetan corpus consisting of:

  • Tibetan Unicode-script text converted to Wylie transliteration
  • Native Wylie transliterated sources

This combination aims to support tasks where Tibetan content is handled in transliterated form.

Intended Use and Limitations

This model is intended for research and downstream tasks involving Tibetan in Wylie transliteration. It may contain biases present in the training data and may not perform well outside this domain.

Citation

If you use this model, please cite the Intelexsus project or link to the model page: https://huggingface.co/OMRIDRORI/mbert-tibetan-continual-wylie-final

Downloads last month
871
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Intellexus/mbert-tibetan-continual-wylie-final

Finetunes
10 models