---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---

# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

**GaroEmbed** is the first neural sentence embedding model for Garo, a Tibeto-Burman language with roughly 1.2M speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.

## Model Description

- **Model Type**: BiLSTM sentence encoder with contrastive learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText, 300d with character n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M
- **Training Time**: ~15 minutes on an RTX A4500

## Performance

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg. Cosine Similarity | 0.3446 |

**88x improvement** over the mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).

## Usage

### Requirements

```bash
pip install torch fasttext-wheel sentence-transformers huggingface-hub
```

### Loading the Model

```python
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download the model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download the GaroVec word embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define the model architecture (see model_architecture.py in the repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300,
                 hidden_dim=512, output_dim=384, dropout=0.3):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

        # Initialize the embedding table from pretrained GaroVec vectors;
        # these stay frozen during training (see Training Details)
        vocab_size = len(garo_fasttext_model.words)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = [garo_fasttext_model.get_word_vector(word)
                   for word in garo_fasttext_model.words]
        self.embedding.weight.data.copy_(torch.FloatTensor(weights))

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, dropout=dropout,
                            batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)

        self.word2idx = {word: idx for idx, word
                         in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            # Map tokens to vocabulary indices; OOV tokens fall back to index 0
            indices = [self.word2idx.get(token, 0) for token in tokens]
            if len(indices) == 0:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device

        # Zero-pad every sentence in the batch to the same length
        padded = torch.zeros(len(sentences), max_len,
                             dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)

        embedded = self.embedding(padded)
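        # Pack the padded batch so the LSTM skips padding positions;
        # enforce_sorted=False keeps the original sentence order.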
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)

        # hidden has shape (num_layers * 2, batch, hidden_dim); the last two
        # slices are the final layer's forward and backward states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)

        # Project into the shared 384d space and L2-normalize
        sentence_embedding = self.projection(combined)
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize the model and load the trained weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]
with torch.no_grad():
    embeddings = model(garo_sentences)

print(f"Embeddings shape: {embeddings.shape}")  # torch.Size([2, 384])
```

### Cross-Lingual Retrieval

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the English encoder (the frozen anchor used during training)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English sentences
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

with torch.no_grad():
    garo_embeds = model(garo_texts).cpu().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Cosine similarity between every Garo-English pair
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```

## Training Details

- **Architecture**: 2-layer BiLSTM (512 hidden units) + linear projection
- **Loss**: InfoNCE contrastive loss (temperature = 0.07); see the sketch at the end of this card
- **Optimizer**: Adam (lr = 2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)

## Limitations

- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: daily conversation and cultural topics (lacks technical and literary language)
- Orthography: Latin script only
- Morphology: does not explicitly model Garo's agglutinative structure
- Evaluation: limited to retrieval tasks

## Acknowledgments

- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)

## License

MIT License - free for research and commercial use

## Contact

- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)

---

*First neural sentence embedding model for the Garo language • Enabling NLP for low-resource Tibeto-Burman languages*
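## Appendix: InfoNCE Loss (Sketch)

For reference, here is a minimal sketch of the InfoNCE objective listed under Training Details, assuming L2-normalized 384d embeddings and in-batch negatives. The symmetric (both-direction) formulation and the name `info_nce_loss` are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    """Symmetric InfoNCE with in-batch negatives (illustrative sketch).

    Both inputs are assumed L2-normalized with shape (batch, 384),
    where row i of each tensor forms a parallel Garo-English pair.
    """
    # Scaled cosine similarity matrix, shape (batch, batch)
    logits = garo_embeds @ english_embeds.T / temperature
    # Each Garo sentence's true translation sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the Garo->English and English->Garo retrieval directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

With batch size 32, each positive pair is contrasted against 31 in-batch negatives; the low temperature (0.07) sharpens the softmax, so the model is penalized heavily for ranking any negative above the true translation.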