Spaces:

uBaBy4knife
/

BanglaFeel


uBaby4life committed on May 11, 2025

Commit

3785cde

1 Parent(s): 22f37d2

Add Flask application with Docker setup for transliterator

Files changed (14)

Dockerfile +42 -0
README.md +33 -6
app.py +627 -0
model_files/tokenizers/encoders_tokenizer/special_tokens_map.json +37 -0
model_files/tokenizers/encoders_tokenizer/tokenizer.json +0 -0
model_files/tokenizers/encoders_tokenizer/tokenizer_config.json +63 -0
model_files/tokenizers/encoders_tokenizer/vocab.txt +0 -0
model_files/tokenizers/t5_tokenizer/added_tokens.json +130 -0
model_files/tokenizers/t5_tokenizer/special_tokens_map.json +135 -0
model_files/tokenizers/t5_tokenizer/spiece.model +3 -0
model_files/tokenizers/t5_tokenizer/tokenizer_config.json +1169 -0
requirements.txt +7 -0
static/style.css +274 -0
templates/index.html +144 -0

Dockerfile ADDED

	@@ -0,0 +1,42 @@

+# Start from a Python base image
+FROM python:3.9-slim
+# Set environment variables
+ENV PYTHONUNBUFFERED=1 \
+    # Ensures that Python output is sent straight to terminal without being first buffered
+    # and that can be helpful for logging.
+    PIP_NO_CACHE_DIR=off \
+    # Disables pip caching, which can reduce image size.
+    PIP_DISABLE_PIP_VERSION_CHECK=on \
+    # Disables the check for a new version of pip, speeding up builds.
+    PIP_DEFAULT_TIMEOUT=100 \
+    # Increases the default timeout for pip.
+    HF_HUB_DISABLE_SYMLINKS_WARNING=1
+    # To suppress the symlink warning from huggingface_hub
+# Create a non-root user and switch to it
+RUN useradd -m -u 1000 user
+USER user
+# Add the user's local bin directory to PATH
+ENV PATH="/home/user/.local/bin:$PATH"
+# Set the working directory in the container
+WORKDIR /app
+# Copy requirements.txt first to leverage Docker cache
+COPY --chown=user ./requirements.txt requirements.txt
+# Install dependencies
+# Using --no-cache-dir to reduce image size further
+RUN pip install --no-cache-dir --upgrade -r requirements.txt
+# Copy the rest of the application code into the container
+# This includes app.py, model_files/, static/, templates/, LICENSE, README.md
+COPY --chown=user . .
+# Expose the port the app will run on. HF Spaces expects 7860 for Docker.
+EXPOSE 7860
+# Command to run the application using Gunicorn
+# It will listen on all interfaces (0.0.0.0) on port 7860.
+# app:app means "in the file app.py, use the Flask instance named app".
+CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "1", "--threads", "2", "--timeout", "0", "app:app"]
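
For readers who prefer a config file over a long CMD line, the same Gunicorn settings can be sketched as a `gunicorn.conf.py`. This file is hypothetical and is not part of the commit; it only restates the flags above using Gunicorn's standard setting names, and would be passed with `gunicorn -c gunicorn.conf.py app:app`.

```python
# gunicorn.conf.py (hypothetical file, not part of this commit).
# Each setting mirrors one flag from the Dockerfile's CMD line.

bind = "0.0.0.0:7860"  # listen on all interfaces on the port the Space exposes
workers = 1            # one worker process; each worker imports app.py on its own
threads = 2            # two threads per worker for concurrent requests
timeout = 0            # 0 disables Gunicorn's worker timeout entirely
```

Keeping a single worker is presumably a memory trade-off on a CPU-only tier, since every extra worker would load its own copy of the model; that reading is an assumption, not something the commit states.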

README.md CHANGED

@@ -1,12 +1,39 @@
 ---
-title: BanglaFeel
-emoji: 🚀
-colorFrom: red
-colorTo: purple
 sdk: docker
 pinned: false
 license: apache-2.0
-short_description: A customized Back Transliteration Model
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: BanglaFeel Translator
+emoji: 🌍💬
+colorFrom: blue
+colorTo: green
 sdk: docker
 pinned: false
 license: apache-2.0
+app_port: 7860 # IMPORTANT: Tell Hugging Face which port your app EXPOSES
 ---
+# BanglaFeel Translator
+A Flask web application for English to Bengali transliteration using a custom-trained DualEncoderDecoder model.
+## How to Use
+Visit the deployed Space URL and type or paste English text into the input box. Click "Translate" to see the Bengali transliteration.
+## Model Details
+This model is a custom architecture (DualEncoderDecoder) combining T5 (csebuetnlp/banglat5) with a hybrid character CNN and word LSTM encoder.
+*   **Base T5 Model:** `csebuetnlp/banglat5`
+*   **Base Encoder Tokenizer:** `csebuetnlp/banglabert`
+*   **Custom Components:** CharCNN, WordLSTM, HybridEncoder
+*   Trained for English to Bengali transliteration.
+## Intended Uses & Limitations
+*   **Intended Use:** Transliteration of English text (phonetically representing Bengali words) into Bengali script.
+*   **Limitations:**
+    *   May not handle all English phonetic variations perfectly.
+    *   Performance depends on the training data.
+    *   Currently handles inputs up to 500 characters.
+    *   The free hosting tier might experience cold starts.
+## License
+The code and model are licensed under the Apache License 2.0. See the `LICENSE` file for details.
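
The README states a 500-character input limit but does not show how it is enforced. A minimal sketch of such a check follows, assuming the limit applies to the stripped input string; `MAX_INPUT_CHARS` and `validate_input` are hypothetical names and may not match what app.py actually does.

```python
MAX_INPUT_CHARS = 500  # limit stated in the README; the constant name is hypothetical


def validate_input(text, max_chars=MAX_INPUT_CHARS):
    """Return (cleaned_text, error); exactly one of the two is None."""
    if not isinstance(text, str):
        return None, "Input must be a string."
    cleaned = text.strip()
    if not cleaned:
        return None, "Input is empty."
    if len(cleaned) > max_chars:
        return None, f"Input exceeds {max_chars} characters."
    return cleaned, None
```

A request handler could call `validate_input` before tokenization and return the error message with an HTTP 400 status when the second element is not `None`.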

app.py ADDED

	@@ -0,0 +1,627 @@

+import os
+import sys
+import random
+import torch
+import numpy as np
+from flask import Flask, request, jsonify, render_template
+os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import T5Tokenizer, AutoTokenizer, T5ForConditionalGeneration
+from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions
+# Get the directory of the current script (app.py)
+APP_ROOT = os.path.dirname(os.path.abspath(__file__))
+MODEL_FILES_DIR = os.path.join(APP_ROOT, 'model_files') # Path to your model_files directory
+# Ensure CFG.device is set to CPU for Hugging Face Spaces free tier
+# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Original
+device = torch.device('cpu') # MODIFIED FOR HF SPACES
+class CFG:
+    model_name = 'csebuetnlp/banglat5' # This is used for initial T5 model loading
+    encoder_name = 'csebuetnlp/banglabert' # This is used for initial encoder tokenizer loading
+    batch_size = 1
+    max_len = 512
+    seed = 42
+    device = device # Use the modified device
+# The model classes (CharCNNEncoder, WordLSTMEncoder, HybridEncoder, DualEncoderDecoder)
+# are defined further down in this file.
+# The initial tokenizer loading below downloads from the Hub. That is acceptable here,
+# because load_checkpoint later replaces both tokenizers with the saved copies under
+# model_files/tokenizers, loaded with local_files_only=True.
+# Avoiding any Hub download would require model_files/tokenizers to contain everything
+# T5Tokenizer.from_pretrained needs with local_files_only=True from the very start.
+CFG.t5_tokenizer = T5Tokenizer.from_pretrained(
+    CFG.model_name,
+    legacy=False,
+    model_max_length=CFG.max_len
+)
+if CFG.t5_tokenizer.pad_token is None:
+    CFG.t5_tokenizer.pad_token = CFG.t5_tokenizer.eos_token
+if CFG.t5_tokenizer.bos_token is None:
+    CFG.t5_tokenizer.bos_token = CFG.t5_tokenizer.eos_token
+CFG.encoder_tokenizer = AutoTokenizer.from_pretrained(
+    CFG.encoder_name,
+    model_max_length=CFG.max_len
+)
+if CFG.encoder_tokenizer.pad_token is None:
+    CFG.encoder_tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # This line might change vocab size if not already done
+    CFG.encoder_tokenizer.pad_token = '[PAD]'
+def compute_max_char_len(texts, default=50):
+    # Length of the longest whitespace-separated word across all string inputs.
+    # Falls back to `default` when there are no words to measure (empty list,
+    # non-string items, or whitespace-only strings), instead of raising on max().
+    word_lengths = [
+        len(word)
+        for text in (texts or [])
+        if isinstance(text, str)
+        for word in text.split()
+    ]
+    return max(word_lengths) if word_lengths else default
+# --- Model components: CharCNNEncoder, WordLSTMEncoder, HybridEncoder, DualEncoderDecoder ---
+# These must be defined before load_checkpoint is called.
+class CharCNNEncoder(nn.Module):
+    def __init__(self, char_vocab_size, char_embedding_dim, char_cnn_output_dim, kernel_sizes, num_filters, dropout=0.1):
+        super(CharCNNEncoder, self).__init__()
+        self.char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim, padding_idx=0)
+        self.conv_layers = nn.ModuleList()
+        for ks, nf in zip(kernel_sizes, num_filters):
+            self.conv_layers.append(
+                nn.Sequential(
+                    nn.Conv1d(char_embedding_dim, nf, kernel_size=ks, padding=ks // 2),
+                    nn.ReLU(),
+                    nn.AdaptiveMaxPool1d(1)
+                )
+            )
+        self.dropout = nn.Dropout(dropout)
+        self.output_projection = nn.Linear(sum(num_filters), char_cnn_output_dim)
+    def forward(self, char_input):
+        batch_size, seq_len, char_len = char_input.size()
+        char_input = char_input.view(-1, char_len)
+        char_emb = self.char_embedding(char_input)
+        char_emb = char_emb.permute(0, 2, 1)
+        conv_outputs = [conv(char_emb) for conv in self.conv_layers]
+        concat_output = torch.cat(conv_outputs, dim=1)
+        concat_output = concat_output.squeeze(-1)
+        concat_output = self.dropout(concat_output)
+        char_cnn_output = self.output_projection(concat_output)
+        char_cnn_output = char_cnn_output.view(batch_size, seq_len, -1)
+        return char_cnn_output
+class WordLSTMEncoder(nn.Module):
+    def __init__(self, word_vocab_size, word_embedding_dim, word_lstm_hidden_dim, num_lstm_layers, dropout):
+        super(WordLSTMEncoder, self).__init__()
+        # Ensure CFG.encoder_tokenizer is loaded before this class is instantiated if padding_idx relies on it.
+        # The current code structure loads CFG.encoder_tokenizer globally first.
+        padding_idx_val = CFG.encoder_tokenizer.pad_token_id if hasattr(CFG, 'encoder_tokenizer') and CFG.encoder_tokenizer.pad_token_id is not None else 0
+        self.word_embedding = nn.Embedding(
+            word_vocab_size,
+            word_embedding_dim,
+            padding_idx=padding_idx_val
+        )
+        self.lstm = nn.LSTM(
+            word_embedding_dim,
+            word_lstm_hidden_dim,
+            num_layers=num_lstm_layers,
+            batch_first=True,
+            dropout=dropout,
+            bidirectional=True
+        )
+        self.output_projection = nn.Linear(2 * word_lstm_hidden_dim, word_lstm_hidden_dim)
+    def forward(self, word_input, sequence_lengths):
+        batch_size = word_input.size(0)
+        word_emb = self.word_embedding(word_input)
+        # Ensure sequence_lengths is on CPU for sorting and pack_padded_sequence
+        sequence_lengths_cpu = sequence_lengths.cpu()
+        # Handle cases where all sequence lengths might be zero, which can cause issues with sorting
+        if torch.all(sequence_lengths_cpu == 0):
+            # If all lengths are 0, LSTM output will be zeros.
+            # We need to create zero tensors of the expected shape.
+            # This is a simplified handling; a more robust solution might be needed
+            # depending on how zero-length sequences are meant to be processed.
+            lstm_out = torch.zeros(batch_size, word_input.size(1), self.lstm.hidden_size * 2, device=word_input.device)
+            hidden = torch.zeros(batch_size, self.lstm.hidden_size * 2, device=word_input.device)
+            return self.output_projection(hidden), lstm_out
+        sorted_lengths, sort_idx = sequence_lengths_cpu.sort(0, descending=True)
+        sorted_word_emb = word_emb[sort_idx]
+        # Filter out zero-length sequences before packing if pack_padded_sequence requires it
+        # For PyTorch versions where pack_padded_sequence handles zero lengths in sorted_lengths:
+        packed_word_emb = nn.utils.rnn.pack_padded_sequence(
+            sorted_word_emb,
+            sorted_lengths.clamp(min=1), # Ensure lengths are at least 1 for packing if issues arise
+            batch_first=True,
+            enforce_sorted=True # This is important
+        )
+        packed_lstm_out, (hidden_state, cell_state) = self.lstm(packed_word_emb)
+        lstm_out, _ = nn.utils.rnn.pad_packed_sequence(
+            packed_lstm_out,
+            batch_first=True,
+            total_length=word_input.size(1)
+        )
+        _, unsort_idx = sort_idx.sort(0)
+        lstm_out = lstm_out[unsort_idx]
+        # Process hidden state correctly for bidirectional LSTM
+        # hidden_state is (num_layers * num_directions, batch, hidden_size)
+        # We want the last layer's hidden states (forward and backward)
+        hidden_state = hidden_state.view(self.lstm.num_layers, 2, batch_size, self.lstm.hidden_size) # 2 for bidirectional
+        hidden_state_last_layer = hidden_state[-1] # Get the last layer
+        # Concatenate forward and backward hidden states: (batch, 2 * hidden_size)
+        final_hidden = torch.cat((hidden_state_last_layer[0], hidden_state_last_layer[1]), dim=1)
+        final_hidden = final_hidden[unsort_idx] # Unsort to original batch order
+        return self.output_projection(final_hidden), lstm_out
+class HybridEncoder(nn.Module):
+    def __init__(self, char_cnn_encoder, word_lstm_encoder, hybrid_encoder_output_dim):
+        super(HybridEncoder, self).__init__()
+        self.char_cnn_encoder = char_cnn_encoder
+        self.word_lstm_encoder = word_lstm_encoder
+        self.char_hidden_size = char_cnn_encoder.output_projection.out_features
+        # WordLSTMEncoder's projection layer maps the concatenated final hidden states
+        # (2 * hidden_dim) back down to word_lstm_hidden_dim.
+        self.lstm_projected_hidden_size = word_lstm_encoder.output_projection.out_features
+        # The actual output from LSTM itself before projection is 2 * lstm.hidden_size for sequence outputs
+        self.lstm_sequence_output_size = word_lstm_encoder.lstm.hidden_size * 2
+        # The output_projection should combine char_cnn_output and the sequence output of LSTM
+        self.output_projection = nn.Linear(self.char_hidden_size + self.lstm_sequence_output_size, hybrid_encoder_output_dim)
+    def forward(self, char_input, word_input, sequence_lengths):
+        batch_size = char_input.size(0)
+        max_seq_len = word_input.size(1) # Assuming word_input determines max_seq_len
+        char_cnn_output = self.char_cnn_encoder(char_input) # (batch_size, char_seq_len, char_cnn_output_dim)
+        # Ensure sequence_lengths is on the same device as the model/input
+        sequence_lengths = sequence_lengths.to(word_input.device)
+        _, lstm_sequence_output = self.word_lstm_encoder(word_input, sequence_lengths) # (batch_size, word_seq_len, 2 * lstm_hidden_dim)
+        # Pad/truncate char_cnn_output and lstm_sequence_output to a common max_seq_len if they differ
+        # This assumes char_input and word_input might correspond to different tokenization granularities
+        # For simplicity, let's assume they are aligned or word_input's seq_len is the target.
+        # Pad CharCNN outputs if its sequence length is less than max_seq_len from word_input
+        if char_cnn_output.size(1) < max_seq_len:
+            padding_size = max_seq_len - char_cnn_output.size(1)
+            char_cnn_output = F.pad(char_cnn_output, (0, 0, 0, padding_size), "constant", 0)
+        elif char_cnn_output.size(1) > max_seq_len:
+            char_cnn_output = char_cnn_output[:, :max_seq_len, :]
+        # Pad LSTM outputs if its sequence length is less than max_seq_len (should not happen if total_length in pad_packed_sequence is max_seq_len)
+        # This check is more of a safeguard.
+        if lstm_sequence_output.size(1) < max_seq_len:
+            padding_size = max_seq_len - lstm_sequence_output.size(1)
+            lstm_sequence_output = F.pad(lstm_sequence_output, (0, 0, 0, padding_size), "constant", 0)
+        elif lstm_sequence_output.size(1) > max_seq_len:
+            lstm_sequence_output = lstm_sequence_output[:, :max_seq_len, :]
+        hybrid_output_concat = torch.cat((char_cnn_output, lstm_sequence_output), dim=2)
+        hybrid_encoder_output = self.output_projection(hybrid_output_concat)
+        return hybrid_encoder_output
+class DualEncoderDecoder(nn.Module):
+    def __init__(self, t5_model_name, hybrid_encoder, t5_tokenizer, freeze_t5=False):
+        super(DualEncoderDecoder, self).__init__()
+        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_model_name)
+        self.t5_tokenizer = t5_tokenizer # Store tokenizer if needed for vocab size etc.
+        self.hybrid_encoder = hybrid_encoder
+        encoder_hidden_size = self.t5.config.d_model
+        hybrid_hidden_size = hybrid_encoder.output_projection.out_features # This is hybrid_encoder_output_dim
+        self.encoder_projection = nn.Linear(encoder_hidden_size + hybrid_hidden_size, encoder_hidden_size)
+        if freeze_t5:
+            for param in self.t5.parameters():
+                param.requires_grad = False
+        # Resize T5 token embeddings if tokenizer vocab size changed (e.g., by adding special tokens)
+        # This should ideally be done *after* loading CFG.t5_tokenizer in load_checkpoint
+        # if CFG.t5_tokenizer is the one tied to the model.
+        # self.t5.resize_token_embeddings(len(self.t5_tokenizer)) # Moved to load_checkpoint
+    def forward(self, input_ids, attention_mask, char_input, word_input, sequence_lengths, labels=None):
+        # T5 Encoder
+        t5_encoder_outputs_dict = self.t5.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+        t5_encoder_last_hidden_state = t5_encoder_outputs_dict.last_hidden_state # (batch, seq_len_t5, d_model)
+        # Hybrid Encoder
+        # Ensure sequence_lengths is on the correct device for hybrid_encoder
+        sequence_lengths = sequence_lengths.to(char_input.device)
+        hybrid_encoder_output = self.hybrid_encoder(char_input, word_input, sequence_lengths) # (batch, seq_len_hybrid, hybrid_output_dim)
+        # Determine common sequence length for concatenation
+        # Typically, input_ids for T5 and word_input for hybrid encoder should have compatible sequence lengths.
+        # If they are from different tokenizations, alignment or choosing one as primary is needed.
+        # Assuming t5_encoder_last_hidden_state's seq_len is the target.
+        common_seq_len = t5_encoder_last_hidden_state.size(1)
+        # Pad or truncate hybrid_encoder_output to match common_seq_len
+        if hybrid_encoder_output.size(1) < common_seq_len:
+            padding_size = common_seq_len - hybrid_encoder_output.size(1)
+            hybrid_encoder_output = F.pad(hybrid_encoder_output, (0, 0, 0, padding_size), "constant", 0)
+        elif hybrid_encoder_output.size(1) > common_seq_len:
+            hybrid_encoder_output = hybrid_encoder_output[:, :common_seq_len, :]
+        # Concatenate along the feature dimension
+        concat_encoder_output = torch.cat((t5_encoder_last_hidden_state, hybrid_encoder_output), dim=2)
+        projected_encoder_output = self.encoder_projection(concat_encoder_output) # (batch, common_seq_len, d_model)
+        # Create BaseModelOutputWithPastAndCrossAttentions for T5 decoder
+        # The attention_mask here should correspond to the projected_encoder_output.
+        # If t5_encoder_last_hidden_state's seq_len was used, its attention_mask is appropriate.
+        encoder_outputs_for_decoder = BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=projected_encoder_output,
+            # past_key_values=None, # T5 internal
+            # hidden_states=None, # T5 internal
+            # attentions=None # T5 internal
+        )
+        # T5 Decoder
+        # The `attention_mask` passed to the T5 model here is for the *decoder's* cross-attention
+        # to the `encoder_outputs_for_decoder`. So it should match the sequence length of `projected_encoder_output`.
+        decoder_outputs = self.t5(
+            encoder_outputs=encoder_outputs_for_decoder, # Pass the combined & projected outputs
+            attention_mask=attention_mask, # This is the original T5 input attention mask, matching its seq_len
+            labels=labels,
+            return_dict=True,
+            use_cache=False # Important for training, can be True for faster inference if handled
+        )
+        return decoder_outputs
+    def generate(self, input_ids, attention_mask, char_input, word_input, sequence_lengths, max_length, num_beams):
+        # Similar to forward pass for encoder part
+        t5_encoder_outputs_dict = self.t5.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+        t5_encoder_last_hidden_state = t5_encoder_outputs_dict.last_hidden_state
+        sequence_lengths = sequence_lengths.to(char_input.device)
+        hybrid_encoder_output = self.hybrid_encoder(char_input, word_input, sequence_lengths)
+        common_seq_len = t5_encoder_last_hidden_state.size(1)
+        if hybrid_encoder_output.size(1) < common_seq_len:
+            padding_size = common_seq_len - hybrid_encoder_output.size(1)
+            hybrid_encoder_output = F.pad(hybrid_encoder_output, (0, 0, 0, padding_size), "constant", 0)
+        elif hybrid_encoder_output.size(1) > common_seq_len:
+            hybrid_encoder_output = hybrid_encoder_output[:, :common_seq_len, :]
+        concat_encoder_output = torch.cat((t5_encoder_last_hidden_state, hybrid_encoder_output), dim=2)
+        projected_encoder_output = self.encoder_projection(concat_encoder_output)
+        encoder_outputs_for_generate = BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=projected_encoder_output
+        )
+        # Use T5's generate method
+        generated_ids_dict = self.t5.generate(
+            encoder_outputs=encoder_outputs_for_generate,
+            attention_mask=attention_mask, # Original T5 input attention mask
+            max_length=max_length,
+            num_beams=num_beams,
+            early_stopping=True,
+            use_cache=True, # Can be True for generation
+            return_dict_in_generate=True, # Ensures output is a dict-like object
+            # eos_token_id=self.t5_tokenizer.eos_token_id, # Good practice
+            # pad_token_id=self.t5_tokenizer.pad_token_id  # Good practice
+        )
+        return generated_ids_dict.sequences # .sequences attribute contains the generated token ids
+def load_checkpoint(path_to_checkpoint_file):
+    # path_to_checkpoint_file is now the full path to best_model.pth
+    if not os.path.exists(path_to_checkpoint_file):
+        print("No checkpoint file found at:", path_to_checkpoint_file)
+        # sys.exit(1) # Avoid exiting in a web app, raise an error or handle
+        raise FileNotFoundError(f"No checkpoint file found at: {path_to_checkpoint_file}")
+    print(f"Loading checkpoint from: {path_to_checkpoint_file}")
+    checkpoint = torch.load(path_to_checkpoint_file, map_location=CFG.device)
+    # checkpoint_dir is the directory containing best_model.pth, which is MODEL_FILES_DIR
+    checkpoint_dir = os.path.dirname(path_to_checkpoint_file)
+    # Path to the 'tokenizers' subdirectory within checkpoint_dir
+    tokenizer_base_save_path = os.path.join(checkpoint_dir, 'tokenizers')
+    t5_tokenizer_dir_path = os.path.join(tokenizer_base_save_path, 't5_tokenizer')
+    encoder_tokenizer_dir_path = os.path.join(tokenizer_base_save_path, 'encoders_tokenizer')
+    if not os.path.isdir(t5_tokenizer_dir_path):
+        raise FileNotFoundError(
+            f"T5 tokenizer directory not found: {t5_tokenizer_dir_path}. "
+            "Ensure tokenizers were saved correctly (e.g., using tokenizer.save_pretrained())."
+        )
+    if not os.path.isdir(encoder_tokenizer_dir_path):
+        raise FileNotFoundError(
+            f"Encoder tokenizer directory not found: {encoder_tokenizer_dir_path}. "
+            "Ensure tokenizers were saved correctly."
+        )
+    print(f"Loading T5 tokenizer from: {t5_tokenizer_dir_path}")
+    CFG.t5_tokenizer = T5Tokenizer.from_pretrained(
+        t5_tokenizer_dir_path,
+        legacy=False,
+        model_max_length=CFG.max_len,
+        local_files_only=True
+    )
+    if CFG.t5_tokenizer.pad_token is None:
+        CFG.t5_tokenizer.pad_token = CFG.t5_tokenizer.eos_token
+    if CFG.t5_tokenizer.bos_token is None:
+        CFG.t5_tokenizer.bos_token = CFG.t5_tokenizer.eos_token
+    print(f"Loading encoder tokenizer from: {encoder_tokenizer_dir_path}")
+    CFG.encoder_tokenizer = AutoTokenizer.from_pretrained(
+        encoder_tokenizer_dir_path,
+        model_max_length=CFG.max_len,
+        local_files_only=True
+    )
+    if CFG.encoder_tokenizer.pad_token is None:
+        print(f"Warning: Loaded encoder tokenizer from {encoder_tokenizer_dir_path} has no pad_token defined in its config.")
+        # If it was added during training and saved, it should be there.
+        # If it's missing, and your WordLSTMEncoder relies on a specific pad_token_id,
+        # you might need to manually set it here if the saved config doesn't have it.
+        # e.g., CFG.encoder_tokenizer.pad_token = '[PAD]'
+        #       CFG.encoder_tokenizer.pad_token_id = CFG.encoder_tokenizer.convert_tokens_to_ids('[PAD]')
+        # However, if add_special_tokens({'pad_token': '[PAD]'}) was called before saving,
+        # it should be part of the saved tokenizer's configuration.
+    loaded_config_from_checkpoint = checkpoint['config'] # Renamed to avoid conflict
+    loaded_char_to_id = checkpoint['char_to_id']
+    loaded_id_to_char = checkpoint['id_to_char']
+    model_architecture = checkpoint['model_architecture']
+    # Update CFG with specifics from the loaded config if they exist
+    for key, value in loaded_config_from_checkpoint.items():
+        setattr(CFG, key, value)
+    # CRITICAL: Re-assign CFG.device after loading config, in case it was saved in checkpoint
+    # Or better, ensure device is not part of saved 'config' if you want to control it externally.
+    # For HF Spaces, we want CPU.
+    CFG.device = device # Ensure our desired device (CPU) is set
+    loaded_max_char_len = model_architecture.get('max_char_len', 50) # Default if not in checkpoint
+    # Re-initialize model components with parameters from the checkpoint
+    char_cnn_encoder = CharCNNEncoder(
+        char_vocab_size=model_architecture['char_vocab_size'],
+        char_embedding_dim=model_architecture['char_embedding_dim'],
+        char_cnn_output_dim=model_architecture['char_cnn_output_dim'],
+        kernel_sizes=model_architecture['kernel_sizes'],
+        num_filters=model_architecture['num_filters'],
+        dropout=model_architecture.get('dropout', 0.1) # Use .get for robustness
+    )
+    # Ensure word_vocab_size matches the re-loaded encoder_tokenizer's vocab size
+    # The one in model_architecture was from training time.
+    current_encoder_vocab_size = len(CFG.encoder_tokenizer)
+    if model_architecture['word_vocab_size'] != current_encoder_vocab_size:
+        print(f"Warning: Word vocab size mismatch. Checkpoint: {model_architecture['word_vocab_size']}, "
+              f"Loaded CFG.encoder_tokenizer: {current_encoder_vocab_size}. Using loaded tokenizer's size.")
+    word_lstm_encoder = WordLSTMEncoder(
+        word_vocab_size=current_encoder_vocab_size, # Use current vocab size
+        word_embedding_dim=model_architecture['word_embedding_dim'],
+        word_lstm_hidden_dim=model_architecture['word_lstm_hidden_dim'],
+        num_lstm_layers=model_architecture['num_lstm_layers'],
+        dropout=model_architecture.get('dropout', 0.1)
+    )
+    hybrid_encoder = HybridEncoder(
+        char_cnn_encoder,
+        word_lstm_encoder,
+        hybrid_encoder_output_dim=model_architecture['hybrid_encoder_output_dim']
+    )
+    # Use the model_name from the *checkpoint's config* for T5 base
+    # This ensures consistency with the trained model's base.
+    model_base_name_for_t5 = loaded_config_from_checkpoint.get('model_name', CFG.model_name)
+    print(f"Initializing DualEncoderDecoder with T5 base: {model_base_name_for_t5}")
+    model = DualEncoderDecoder(
+        t5_model_name=model_base_name_for_t5,
+        hybrid_encoder=hybrid_encoder,
+        t5_tokenizer=CFG.t5_tokenizer # Pass the loaded T5 tokenizer
+    )
+    # Resize T5 token embeddings based on the *loaded* CFG.t5_tokenizer
+    # This is important if CFG.t5_tokenizer (loaded from local files) has a different vocab size
+    # than the one from T5ForConditionalGeneration.from_pretrained(model_base_name_for_t5)
+    model.t5.resize_token_embeddings(len(CFG.t5_tokenizer))
+    print("Loading model state_dict...")
+    # strict=False only tolerates missing or unexpected keys in the state_dict.
+    # It does not tolerate shape mismatches: if the WordLSTMEncoder embedding was rebuilt
+    # with a different vocab size than the checkpoint's, load_state_dict still raises.
+    # strict=True is safer whenever the keys are expected to match exactly.
+    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
+    model.to(CFG.device)
+    model.eval()
+    print("Model loaded successfully.")
+    return model, loaded_char_to_id, loaded_id_to_char, loaded_max_char_len
+# -------------
+# Helper methods (tokenize_characters, pad_sequence, process_input)
+# Ensure these are present and correct as per your original file
+# -------------
+def tokenize_characters(word, char_to_id): # Ensure char_to_id has <UNK> and <PAD>
+    # Default to <UNK> if char_to_id is not available or char not found
+    unk_token_id = char_to_id.get("<UNK>", 0) # Assuming 0 could be a fallback for <UNK> if not explicitly defined
+    return [char_to_id.get(char, unk_token_id) for char in word]
+def pad_sequence(sequence, max_length, pad_value): # Ensure pad_value is valid
+    if len(sequence) > max_length:
+        sequence = sequence[:max_length]
+    return sequence + [pad_value] * (max_length - len(sequence))
+def process_input(text, t5_tokenizer, encoder_tokenizer, char_to_id, current_max_char_len, max_token_len):
+    # Use current_max_char_len passed from loaded checkpoint
+    # Use max_token_len for t5_tokenizer and encoder_tokenizer max_length
+    t5_inputs = t5_tokenizer(
+        text,
+        return_tensors='pt',
+        padding='max_length',
+        truncation=True,
+        max_length=max_token_len, # Use max_token_len
+        add_special_tokens=True
+    )
+    encoder_inputs = encoder_tokenizer(
+        text,
+        return_tensors='pt',
+        padding='max_length',
+        truncation=True,
+        max_length=max_token_len # Use max_token_len
+    )
+    # Squeeze to remove batch dim if batch_size is 1, then unsqueeze later if model expects batch dim
+    # Assuming single instance processing here.
+    t5_input_ids_squeezed = t5_inputs['input_ids'].squeeze(0) # (seq_len)
+    t5_attention_mask_squeezed = t5_inputs['attention_mask'].squeeze(0) # (seq_len)
+    encoder_input_ids_squeezed = encoder_inputs['input_ids'].squeeze(0) # (seq_len)
+    # encoder_attention_mask_squeezed = encoder_inputs['attention_mask'].squeeze(0) # (seq_len)
+    # Max sequence length after tokenization for this specific input
+    actual_max_seq_len = encoder_input_ids_squeezed.shape[0]
+    char_input_tensor = torch.zeros((actual_max_seq_len, current_max_char_len), dtype=torch.long)
+    # Get pad_token_id for characters, ensure <PAD> is in char_to_id
+    char_pad_id = char_to_id.get("<PAD>", 0) # Default to 0 if <PAD> not in char_to_id
+    for j in range(actual_max_seq_len):
+        token_id = encoder_input_ids_squeezed[j].item()
+        # Avoid decoding special tokens like [PAD], [CLS], [SEP] into words for char tokenization
+        # if token_id in encoder_tokenizer.all_special_ids:
+        if token_id == encoder_tokenizer.pad_token_id or \
+           token_id == encoder_tokenizer.cls_token_id or \
+           token_id == encoder_tokenizer.sep_token_id or \
+           token_id == encoder_tokenizer.eos_token_id or \
+           token_id == encoder_tokenizer.bos_token_id:
+            word = "" # Treat special tokens as empty for char processing or handle as needed
+        else:
+            word = encoder_tokenizer.decode([token_id], skip_special_tokens=True).strip()
+        if not word: # Empty word or special token
+            char_ids = [char_pad_id] * current_max_char_len # Pad with <PAD> char id
+        else:
+            char_ids = tokenize_characters(word, char_to_id)
+            char_ids = pad_sequence(char_ids, current_max_char_len, char_pad_id)
+        char_input_tensor[j, :] = torch.tensor(char_ids, dtype=torch.long)
+    # Word-level sequence length used by WordLSTMEncoder: the number of non-padding encoder tokens.
+    # attention_mask has shape (batch, seq_len), so summing over dim=1 already yields shape (batch,).
+    sequence_lengths_tensor = encoder_inputs['attention_mask'].sum(dim=1).long()
+    return {
+        # Unsqueeze(0) to add batch dimension back for the model
+        't5_input_ids': t5_input_ids_squeezed.unsqueeze(0).to(CFG.device),
+        't5_attention_mask': t5_attention_mask_squeezed.unsqueeze(0).to(CFG.device),
+        'encoder_input_ids': encoder_input_ids_squeezed.unsqueeze(0).to(CFG.device),
+        # 'encoder_attention_mask' is not passed to model.generate; t5_attention_mask is used instead
+        'char_input': char_input_tensor.unsqueeze(0).to(CFG.device),
+        'sequence_lengths': sequence_lengths_tensor.to(CFG.device) # Should be (batch_size,)
+    }
+# ----------------------------
+# FLASK SETUP
+# ----------------------------
+app = Flask(__name__)
+# MODIFIED: Define checkpoint path relative to MODEL_FILES_DIR
+checkpoint_file_path = os.path.join(MODEL_FILES_DIR, "best_model.pth")
+# Load your trained model and dictionaries ONCE at startup
+print("Initializing model...")
+try:
+    # model, char_to_id, id_to_char, max_char_len_loaded = load_checkpoint(checkpoint_file_path)
+    # Re-assign to global/module-level variables if you need them outside this scope,
+    # or pass them around. For Flask app, making them global for handlers is common.
+    loaded_model, loaded_char_to_id, loaded_id_to_char, loaded_max_char_len = load_checkpoint(checkpoint_file_path)
+except Exception as e:
+    print(f"FATAL: Could not load model on startup: {e}")
+    # In a real app, you might want to prevent Flask from starting or return errors
+    # sys.exit(1) # Not ideal for a web server trying to start
+    loaded_model = None # Indicate model loading failed
+@app.route('/')
+def index():
+    return render_template('index.html')
+@app.route('/translate', methods=['POST'])
+def translate_text():
+    if loaded_model is None: # Check if model failed to load
+        return jsonify({"error": "Model is not available. Please check server logs."}), 500
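+    # silent=True returns None instead of raising when the body is missing or is not valid JSON
+    data = request.get_json(silent=True) or {}
+    input_text = data.get('text', '') if isinstance(data, dict) else ''
+    if not input_text:
+        return jsonify({"error": "No text provided"}), 400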
+    try:
+        # Process the input through your pipeline
+        # Use the globally loaded CFG.t5_tokenizer, CFG.encoder_tokenizer,
+        # loaded_char_to_id, and loaded_max_char_len
+        inputs = process_input(
+            input_text,
+            CFG.t5_tokenizer,
+            CFG.encoder_tokenizer,
+            loaded_char_to_id,
+            loaded_max_char_len, # Use the max_char_len loaded from checkpoint
+            CFG.max_len          # Use CFG.max_len for token sequence length
+        )
+        # Generate translation
+        with torch.no_grad():
+            generated_ids = loaded_model.generate( # Use the loaded_model
+                inputs['t5_input_ids'],
+                inputs['t5_attention_mask'],
+                inputs['char_input'],
+                inputs['encoder_input_ids'], # This is word_input for HybridEncoder
+                inputs['sequence_lengths'],
+                max_length=CFG.max_len, # Max generation length
+                num_beams=4
+            )
+        translation = CFG.t5_tokenizer.decode(
+            generated_ids[0],
+            skip_special_tokens=True,
+            clean_up_tokenization_spaces=True
+        ).strip()
+        return jsonify({"translation": translation})
+    except Exception as e:
+        print(f"Error during translation: {e}") # Log the error
+        # import traceback
+        # traceback.print_exc() # For more detailed logs during debugging
+        return jsonify({"error": "An error occurred during translation."}), 500
+if __name__ == '__main__':
+    # Port for Hugging Face Spaces is usually set via PORT environment variable
+    port = int(os.environ.get("PORT", 7860))
+    # For local testing, debug=True is fine. For HF Spaces, it will be run by their infrastructure.
+    # Setting debug=False for production-like environments.
+    # The host='0.0.0.0' makes it accessible externally (needed for Docker/HF Spaces).
+    app.run(host='0.0.0.0', port=port, debug=False)
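
The character-level input that `process_input` builds (one fixed-width row of character ids per encoder token, with special tokens mapped to an all-padding row) can be illustrated without PyTorch. The sketch below is a minimal stand-in under stated assumptions: `chars_to_ids`, `pad_or_truncate`, and `build_char_matrix` are hypothetical names, not the `tokenize_characters` and `pad_sequence` helpers from `app.py` (their definitions are not shown in this part of the diff), and both the `<UNK>` fallback and the truncation of over-long words are assumptions.

```python
from typing import Dict, List

PAD = "<PAD>"  # padding character key, as looked up in app.py
UNK = "<UNK>"  # assumed key for unknown characters (not confirmed by the diff)


def chars_to_ids(word: str, char_to_id: Dict[str, int]) -> List[int]:
    """Map each character to its id, falling back to the unknown-character id."""
    unk_id = char_to_id.get(UNK, 0)
    return [char_to_id.get(ch, unk_id) for ch in word]


def pad_or_truncate(ids: List[int], length: int, pad_id: int) -> List[int]:
    """Return exactly `length` ids: cut long inputs, right-pad short ones."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


def build_char_matrix(words: List[str], char_to_id: Dict[str, int],
                      max_char_len: int) -> List[List[int]]:
    """One row per word; an empty word (e.g. a special token) becomes all padding."""
    pad_id = char_to_id.get(PAD, 0)
    return [
        pad_or_truncate(chars_to_ids(word, char_to_id), max_char_len, pad_id)
        for word in words
    ]


vocab = {"<PAD>": 0, "<UNK>": 1, "a": 2, "m": 3, "i": 4}
print(build_char_matrix(["", "ami", "az"], vocab, max_char_len=4))
# [[0, 0, 0, 0], [2, 3, 4, 0], [2, 1, 0, 0]]
```

Each row has the same width, so the nested list could be converted to a `(seq_len, max_char_len)` tensor in one call instead of being filled row by row.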

model_files/tokenizers/encoders_tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

model_files/tokenizers/encoders_tokenizer/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

model_files/tokenizers/encoders_tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "full_tokenizer_file": null,
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": false,
+  "tokenizer_class": "ElectraTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

model_files/tokenizers/encoders_tokenizer/vocab.txt ADDED

The diff for this file is too large to render. See raw diff

model_files/tokenizers/t5_tokenizer/added_tokens.json ADDED

	@@ -0,0 +1,130 @@

+{
+  "<extra_id_0>": 32099,
+  "<extra_id_10>": 32089,
+  "<extra_id_11>": 32088,
+  "<extra_id_12>": 32087,
+  "<extra_id_13>": 32086,
+  "<extra_id_14>": 32085,
+  "<extra_id_15>": 32084,
+  "<extra_id_16>": 32083,
+  "<extra_id_17>": 32082,
+  "<extra_id_18>": 32081,
+  "<extra_id_19>": 32080,
+  "<extra_id_1>": 32098,
+  "<extra_id_20>": 32079,
+  "<extra_id_21>": 32078,
+  "<extra_id_22>": 32077,
+  "<extra_id_23>": 32076,
+  "<extra_id_24>": 32075,
+  "<extra_id_25>": 32074,
+  "<extra_id_26>": 32073,
+  "<extra_id_27>": 32072,
+  "<extra_id_28>": 32071,
+  "<extra_id_29>": 32070,
+  "<extra_id_2>": 32097,
+  "<extra_id_30>": 32069,
+  "<extra_id_31>": 32068,
+  "<extra_id_32>": 32067,
+  "<extra_id_33>": 32066,
+  "<extra_id_34>": 32065,
+  "<extra_id_35>": 32064,
+  "<extra_id_36>": 32063,
+  "<extra_id_37>": 32062,
+  "<extra_id_38>": 32061,
+  "<extra_id_39>": 32060,
+  "<extra_id_3>": 32096,
+  "<extra_id_40>": 32059,
+  "<extra_id_41>": 32058,
+  "<extra_id_42>": 32057,
+  "<extra_id_43>": 32056,
+  "<extra_id_44>": 32055,
+  "<extra_id_45>": 32054,
+  "<extra_id_46>": 32053,
+  "<extra_id_47>": 32052,
+  "<extra_id_48>": 32051,
+  "<extra_id_49>": 32050,
+  "<extra_id_4>": 32095,
+  "<extra_id_50>": 32049,
+  "<extra_id_51>": 32048,
+  "<extra_id_52>": 32047,
+  "<extra_id_53>": 32046,
+  "<extra_id_54>": 32045,
+  "<extra_id_55>": 32044,
+  "<extra_id_56>": 32043,
+  "<extra_id_57>": 32042,
+  "<extra_id_58>": 32041,
+  "<extra_id_59>": 32040,
+  "<extra_id_5>": 32094,
+  "<extra_id_60>": 32039,
+  "<extra_id_61>": 32038,
+  "<extra_id_62>": 32037,
+  "<extra_id_63>": 32036,
+  "<extra_id_64>": 32035,
+  "<extra_id_65>": 32034,
+  "<extra_id_66>": 32033,
+  "<extra_id_67>": 32032,
+  "<extra_id_68>": 32031,
+  "<extra_id_69>": 32030,
+  "<extra_id_6>": 32093,
+  "<extra_id_70>": 32029,
+  "<extra_id_71>": 32028,
+  "<extra_id_72>": 32027,
+  "<extra_id_73>": 32026,
+  "<extra_id_74>": 32025,
+  "<extra_id_75>": 32024,
+  "<extra_id_76>": 32023,
+  "<extra_id_77>": 32022,
+  "<extra_id_78>": 32021,
+  "<extra_id_79>": 32020,
+  "<extra_id_7>": 32092,
+  "<extra_id_80>": 32019,
+  "<extra_id_81>": 32018,
+  "<extra_id_82>": 32017,
+  "<extra_id_83>": 32016,
+  "<extra_id_84>": 32015,
+  "<extra_id_85>": 32014,
+  "<extra_id_86>": 32013,
+  "<extra_id_87>": 32012,
+  "<extra_id_88>": 32011,
+  "<extra_id_89>": 32010,
+  "<extra_id_8>": 32091,
+  "<extra_id_90>": 32009,
+  "<extra_id_91>": 32008,
+  "<extra_id_92>": 32007,
+  "<extra_id_93>": 32006,
+  "<extra_id_94>": 32005,
+  "<extra_id_95>": 32004,
+  "<extra_id_96>": 32003,
+  "<extra_id_97>": 32002,
+  "<extra_id_98>": 32001,
+  "<extra_id_99>": 32000,
+  "<extra_id_9>": 32090,
+  "<extra_token_0>": 32101,
+  "<extra_token_10>": 32111,
+  "<extra_token_11>": 32112,
+  "<extra_token_12>": 32113,
+  "<extra_token_13>": 32114,
+  "<extra_token_14>": 32115,
+  "<extra_token_15>": 32116,
+  "<extra_token_16>": 32117,
+  "<extra_token_17>": 32118,
+  "<extra_token_18>": 32119,
+  "<extra_token_19>": 32120,
+  "<extra_token_1>": 32102,
+  "<extra_token_20>": 32121,
+  "<extra_token_21>": 32122,
+  "<extra_token_22>": 32123,
+  "<extra_token_23>": 32124,
+  "<extra_token_24>": 32125,
+  "<extra_token_25>": 32126,
+  "<extra_token_26>": 32127,
+  "<extra_token_2>": 32103,
+  "<extra_token_3>": 32104,
+  "<extra_token_4>": 32105,
+  "<extra_token_5>": 32106,
+  "<extra_token_6>": 32107,
+  "<extra_token_7>": 32108,
+  "<extra_token_8>": 32109,
+  "<extra_token_9>": 32110,
+  "<s>": 32100
+}
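
The ids in `added_tokens.json` follow a regular layout: the 100 sentinels count down from 32099 (`<extra_id_0>`) to 32000 (`<extra_id_99>`), `<s>` sits at 32100, and the 27 `<extra_token_n>` entries count up from 32101. The sketch below rebuilds that mapping from the pattern so it can be compared against the file; the function and constant names are hypothetical and exist only for illustration.

```python
from typing import Dict

SENTINEL_COUNT = 100     # <extra_id_0> .. <extra_id_99>
EXTRA_TOKEN_COUNT = 27   # <extra_token_0> .. <extra_token_26>
BOS_ID = 32100           # id listed for "<s>"


def extra_id(n: int) -> int:
    """Id of <extra_id_n>: the sentinels are numbered downwards from 32099."""
    if not 0 <= n < SENTINEL_COUNT:
        raise ValueError(f"sentinel index out of range: {n}")
    return 32099 - n


def extra_token(n: int) -> int:
    """Id of <extra_token_n>: these are numbered upwards from 32101."""
    if not 0 <= n < EXTRA_TOKEN_COUNT:
        raise ValueError(f"extra token index out of range: {n}")
    return 32101 + n


def expected_added_tokens() -> Dict[str, int]:
    """Rebuild the full token-to-id mapping implied by the pattern."""
    mapping = {f"<extra_id_{n}>": extra_id(n) for n in range(SENTINEL_COUNT)}
    mapping.update({f"<extra_token_{n}>": extra_token(n) for n in range(EXTRA_TOKEN_COUNT)})
    mapping["<s>"] = BOS_ID
    return mapping


print(len(expected_added_tokens()))  # 128
```

Comparing `expected_added_tokens()` with the parsed JSON file is a quick way to catch a missing or shifted id after the tokenizer is regenerated.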

model_files/tokenizers/t5_tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,135 @@

+{
+  "additional_special_tokens": [
+    "<s>",
+    "</s>",
+    "<pad>",
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>"
+  ],
+  "bos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

model_files/tokenizers/t5_tokenizer/spiece.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7dcab96935a2a51b1461c84e44c952ea8a3640c8bc3e2c6ae7a21d855454ae27
+size 1111492

model_files/tokenizers/t5_tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,1169 @@

+{
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "32000": {
+      "content": "<extra_id_99>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32001": {
+      "content": "<extra_id_98>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32002": {
+      "content": "<extra_id_97>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32003": {
+      "content": "<extra_id_96>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32004": {
+      "content": "<extra_id_95>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32005": {
+      "content": "<extra_id_94>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32006": {
+      "content": "<extra_id_93>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32007": {
+      "content": "<extra_id_92>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32008": {
+      "content": "<extra_id_91>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32009": {
+      "content": "<extra_id_90>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32010": {
+      "content": "<extra_id_89>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32011": {
+      "content": "<extra_id_88>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32012": {
+      "content": "<extra_id_87>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32013": {
+      "content": "<extra_id_86>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32014": {
+      "content": "<extra_id_85>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32015": {
+      "content": "<extra_id_84>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32016": {
+      "content": "<extra_id_83>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32017": {
+      "content": "<extra_id_82>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32018": {
+      "content": "<extra_id_81>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32019": {
+      "content": "<extra_id_80>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32020": {
+      "content": "<extra_id_79>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32021": {
+      "content": "<extra_id_78>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32022": {
+      "content": "<extra_id_77>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32023": {
+      "content": "<extra_id_76>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32024": {
+      "content": "<extra_id_75>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32025": {
+      "content": "<extra_id_74>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32026": {
+      "content": "<extra_id_73>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32027": {
+      "content": "<extra_id_72>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32028": {
+      "content": "<extra_id_71>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32029": {
+      "content": "<extra_id_70>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32030": {
+      "content": "<extra_id_69>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32031": {
+      "content": "<extra_id_68>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32032": {
+      "content": "<extra_id_67>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32033": {
+      "content": "<extra_id_66>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32034": {
+      "content": "<extra_id_65>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32035": {
+      "content": "<extra_id_64>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32036": {
+      "content": "<extra_id_63>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32037": {
+      "content": "<extra_id_62>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32038": {
+      "content": "<extra_id_61>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32039": {
+      "content": "<extra_id_60>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32040": {
+      "content": "<extra_id_59>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32041": {
+      "content": "<extra_id_58>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32042": {
+      "content": "<extra_id_57>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32043": {
+      "content": "<extra_id_56>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32044": {
+      "content": "<extra_id_55>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32045": {
+      "content": "<extra_id_54>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32046": {
+      "content": "<extra_id_53>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32047": {
+      "content": "<extra_id_52>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32048": {
+      "content": "<extra_id_51>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32049": {
+      "content": "<extra_id_50>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32050": {
+      "content": "<extra_id_49>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32051": {
+      "content": "<extra_id_48>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32052": {
+      "content": "<extra_id_47>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32053": {
+      "content": "<extra_id_46>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32054": {
+      "content": "<extra_id_45>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32055": {
+      "content": "<extra_id_44>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32056": {
+      "content": "<extra_id_43>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32057": {
+      "content": "<extra_id_42>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32058": {
+      "content": "<extra_id_41>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32059": {
+      "content": "<extra_id_40>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32060": {
+      "content": "<extra_id_39>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32061": {
+      "content": "<extra_id_38>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32062": {
+      "content": "<extra_id_37>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32063": {
+      "content": "<extra_id_36>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32064": {
+      "content": "<extra_id_35>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32065": {
+      "content": "<extra_id_34>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32066": {
+      "content": "<extra_id_33>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32067": {
+      "content": "<extra_id_32>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32068": {
+      "content": "<extra_id_31>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32069": {
+      "content": "<extra_id_30>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32070": {
+      "content": "<extra_id_29>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32071": {
+      "content": "<extra_id_28>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32072": {
+      "content": "<extra_id_27>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32073": {
+      "content": "<extra_id_26>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32074": {
+      "content": "<extra_id_25>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32075": {
+      "content": "<extra_id_24>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32076": {
+      "content": "<extra_id_23>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32077": {
+      "content": "<extra_id_22>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32078": {
+      "content": "<extra_id_21>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32079": {
+      "content": "<extra_id_20>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32080": {
+      "content": "<extra_id_19>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32081": {
+      "content": "<extra_id_18>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32082": {
+      "content": "<extra_id_17>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32083": {
+      "content": "<extra_id_16>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32084": {
+      "content": "<extra_id_15>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32085": {
+      "content": "<extra_id_14>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32086": {
+      "content": "<extra_id_13>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32087": {
+      "content": "<extra_id_12>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32088": {
+      "content": "<extra_id_11>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32089": {
+      "content": "<extra_id_10>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32090": {
+      "content": "<extra_id_9>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32091": {
+      "content": "<extra_id_8>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32092": {
+      "content": "<extra_id_7>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32093": {
+      "content": "<extra_id_6>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32094": {
+      "content": "<extra_id_5>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32095": {
+      "content": "<extra_id_4>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32096": {
+      "content": "<extra_id_3>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32097": {
+      "content": "<extra_id_2>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32098": {
+      "content": "<extra_id_1>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32099": {
+      "content": "<extra_id_0>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "32100": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "32101": {
+      "content": "<extra_token_0>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32102": {
+      "content": "<extra_token_1>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32103": {
+      "content": "<extra_token_2>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32104": {
+      "content": "<extra_token_3>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32105": {
+      "content": "<extra_token_4>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32106": {
+      "content": "<extra_token_5>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32107": {
+      "content": "<extra_token_6>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32108": {
+      "content": "<extra_token_7>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32109": {
+      "content": "<extra_token_8>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32110": {
+      "content": "<extra_token_9>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32111": {
+      "content": "<extra_token_10>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32112": {
+      "content": "<extra_token_11>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32113": {
+      "content": "<extra_token_12>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32114": {
+      "content": "<extra_token_13>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32115": {
+      "content": "<extra_token_14>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32116": {
+      "content": "<extra_token_15>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32117": {
+      "content": "<extra_token_16>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32118": {
+      "content": "<extra_token_17>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32119": {
+      "content": "<extra_token_18>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32120": {
+      "content": "<extra_token_19>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32121": {
+      "content": "<extra_token_20>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32122": {
+      "content": "<extra_token_21>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32123": {
+      "content": "<extra_token_22>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32124": {
+      "content": "<extra_token_23>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32125": {
+      "content": "<extra_token_24>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32126": {
+      "content": "<extra_token_25>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32127": {
+      "content": "<extra_token_26>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<s>",
+    "</s>",
+    "<pad>",
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>"
+  ],
+  "bos_token": "</s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_ids": 100,
+  "extra_special_tokens": {},
+  "legacy": false,
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "T5Tokenizer",
+  "unk_token": "<unk>"
+}
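
Reviewer note: in the `added_tokens_decoder` entries above, the T5 sentinel tokens are numbered in descending order (`<extra_id_0>` is 32099, `<extra_id_18>` is 32081). The sketch below shows that mapping. The helper name `sentinel_to_id` is hypothetical, and it assumes the pattern visible in this excerpt holds for all 100 sentinels declared by `"extra_ids": 100`.

```python
import re

# Assumptions taken from the config excerpt above, not from the tokenizer library:
SENTINEL_TOP_ID = 32099  # id listed for <extra_id_0>
EXTRA_IDS = 100          # value of "extra_ids"

_SENTINEL_RE = re.compile(r"^<extra_id_(\d+)>$")


def sentinel_to_id(token: str) -> int:
    """Map '<extra_id_k>' to its vocabulary id, assuming id = 32099 - k."""
    match = _SENTINEL_RE.match(token)
    if match is None:
        raise ValueError(f"not a sentinel token: {token!r}")
    index = int(match.group(1))
    if not 0 <= index < EXTRA_IDS:
        raise ValueError(f"sentinel index out of range: {index}")
    return SENTINEL_TOP_ID - index
```

A check like this can be run against the loaded tokenizer to catch an accidentally reordered or truncated config.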

requirements.txt ADDED Viewed

	@@ -0,0 +1,8 @@

+flask
+--extra-index-url https://download.pytorch.org/whl/cpu  # CPU-only wheels; index options must be on their own line
+torch
+transformers
+numpy
+protobuf
+sentencepiece
+gunicorn

static/style.css ADDED Viewed

	@@ -0,0 +1,274 @@

+:root {
+    --primary-color: #2962ff;
+    --secondary-color: #f8fafe;
+    --text-color: #1a1f36;
+    --border-color: #e1e6ef;
+    --hover-color: #edf2ff;
+    --shadow-color: rgba(0, 0, 0, 0.06);
+}
+* {
+    margin: 0;
+    padding: 0;
+    box-sizing: border-box;
+}
+body {
+    font-family: 'Google Sans', 'Roboto', sans-serif;
+    background: #fff;
+    color: var(--text-color);
+    min-height: 100vh;
+}
+.page-container {
+    min-height: 100vh;
+    max-width: 1200px;
+    margin: 0 auto;
+    padding: 40px 20px;
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+}
+.container {
+    max-width: 800px;
+    width: 100%;
+    background: white;
+    border-radius: 20px;
+    padding: 30px;
+    box-shadow: 0 10px 30px var(--shadow-color);
+}
+header {
+    text-align: center;
+    margin-bottom: 20px;
+}
+h1 {
+    font-size: 2.5em;
+    font-weight: 600;
+    background: linear-gradient(135deg, var(--primary-color), #1e88e5);
+    -webkit-background-clip: text;
+    background-clip: text;
+    -webkit-text-fill-color: transparent;
+    margin-bottom: 10px;
+}
+.subtitle {
+    color: #666;
+    font-size: 1.1em;
+}
+.translation-box {
+    background: white;
+    border-radius: 16px;
+    box-shadow: 0 8px 30px var(--shadow-color);
+    width: 100%;
+    max-width: 900px;
+    margin-top: 30px;
+    overflow: hidden;
+    border: 1px solid var(--border-color);
+}
+.input-section, .output-section {
+    padding: 30px;
+}
+.input-section {
+    background: var(--secondary-color);
+    border-bottom: 1px solid var(--border-color);
+}
+.input-header, .output-header {
+    margin-bottom: 15px;
+}
+label {
+    font-weight: 500;
+    color: var(--text-color);
+    display: flex;
+    align-items: center;
+    gap: 8px;
+}
+.char-count {
+    color: #5f6368;
+    font-size: 0.8em;
+}
+textarea {
+    width: 100%;
+    min-height: 160px;
+    padding: 15px;
+    border: 1px solid var(--border-color);
+    border-radius: 12px;
+    background: white;
+    font-size: 1.1em;
+    line-height: 1.5;
+    transition: border-color 0.3s ease;
+}
+textarea:focus {
+    outline: none;
+    border-color: var(--primary-color);
+    box-shadow: 0 0 0 3px rgba(41, 98, 255, 0.1);
+}
+.controls {
+    padding: 15px 30px;
+    background: white;
+    border-top: 1px solid var(--border-color);
+    display: flex;
+    justify-content: space-between;
+    align-items: center;
+}
+.primary-btn, .secondary-btn, .icon-btn {
+    padding: 12px 25px;
+    border: none;
+    border-radius: 8px;
+    cursor: pointer;
+    font-size: 1em;
+    font-weight: 500;
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    transition: all 0.3s ease;
+}
+.primary-btn {
+    background: var(--primary-color);
+    color: white;
+    padding: 12px 32px;
+    border-radius: 8px;
+    font-size: 1rem;
+    font-weight: 500;
+    letter-spacing: 0.3px;
+    transition: transform 0.2s ease, background 0.2s ease;
+}
+.primary-btn:hover {
+    background: #1e4bd8;
+    transform: translateY(-1px);
+}
+.secondary-btn {
+    background: #e0e0e0;
+    color: #666;
+}
+.secondary-btn:hover {
+    background: #d0d0d0;
+}
+.icon-btn {
+    width: 40px;
+    height: 40px;
+    padding: 0; /* the shared 12px 25px button padding does not fit a 40px box */
+    align-items: center;
+    justify-content: center;
+    border-radius: 10px;
+    transition: all 0.2s ease;
+}
+.icon-btn:hover {
+    background: var(--hover-color);
+    transform: scale(1.05);
+}
+.translation-result {
+    min-height: 120px;
+    background: white;
+    border-radius: 12px;
+    padding: 20px;
+    font-size: 1.1em;
+    line-height: 1.6;
+}
+/* Loading Spinner */
+.loading-spinner {
+    display: flex;
+    justify-content: center;
+    padding: 30px;
+}
+.spinner {
+    width: 30px;
+    height: 30px;
+    border: 3px solid var(--secondary-color);
+    border-top: 3px solid var(--primary-color);
+    border-radius: 50%;
+    animation: spin 1s linear infinite;
+}
+/* Toast Notification */
+.toast {
+    position: fixed;
+    bottom: 24px;
+    left: 50%;
+    transform: translateX(-50%);
+    background: #323232;
+    color: white;
+    padding: 12px 30px;
+    border-radius: 8px;
+    font-size: 0.95rem;
+    box-shadow: 0 2px 5px rgba(0,0,0,0.2);
+    animation: slideUp 0.3s ease;
+}
+/* Animations */
+@keyframes spin {
+    0% { transform: rotate(0deg); }
+    100% { transform: rotate(360deg); }
+}
+@keyframes fadeIn {
+    from { opacity: 0; transform: translate(-50%, 20px); }
+    to { opacity: 1; transform: translate(-50%, 0); }
+}
+@keyframes fadeOut {
+    from { opacity: 1; transform: translate(-50%, 0); }
+    to { opacity: 0; transform: translate(-50%, 20px); }
+}
+@keyframes slideUp {
+    from { transform: translate(-50%, 100%); opacity: 0; }
+    to { transform: translate(-50%, 0); opacity: 1; }
+}
+/* Responsive Design */
+@media (max-width: 768px) {
+    .page-container {
+        padding: 20px 15px;
+    }
+    .translation-box {
+        border-radius: 12px;
+        border-left: none;
+        border-right: none;
+    }
+    .container {
+        padding: 20px;
+        margin: 10px;
+    }
+    h1 {
+        font-size: 2em;
+    }
+    .input-section, .output-section {
+        padding: 20px;
+    }
+    .controls {
+        flex-direction: column;
+        padding: 15px 20px;
+    }
+    .primary-btn, .secondary-btn {
+        width: 100%;
+        justify-content: center;
+    }
+}

templates/index.html ADDED Viewed

	@@ -0,0 +1,144 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>BanglaFeel Translator</title>
+    <link rel="stylesheet" href="/static/style.css" />
+    <link href="https://fonts.googleapis.com/css2?family=Google+Sans:wght@400;500&family=Roboto:wght@400;500&display=swap" rel="stylesheet">
+    <link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons">
+</head>
+<body>
+    <div class="page-container">
+        <header>
+            <h1>BanglaFeel Translator</h1>
+            <p class="subtitle">Translate English to Bengali with ease</p>
+        </header>
+        <div class="translation-box">
+            <div class="input-section">
+                <div class="input-header">
+                    <label for="userInput">
+                        <span class="material-icons">language</span>
+                        English
+                    </label>
+                </div>
+                <textarea id="userInput" placeholder="Type or paste your text here..." maxlength="500"></textarea>
+            </div>
+            <div class="controls">
+                <div class="left-controls">
+                    <button id="clearBtn" class="icon-btn" title="Clear text">
+                        <span class="material-icons">clear</span>
+                    </button>
+                </div>
+                <div class="right-controls">
+                    <span class="char-count">0/500</span>
+                    <button id="translateBtn" class="primary-btn">
+                        <span class="material-icons">translate</span>
+                        Translate
+                    </button>
+                </div>
+            </div>
+            <div class="output-section">
+                <div class="output-header">
+                    <label>
+                        <span class="material-icons">translate</span>
+                        Bengali
+                    </label>
+                    <button id="copyBtn" class="icon-btn" title="Copy translation">
+                        <span class="material-icons">content_copy</span>
+                    </button>
+                </div>
+                <div class="translation-result">
+                    <p id="translationResult"></p>
+                    <div class="loading-spinner" style="display: none;">
+                        <div class="spinner"></div>
+                    </div>
+                </div>
+            </div>
+        </div>
+    </div>
+    <script>
+        document.addEventListener('DOMContentLoaded', function() {
+            const userInput = document.getElementById('userInput');
+            const charCount = document.querySelector('.char-count');
+            const translateBtn = document.getElementById('translateBtn');
+            const clearBtn = document.getElementById('clearBtn');
+            const copyBtn = document.getElementById('copyBtn');
+            const translationResult = document.getElementById('translationResult');
+            const loadingSpinner = document.querySelector('.loading-spinner');
+            // Character counter
+            userInput.addEventListener('input', function() {
+                charCount.textContent = `${this.value.length}/500`;
+            });
+            // Clear button
+            clearBtn.addEventListener('click', function() {
+                userInput.value = '';
+                translationResult.textContent = '';
+                charCount.textContent = '0/500';
+            });
+            // Copy button
+            copyBtn.addEventListener('click', function() {
+                if (translationResult.textContent) {
+                    navigator.clipboard.writeText(translationResult.textContent)
+                        .then(() => {
+                            copyBtn.innerHTML = '<span class="material-icons">check</span>';
+                            setTimeout(() => {
+                                copyBtn.innerHTML = '<span class="material-icons">content_copy</span>';
+                            }, 2000);
+                        });
+                }
+            });
+            // Translation
+            translateBtn.addEventListener('click', async function() {
+                const text = userInput.value.trim();
+                if (!text) {
+                    showToast('Please enter some text first.');
+                    return;
+                }
+                // Show loading state
+                loadingSpinner.style.display = 'flex';
+                translateBtn.disabled = true;
+                translationResult.style.opacity = '0.5';
+                try {
+                    const response = await fetch('/translate', {
+                        method: 'POST',
+                        headers: { 'Content-Type': 'application/json' },
+                        body: JSON.stringify({ text: text })
+                    });
+                    if (!response.ok) throw new Error(`Server error: ${response.status}`);
+                    const data = await response.json();
+                    translationResult.textContent = data.translation;
+                } catch (error) {
+                    console.error('Error:', error);
+                    translationResult.textContent = 'An error occurred during translation.';
+                    showToast('Translation failed. Please try again.');
+                } finally {
+                    loadingSpinner.style.display = 'none';
+                    translateBtn.disabled = false;
+                    translationResult.style.opacity = '1'; // restore on failure too, so the error text is not dimmed
+                }
+            });
+            function showToast(message) {
+                const toast = document.createElement('div');
+                toast.className = 'toast';
+                toast.textContent = message;
+                document.body.appendChild(toast);
+                setTimeout(() => toast.remove(), 3000);
+            }
+        });
+    </script>
+</body>
+</html>
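
Reviewer note: the script above POSTs `{"text": ...}` as JSON to `/translate`, reads `data.translation` from the reply, and treats any non-2xx status as a failure. The sketch below shows that contract in plain Python. It is not the handler from `app.py`, which is not shown in this part of the diff: `handle_translate`, the `translate` callable and the error payload shape are all hypothetical. Only the 500-character limit is taken from the page, where it is the textarea's `maxlength`.

```python
import json
from typing import Callable, Tuple

MAX_CHARS = 500  # mirrors maxlength="500" on the textarea


def handle_translate(raw_body: str, translate: Callable[[str], str]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a POST to /translate.

    `translate` stands in for the model call; it is an assumption of this sketch.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "request body must be JSON"})

    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, json.dumps({"error": "'text' must be a non-empty string"})
    if len(text) > MAX_CHARS:
        return 400, json.dumps({"error": f"'text' exceeds {MAX_CHARS} characters"})

    # ensure_ascii=False keeps Bengali output readable in the raw response.
    body = json.dumps({"translation": translate(text.strip())}, ensure_ascii=False)
    return 200, body
```

Validating on the server as well as in the browser matters because `maxlength` only constrains the form, not direct requests to the endpoint.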