from prompt_parsing import (
    llm_wait_after_request,
    LLMWaitTime,
    prompt_paraphrase,
)

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from os import getenv
import glob
from pathlib import Path
import json
import random
from typing import Tuple


system_prompt = """
You are a helpful Artificial Intelligence (AI) Research bot. Below is the Abstract, Introduction, Related Work, Limitation/Future Work and the Conclusion of the Research Paper "Byte Latent Transformer: Patches Scale Better Than Tokens". Users can ask you questions about the paper, and you will provide the answers. Your answers must be in a detailed manner, and primarily come from the information in excerpts, additionally with your general knowledge. The goal is to help users understand fully.

# Abstract
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

# Introduction
We introduce the Byte Latent Transformer ( **BLT** ), a tokenizer-free architecture that learns from raw byte data and, for the first time, matches the performance of tokenization-based models at scale, with significant improvements in efficiency and robustness (§6). Existing large language models (llms) are trained almost entirely end-to-end, except for tokenization—a heuristic pre-processing step that groups bytes into a static set of tokens. Such tokens bias how a string is compressed, leading to shortcomings such as domain/modality sensitivity (Dagan et al., 2024), sensitivity to input noise (§6), a lack of orthographic knowledge (Edman et al., 2024), and multilingual inequity (Liang et al., 2023; Petrov et al., 2024; Limisiewicz et al., 2024).

Tokenization has previously been essential because directly training llms on bytes is prohibitively costly at scale due to long sequence lengths (Xue et al., 2022). Prior works mitigate this by employing more efficient self-attention (El Boukkouri et al., 2020; Clark et al., 2022) or attention-free architectures (Wang et al., 2024) (§8). However, this primarily helps train *small models* . At scale, the computational cost of a Transformer is dominated by large feed-forward network layers that run on every byte, not the cost of the attention mechanism.

To efficiently allocate compute, we propose a dynamic, learnable method for grouping bytes into *patches* (§2) and a new model architecture that mixes byte and patch information. Unlike tokenization, BLT has no fixed vocabulary for patches. Arbitrary groups of bytes are mapped to latent patch representations via light-weight learned encoder and decoder modules. We show that this results in *more* efficient allocation of compute than tokenization-based models.

Tokenization-based llms allocate the same amount of compute to every token. This trades efficiency for performance, since tokens are induced with compression heuristics that are not always correlated with the complexity of predictions. Central to our architecture is the idea that models should dynamically allocate compute where it is needed. For example, a large transformer is not needed to predict the ending of most words, since these are comparably easy, low-entropy decisions compared to choosing the first word of a new sentence. This is reflected in BLT’s architecture (§3) where there are three transformer blocks: two small byte-level *local models* and a large global *latent transformer* (Figure 2). To determine how to group bytes into patches and therefore how to dynamically allocate compute, BLT segments data based on the entropy of the next-byte prediction creating contextualized groupings of bytes with relatively uniform information density.

We present the first flop-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes, showing that we can train a model end-to-end at scale from bytes without fixed-vocabulary tokenization. Overall, BLT matches training flop-controlled performance [1] of Llama 3 while using up to 50% fewer flops at inference (§5). We also show that directly working with raw bytes provides significant improvements in modeling the long-tail of the data. BLT models are more robust than tokenizer-based models to noisy inputs and display enhanced character level understanding abilities demonstrated on orthographic knowledge, phonology, and low-resource machine translation tasks (§6). Finally, with BLT models, we can simultaneously increase model size and patch size while maintaining the same inference flop budget. Longer patch sizes, on average, save compute which can be reallocated to grow the size of the global latent transformer, because it is run less often. We conduct inference-flop controlled scaling experiments (Figure 1), and observe significantly better scaling trends than with tokenization-based architectures.

In summary, this paper makes the following contributions: 1) We introduce BLT, a byte latent llm architecture that dynamically allocates compute to improve flop efficiency, 2) We show that we achieve training flopcontrolled parity with Llama 3 up to 8B scale while having the option to trade minor losses in evaluation metrics for flop efficiency gains of up to 50%, 3) BLT models unlock a new dimension for scaling llms, where model size can now be scaled while maintaining a fixed-inference budget, 4) We demonstrate the improved robustness of BLT models to input noise and their awareness of sub-word aspects of input data that token-based llms miss. We release the training and inference code for BLT at https://github.com/facebookresearch/blt.](https://github.com/facebookresearch/blt)

# Related Work
**Character-Level RNNs:** Character Language Modeling has been a popular task ever since the early days of neural models (Sutskever et al., 2011; Mikolov et al., 2012; Graves, 2013) owing to their flexibility of modeling out of vocabulary words organically without resorting to back-off methods. Kim et al. (2016) also train a model that processes characters only on the input side using convolutional and highway networks that feed into LSTM-based RNNs and are able to match performance with the RNN based state-of-the-art language models of the time on English and outperform them on morphologically rich languages, another sought-after advantage of character-level LLMs. Kenter et al. (2018) do machine comprehension using byte-level LSTM models that outperformed word-level models again on morphologically-rich Turkish and Russian languages. Along similar lines, Zhang et al. (2015) used character-based convolutional models for classification tasks, which outperformed word-level models for certain tasks. Chung et al. (2019) use hierarchical LSTM models using boundary-detectors at each level to discover the latent hierarchy in text, to further improve performance on character level language modeling. ByteNet by Kalchbrenner et al. (2016) uses CNN based layers on characters as opposed to attention for machine translation.

**Character-Level Transformers:** The development of transformer models using attention (Vaswani et al., 2017) together with subword tokenization (Sennrich et al., 2016), significantly improved the performance of neural models on language modeling and benchmark tasks. However, word and sub-word units implicitly define an inductive bias for the level of abstraction models should operate on. To combine the successes of transformer models with the initial promising results on character language modeling, Al-Rfou et al. (2019) use very deep transformers, and with the help of auxiliary losses, train transformer-based models that outperformed previous LSTM based character llms. However, they still saw a significant gap from word level LLMs. GPT-2 (Radford et al., 2019) also observed that on large scale datasets like the 1 billion word benchmark, byte-level LMs were not competitive with word-level LMs.

While Choe et al. (2019) demonstrated that byte-level llms based on transformers can outperform subword level LLMs with comparable parameters, the models take up much more compute and take much longer to train. Similarly, El Boukkouri et al. (2020) train a BERT model (CharFormer) that builds word representations by applying convolutions on character embeddings, and demonstrate improvements on the medical domain, but they also expend much more compute in doing so. Clark et al. (2022) develop CANINE, a 150M parameter encoder-only model that operates directly on character sequences. CANINE uses a deep transformer stack at its core similar in spirit to our global model, and a combination of a local transformer and strided convolutions to downsample the input characters, and outperforms the equivalent token-level encoder-only model (mBERT) on downstream multilingual tasks. ByT5 (Xue et al., 2022) explored approaches for byte-level encoder decoder models, that do not use any kind of patching operations. While their model exhibited improved robustness to noise, and was competitive with tokenizer-based models with 4x less data, the lack of patching meant that the models needed to compute expensive attention operations over every byte, which was extremely compute heavy. Directly modeling bytes instead of subword units increases the sequence length of the input making it challenging to efficiently scale byte level models. Recently, using the Mamba Architecture (Gu and Dao, 2023), which can maintain a fixed-size memory state over a very large context length, Wang et al. (2024) train a byte-level Mamba architecture also without using patching, and are able to outperform byte-level transformer models in a flop controlled setting at the 350M parameter scale in terms of bits-per-byte on several datasets.

**Patching-based approaches:** The effective use of patching can bring down the otherwise inflated number of flops expended by byte-level LLMs while potentially retaining performance, and many works demonstrated initial successes at a small scale of model size and number of training bytes. Nawrot et al. (2022) experiment with static patching based downsampling and upsampling and develop the hourglass transformer which outperforms other byte-level baselines at the 150M scale. Nawrot et al. (2023) further improve this with the help of dynamic patching schemes, including a boundary-predictor that is learned in an end-to-end fashion, a boundary-predictor supervised using certain tokenizers, as well as an entropy-based patching model similar to BLT, and show that this approach can outperform the vanilla transformers of the time on language modeling tasks at a 40M parameter scale on 400M tokens. Lester et al. (2024) investigate training on sequences compressed using arithmetic coding to achieve compression rates beyond what BPE can achieve, and by using a equal-info windows technique, are able to outperform byte-level baselines on language modeling tasks, but underperform subword baselines.

Our work draws inspiration and is most closely related to MegaByte (Yu et al., 2023), which is a decoder only causal LLM that uses a fixed static patching and concatenation of representations to convert bytes to patches, and uses a local model on the decoder side to convert from patches back into bytes. They demonstrate that MegaByte can match tokenizer-based models at a 1B parameter scale on a dataset of 400B bytes. We ablate MegaByte in all our experiments and find that static patching lags behind the current state-of-the-art compute optimally trained tokenizer based models in a flop controlled setting and we demonstrate how BLT bridges this gap. Slagle (2024) make the same observation about MegaByte and suggest extending the static patching method to patching on whitespaces and other space-like bytes, and also add a local encoder model. They find improvements over tokenized-based transformer models in a compute controlled setting on some domains such as Github and arXiv at the 1B parameter scale. We also report experiments with this model, and show that further architectural improvements are needed to scale up byte-level models even further and truly match current state-of-the-art token-based models such as Llama 3. 

# Limitations and Future Work
In this work, for the purposes of architectural choices, we train models for the optimal number of steps as determined for Llama 3 (Dubey et al., 2024). However, these scaling laws were calculated for BPE-level transformers and may lead to suboptimal (data, parameter sizes) ratios in the case of BLT. We leave for future work the calculation of scaling laws for BLT potentially leading to even more favorable scaling trends for our architecture. Additionally, many of these experiments were conducted at scales upto 1B parameters, and it is possible for the optimal architectural choices to change as we scale to 8B parameters and beyond, which may unlock improved performance for larger scales. Existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. While we present theoretical flop matched experiments and also use certain efficient implementations (such as FlexAttention) to handle layers that deviate from the vanilla transformer architecture, our implementations may yet not be at parity with tokenizer-based models in terms of wall-clock time and may benefit from further optimizations. While BLT uses a separately trained entropy model for patching, learning the patching model in an end-to-end fashion can be an interesting direction for future work. In Section 6.2, we present initial experiments showing indications of success for “byte-ifying” tokenizer-based models such as Llama 3 that are trained on more than 10T tokens, by initializing and freezing the global transformer with their weights. Further work in this direction may uncover methods that not only retain the benefits of bytefying, but also push performance beyond that of these tokenizer-based models without training them from scratch. 

# Conclusion
This paper presents the Byte Latent Transformer ( **BLT** ), a new architecture that redefines the conventional dependency on fixed-vocabulary tokenization in large language models. By introducing a dynamic, learnable method for grouping bytes into patches, BLT effectively allocates computational resources based on data complexity, leading to significant improvements in both efficiency and robustness. Our extensive scaling study demonstrates that BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops. Furthermore, BLT unlocks a new dimension for scaling, allowing simultaneous increases in model and patch size within a fixed inference budget. This new paradigm becomes advantageous for compute regimes commonly encountered in practical settings. While directly engaging with raw byte data, BLT also improves the model’s ability to handle the long-tail of data, offering significant improvements in robustness to noisy inputs and a deeper understanding of sub-word structures. Overall, these results position BLT as a promising alternative to traditional tokenization-based approaches, providing a scalable and robust framework for more efficient and adaptable language models.
"""

user_prompt = "Answer this question: {question}"

prompt_qa = ChatPromptTemplate([("system", system_prompt), ("human", user_prompt)])

oneshot_deepseek_llm = ChatOpenAI(
    openai_api_key=getenv("OPENROUTER_API_KEY"),  # type: ignore
    openai_api_base="https://openrouter.ai/api/v1",  # type: ignore
    model_name="deepseek/deepseek-r1:free",  # type: ignore
)

oneshot_gemini_llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

oneshot_qwen_llm = ChatOpenAI(
    openai_api_key=getenv("OPENROUTER_API_KEY"),  # type: ignore
    openai_api_base="https://openrouter.ai/api/v1",  # type: ignore
    model_name="qwen/qwen-2.5-72b-instruct:free",  # type: ignore
)

oneshot_llama_llm = ChatOpenAI(
    openai_api_key=getenv("OPENROUTER_API_KEY2"),  # type: ignore
    openai_api_base="https://openrouter.ai/api/v1",  # type: ignore
    model_name="meta-llama/llama-3.3-70b-instruct:free",  # type: ignore
)


@llm_wait_after_request(LLMWaitTime.OpenRouter_DeepSeek_R1)
def llm_generate_answer_deepseek(question) -> str:
    chain_qa = prompt_qa | oneshot_deepseek_llm

    response_qa = chain_qa.invoke({"question": question})
    return str(response_qa.content)


prompt_paraphrase = ChatPromptTemplate(
    [("human", "Paraphrase the following {sentence_type}: {sentence}")]
)


@llm_wait_after_request(LLMWaitTime.Google_Gemini_20_Flash)
def llm_generate_answer_gemini(question) -> str:
    chain_qa = prompt_qa | oneshot_gemini_llm

    response_qa = chain_qa.invoke({"question": question})
    return str(response_qa.content)


@llm_wait_after_request(LLMWaitTime.OpenRouter_Qwen25_72B_Instruct)
def llm_paraphrase_qwen(sentence_type, sentence) -> str:
    chain_paraphrase = prompt_paraphrase | oneshot_qwen_llm

    response_paraphrase = chain_paraphrase.invoke(
        {
            "sentence_type": sentence_type,
            "sentence": sentence,
        }
    )
    return str(response_paraphrase.content)


def llm_paraphrase_qa_qwen(question, answer) -> Tuple[str, str]:
    res_question = llm_paraphrase_qwen("question", question)
    res_answer = llm_paraphrase_qwen("answer", answer)

    return (res_question, res_answer)


@llm_wait_after_request(LLMWaitTime.OpenRouter_Llama33_70B_Instruct)
def llm_paraphrase_llama(sentence_type, sentence) -> str:
    chain_paraphrase = prompt_paraphrase | oneshot_llama_llm

    response_paraphrase = chain_paraphrase.invoke(
        {
            "sentence_type": sentence_type,
            "sentence": sentence,
        }
    )
    return str(response_paraphrase.content)


def llm_paraphrase_qa_llama(question, answer) -> Tuple[str, str]:
    res_question = llm_paraphrase_llama("question", question)
    res_answer = llm_paraphrase_llama("answer", answer)

    return (res_question, res_answer)


def process_file_and_generate_answer(file_path, llm_func):
    question_tag = "Question:"
    file_name_answer_template = file_path.replace("_question", "_answer_q{0}")
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
        for idx, line in enumerate(lines):
            line = line.strip()
            if line.find(question_tag) >= 0:
                question = line.split(question_tag)[1].strip()
                print(f"Processing question: {question}")
                file_name_answer = file_name_answer_template.format(idx + 1)
                my_file = Path(file_name_answer)
                if my_file.is_file():
                    continue
                answer = llm_func(question)
                # answer = "AAA"
                with open(file_name_answer, "w", encoding="utf-8") as text_file:
                    print(f"{answer}", file=text_file)


def collect_all_questions_and_answers(file_path):
    question_tag = "Question:"
    file_name_answer_template = file_path.replace("_question", "_answer_q{0}")
    qa = []
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
        for idx, line in enumerate(lines):
            line = line.strip()
            if line.find(question_tag) >= 0:
                question = line.split(question_tag)[1].strip()
                print(f"Processing question: {question}")
                file_name_answer = file_name_answer_template.format(idx + 1)
                my_file = Path(file_name_answer)
                if not my_file.is_file():
                    raise FileNotFoundError(
                        f"Answer file not found: {file_name_answer}"
                    )
                with open(file_name_answer, "r", encoding="utf-8") as text_file:
                    answer = text_file.read().strip()
                    qa.append({"question": question, "answer": answer})

    return qa


if __name__ == "__main__":
    print("*** Start ***")

    ###### GENERATE ANSWER USING DEEPSEEK ######
    base_dir = "Data/DeepSeek"
    files = glob.glob(f"{base_dir}/*_question.txt")
    files = sorted(files)

    for file in files:
        print(f"Processing file: {file}")
        process_file_and_generate_answer(file, llm_generate_answer_deepseek)

    ###### GENERATE ANSWER USING GEMINI ######
    base_dir = "Data/Gemini"
    files = glob.glob(f"{base_dir}/*_question.txt")
    files = sorted(files)

    for file in files:
        print(f"Processing file: {file}")
        process_file_and_generate_answer(file, llm_generate_answer_gemini)

    ###### COLLECT ALL QUESTIONS AND ANSWERS ######
    paths = [
        "Data/ChatGPT",
        "Data/DeepSeek",
        "Data/Gemini",
    ]

    all_qa = []

    for path in paths:
        files = glob.glob(f"{path}/*_question.txt")
        files = sorted(files)
        for file in files:
            print(f"Processing file: {file}")
            qa = collect_all_questions_and_answers(file)
            all_qa.extend(qa)

    ###### Paraphrase using Qwen ######
    paraphrased_all_qa_qwen = []
    for qa in all_qa:
        res_question, res_answer = llm_paraphrase_qa_qwen(qa["question"], qa["answer"])
        paraphrased_all_qa_qwen.append(
            {
                "question": res_question,
                "answer": res_answer,
            }
        )

    ###### Paraphrase using Llama ######
    paraphrased_all_qa_llama = []
    for qa in all_qa:
        res_question, res_answer = llm_paraphrase_qa_llama(qa["question"], qa["answer"])
        # Quick fix
        if res_question.find("paraphrased version") >= 0:
            colon_pos = res_question.find(":")
            if colon_pos >= 0:
                res_question = res_question[colon_pos + 1 :].strip()
        if res_answer.find("paraphrased version") >= 0:
            colon_pos = res_answer.find(":")
            if colon_pos >= 0:
                res_answer = res_answer[colon_pos + 1 :].strip()

        paraphrased_all_qa_llama.append(
            {
                "question": res_question,
                "answer": res_answer,
            }
        )

    # Combine all paraphrased QA
    all_qa.extend(paraphrased_all_qa_qwen)
    all_qa.extend(paraphrased_all_qa_llama)

    ###### SAVE ######
    random.shuffle(all_qa)
    all_qa_jsons = [json.dumps(qa) for qa in all_qa]

    split = round(len(all_qa_jsons) * 0.9)
    qa_for_fine_tuning_train = all_qa_jsons[:split]
    qa_for_fine_tuning_eval = all_qa_jsons[split:]

    whole_text_train = "\n".join(qa_for_fine_tuning_train)
    with open("qa_for_fine_tuning_train.json", "w", encoding="utf-8") as text_file:
        print(f"{whole_text_train}", file=text_file)

    whole_text_eval = "\n".join(qa_for_fine_tuning_eval)
    with open("qa_for_fine_tuning_eval.json", "w", encoding="utf-8") as text_file:
        print(f"{whole_text_eval}", file=text_file)

    print("*** Done ***")