# Model Card for peleke-mistral-7b-instruct-v0.2
This model is a fine-tuned version of [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) for antibody sequence generation. It takes an antigen sequence as input and returns novel Fv regions for the heavy and light antibody chains.
## Important Note on Vocabulary Size

This model was fine-tuned with additional special tokens, resulting in a vocabulary of 32,005 tokens (vs. the base model's 32,000). You must resize the base model's embeddings to 32,005 **before** loading the PEFT adapters.
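The arithmetic behind the resize is simple; a minimal sketch (the helper `rows_to_add` is illustrative, not part of `transformers`):

```python
def rows_to_add(embedding_rows: int, tokenizer_len: int) -> int:
    """Number of embedding rows to add before loading the PEFT adapters."""
    missing = tokenizer_len - embedding_rows
    if missing < 0:
        raise ValueError("embedding table is larger than the tokenizer vocabulary")
    return missing

# The base checkpoint ships 32,000 rows; the fine-tuned tokenizer has 32,005 entries.
print(rows_to_add(32000, 32005))  # 5 -> the five added special tokens
```

In practice, `model.resize_token_embeddings(len(tokenizer))` performs this adjustment in one call.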
## Quick Start

### 1. Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
import torch

model_path = "silicobio/peleke-mistral-7b-instruct-v0.2"

# Load configuration and tokenizer
config = PeftConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Set pad token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",  # Automatically handle device placement
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# IMPORTANT: Resize embeddings to match the fine-tuned vocabulary size
expected_vocab_size = 32005  # The fine-tuned model has 5 additional tokens
base_model.resize_token_embeddings(expected_vocab_size)

# Load PEFT adapters
model = PeftModel.from_pretrained(
    base_model,
    model_path,
    is_trainable=False,  # Set to False for inference
)
model.eval()
```
### 2. Format Your Input

This model uses `<epi>` and `</epi>` tokens to annotate epitope residues of interest. It is often easier to annotate with brackets instead, e.g. `...CSFS[S][F][V]L[N]WY...`, and convert to the `<epi>` format with a small helper:
```python
import re

def convert_epitope_format(sequence):
    """Convert [X] annotations to <epi>X</epi> format."""
    return re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', sequence)

def format_prompt(antigen_sequence):
    """Format the antigen sequence as a prompt for the model."""
    formatted_antigen = convert_epitope_format(antigen_sequence)
    return f"Antigen: <s>{formatted_antigen}</s>\nAntibody:"
```
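To see what the conversion produces, here is a self-contained demo (the fragment is illustrative; the converter is repeated so the snippet runs on its own):

```python
import re

def convert_epitope_format(sequence):
    # Repeats the helper above so this snippet runs standalone
    return re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', sequence)

fragment = "CSFS[S][F][V]L[N]WY"  # illustrative fragment
print(convert_epitope_format(fragment))
# CSFS<epi>S</epi><epi>F</epi><epi>V</epi>L<epi>N</epi>WY
```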
### 3. Generate an Antibody Sequence
```python
def generate_antibody(model, tokenizer, antigen_sequence):
    # Format the prompt
    prompt = format_prompt(antigen_sequence)

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)

    # Move inputs to the model's device
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Store the input length so only the generated tokens are extracted
    input_length = inputs["input_ids"].shape[1]

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=800,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            use_cache=False,
        )

    # Decode only the generated part
    generated_tokens = outputs[0][input_length:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# Example usage
antigen = "AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEE[I]IE[R][N]RVLFATGSPYYDKNSP"
antibody = generate_antibody(model, tokenizer, antigen)
print(f"Antigen: {antigen}\nAntibody: {antibody}")
```
This generates a `|`-delimited output representing the Fv portions of the heavy and light chains:

```
Antigen: NPPTFSPALL...
Antibody: QVQLVQSGGG...|DIQMTQSPSS...
```
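Since the chains are separated by a single `|`, the output can be split apart for downstream use. A minimal sketch (`split_chains` is a hypothetical helper; the sequences are placeholders):

```python
def split_chains(antibody: str) -> tuple[str, str]:
    """Split a '|'-delimited Fv output into (heavy, light) chain sequences."""
    heavy, sep, light = antibody.partition("|")
    if not sep:
        raise ValueError("no '|' chain separator found in model output")
    return heavy.strip(), light.strip()

heavy, light = split_chains("QVQLVQSGGG|DIQMTQSPSS")  # placeholder sequences
print(heavy, light)  # QVQLVQSGGG DIQMTQSPSS
```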
## Alternative Loading Method (if vocabulary size issues persist)
```python
def load_model_with_vocab_fix(model_path):
    import torch  # needed for torch.bfloat16 below
    from peft import PeftConfig, PeftModel
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Load config
    config = PeftConfig.from_pretrained(model_path)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

    # Force resize to match the fine-tuned vocabulary size
    base_model.resize_token_embeddings(32005)

    # Load PEFT adapters
    model = PeftModel.from_pretrained(
        base_model,
        model_path,
        is_trainable=False,
    )
    model.eval()
    return model, tokenizer
```
## Special Tokens

The model recognizes the following special tokens:

- `<epi>`, `</epi>`: epitope markers
- `<s>`, `</s>`: sequence boundaries (Mistral BOS/EOS tokens)
- `|`: chain separator (between heavy and light chains)

Additional tokens that may be present:

- `Antigen`, `Antibody`, `Epitope`: task identifiers
- Single amino acid codes: `A`, `C`, `D`, `E`, `F`, `G`, `H`, `I`, `K`, `L`, `M`, `N`, `P`, `Q`, `R`, `S`, `T`, `V`, `W`, `Y`
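When decoding with `skip_special_tokens=False`, these markers can remain in the text. A small stdlib helper for removing them (the marker list mirrors the tokens above; `strip_markers` is illustrative, not part of this repo):

```python
SPECIAL_MARKERS = ("<epi>", "</epi>", "<s>", "</s>")

def strip_markers(text: str) -> str:
    """Remove annotation markers, leaving the bare amino-acid sequence."""
    for marker in SPECIAL_MARKERS:
        text = text.replace(marker, "")
    return text

print(strip_markers("CSFS<epi>S</epi><epi>F</epi>VL"))  # CSFSSFVL
```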
## Training Procedure

This model was trained with Supervised Fine-Tuning (SFT) on antibody-antigen pairs from the SAbDab database.
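The exact SFT template is not published on this card; a plausible sketch, assuming each training pair mirrors the inference prompt above (the helper `to_training_example` is an assumption, not the actual pipeline):

```python
def to_training_example(antigen: str, heavy: str, light: str) -> str:
    # Assumed format: same prompt as at inference time,
    # with the '|'-joined chains as the completion target
    return f"Antigen: <s>{antigen}</s>\nAntibody: {heavy}|{light}"

# Placeholder sequences for illustration only
print(to_training_example("NPPTFSPALL", "QVQLVQSGGG", "DIQMTQSPSS"))
```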
## Framework Versions

- PEFT: 0.17.0
- TRL: 0.19.1
- Transformers: 4.54.0
- PyTorch: 2.7.1
- Datasets: 4.0.0
- Tokenizers: 0.21.2
## Known Issues and Solutions

### Vocabulary Size Mismatch

If you encounter a `RuntimeError` about a size mismatch (32005 vs. 32000), ensure you resize the embeddings **before** loading the PEFT adapters:

```python
base_model.resize_token_embeddings(32005)  # Must be done before PeftModel.from_pretrained()
```
### Generation Parameters

For best results, use:

- `temperature`: 0.7-0.9 for diversity
- `max_new_tokens`: 800-1000 (antibody sequences can be long)
- `do_sample`: `True`
- `use_cache`: `False` (for memory efficiency with long sequences)
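These recommendations can be collected into a reusable kwargs dict (the specific values below simply pick points within the ranges above):

```python
generation_kwargs = {
    "max_new_tokens": 900,  # within the recommended 800-1000 range
    "do_sample": True,
    "temperature": 0.8,     # within the recommended 0.7-0.9 range
    "use_cache": False,     # memory efficiency with long sequences
}
# Usage: model.generate(**inputs, **generation_kwargs, pad_token_id=tokenizer.pad_token_id)
```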
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{peleke-mistral-2024,
  title={Peleke Mistral 7B Instruct v0.2 for Antibody Generation},
  author={SilicoBio},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/silicobio/peleke-mistral-7b-instruct-v0.2}
}
```