# Model Card for peleke-mistral-7b-instruct-v0.2
This model is a fine-tuned version of [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) for antibody sequence generation. It takes an antigen sequence as input and returns novel Fv regions for the heavy and light antibody chains.
## Important Note on Vocabulary Size

This model was fine-tuned with additional special tokens, resulting in a vocabulary of 32,005 tokens (vs. the base model's 32,000). You must resize the base model's embeddings to 32,005 **before** loading the PEFT adapters.
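The arithmetic behind the resize is simple; a minimal sketch (the helper `rows_to_add` is illustrative, not part of `transformers`):

```python
def rows_to_add(embedding_rows: int, tokenizer_len: int) -> int:
    """Number of embedding rows to add before loading the PEFT adapters."""
    missing = tokenizer_len - embedding_rows
    if missing < 0:
        raise ValueError("embedding table is larger than the tokenizer vocabulary")
    return missing

# The base checkpoint ships 32,000 rows; the fine-tuned tokenizer has 32,005 entries.
print(rows_to_add(32000, 32005))  # 5 -> the five added special tokens
```

In practice, `model.resize_token_embeddings(len(tokenizer))` performs this adjustment in one call.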
## Quick Start

### 1. Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
import torch

model_path = "silicobio/peleke-mistral-7b-instruct-v0.2"

# Load configuration and tokenizer
config = PeftConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Set pad token if not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",  # Automatically handle device placement
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# IMPORTANT: Resize embeddings to match the fine-tuned vocabulary size
expected_vocab_size = 32005  # The fine-tuned model has 5 additional tokens
base_model.resize_token_embeddings(expected_vocab_size)

# Load PEFT adapters
model = PeftModel.from_pretrained(
    base_model,
    model_path,
    is_trainable=False,  # Set to False for inference
)
model.eval()
```
### 2. Format Your Input

This model uses `<epi>` and `</epi>` tokens to annotate epitope residues of interest. It is often easier to annotate with brackets instead, e.g. `...CSFS[S][F][V]L[N]WY...`, and convert to the `<epi>` format with a small helper:
```python
import re

def convert_epitope_format(sequence):
    """Convert [X] annotations to <epi>X</epi> format."""
    return re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', sequence)

def format_prompt(antigen_sequence):
    """Format the antigen sequence as a prompt for the model."""
    formatted_antigen = convert_epitope_format(antigen_sequence)
    return f"Antigen: <s>{formatted_antigen}</s>\nAntibody:"
```
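To see what the conversion produces, here is a self-contained demo (the fragment is illustrative; the converter is repeated so the snippet runs on its own):

```python
import re

def convert_epitope_format(sequence):
    # Repeats the helper above so this snippet runs standalone
    return re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', sequence)

fragment = "CSFS[S][F][V]L[N]WY"  # illustrative fragment
print(convert_epitope_format(fragment))
# CSFS<epi>S</epi><epi>F</epi><epi>V</epi>L<epi>N</epi>WY
```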
### 3. Generate an Antibody Sequence
```python
def generate_antibody(model, tokenizer, antigen_sequence):
    # Format the prompt
    prompt = format_prompt(antigen_sequence)

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)

    # Move inputs to the model's device
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Store the input length so only the generated tokens are extracted
    input_length = inputs["input_ids"].shape[1]

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=800,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            use_cache=False,
        )

    # Decode only the generated part
    generated_tokens = outputs[0][input_length:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# Example usage
antigen = "AYSTREILLALCIRDSRVHGNGTLHPVLELAARETPLRLSPEDTVVLRYHVLLEE[I]IE[R][N]RVLFATGSPYYDKNSP"
antibody = generate_antibody(model, tokenizer, antigen)
print(f"Antigen: {antigen}\nAntibody: {antibody}")
```
This generates a `|`-delimited output representing the Fv portions of the heavy and light chains:

```
Antigen: NPPTFSPALL...
Antibody: QVQLVQSGGG...|DIQMTQSPSS...
```
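Since the chains are separated by a single `|`, the output can be split apart for downstream use. A minimal sketch (`split_chains` is a hypothetical helper; the sequences are placeholders):

```python
def split_chains(antibody: str) -> tuple[str, str]:
    """Split a '|'-delimited Fv output into (heavy, light) chain sequences."""
    heavy, sep, light = antibody.partition("|")
    if not sep:
        raise ValueError("no '|' chain separator found in model output")
    return heavy.strip(), light.strip()

heavy, light = split_chains("QVQLVQSGGG|DIQMTQSPSS")  # placeholder sequences
print(heavy, light)  # QVQLVQSGGG DIQMTQSPSS
```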
## Alternative Loading Method (if vocabulary size issues persist)
```python
def load_model_with_vocab_fix(model_path):
    import torch  # needed for torch.bfloat16 below
    from peft import PeftConfig, PeftModel
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Load config
    config = PeftConfig.from_pretrained(model_path)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

    # Force resize to match the fine-tuned vocabulary size
    base_model.resize_token_embeddings(32005)

    # Load PEFT adapters
    model = PeftModel.from_pretrained(
        base_model,
        model_path,
        is_trainable=False,
    )
    model.eval()
    return model, tokenizer
```
## Special Tokens

The model recognizes the following special tokens:

- `<epi>`, `</epi>`: epitope markers
- `<s>`, `</s>`: sequence boundaries (Mistral BOS/EOS tokens)
- `|`: chain separator (between heavy and light chains)

Additional tokens that may be present:

- `Antigen`, `Antibody`, `Epitope`: task identifiers
- Single amino acid codes: `A`, `C`, `D`, `E`, `F`, `G`, `H`, `I`, `K`, `L`, `M`, `N`, `P`, `Q`, `R`, `S`, `T`, `V`, `W`, `Y`
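When decoding with `skip_special_tokens=False`, these markers can remain in the text. A small stdlib helper for removing them (the marker list mirrors the tokens above; `strip_markers` is illustrative, not part of this repo):

```python
SPECIAL_MARKERS = ("<epi>", "</epi>", "<s>", "</s>")

def strip_markers(text: str) -> str:
    """Remove annotation markers, leaving the bare amino-acid sequence."""
    for marker in SPECIAL_MARKERS:
        text = text.replace(marker, "")
    return text

print(strip_markers("CSFS<epi>S</epi><epi>F</epi>VL"))  # CSFSSFVL
```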
## Training Procedure

This model was trained with Supervised Fine-Tuning (SFT) on antibody-antigen pairs from the SAbDab database.
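The exact SFT template is not published on this card; a plausible sketch, assuming each training pair mirrors the inference prompt above (the helper `to_training_example` is an assumption, not the actual pipeline):

```python
def to_training_example(antigen: str, heavy: str, light: str) -> str:
    # Assumed format: same prompt as at inference time,
    # with the '|'-joined chains as the completion target
    return f"Antigen: <s>{antigen}</s>\nAntibody: {heavy}|{light}"

# Placeholder sequences for illustration only
print(to_training_example("NPPTFSPALL", "QVQLVQSGGG", "DIQMTQSPSS"))
```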
## Framework Versions

- PEFT: 0.17.0
- TRL: 0.19.1
- Transformers: 4.54.0
- PyTorch: 2.7.1
- Datasets: 4.0.0
- Tokenizers: 0.21.2
## Known Issues and Solutions

### Vocabulary Size Mismatch

If you encounter a `RuntimeError` about a size mismatch (32005 vs. 32000), ensure you resize the embeddings **before** loading the PEFT adapters:

```python
base_model.resize_token_embeddings(32005)  # Must be done before PeftModel.from_pretrained()
```
### Generation Parameters

For best results, use:

- `temperature`: 0.7-0.9 for diversity
- `max_new_tokens`: 800-1000 (antibody sequences can be long)
- `do_sample`: `True`
- `use_cache`: `False` (for memory efficiency with long sequences)
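These recommendations can be collected into a reusable kwargs dict (the specific values below simply pick points within the ranges above):

```python
generation_kwargs = {
    "max_new_tokens": 900,  # within the recommended 800-1000 range
    "do_sample": True,
    "temperature": 0.8,     # within the recommended 0.7-0.9 range
    "use_cache": False,     # memory efficiency with long sequences
}
# Usage: model.generate(**inputs, **generation_kwargs, pad_token_id=tokenizer.pad_token_id)
```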
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{peleke-mistral-2024,
  title={Peleke Mistral 7B Instruct v0.2 for Antibody Generation},
  author={SilicoBio},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/silicobio/peleke-mistral-7b-instruct-v0.2}
}
```