Perovskite Chemical Tokenizer
A specialized tokenizer for perovskite chemical formulas, trained on formulas spanning 44 chemical elements.
Features
- Chemical-aware tokenization preserving element boundaries
- Trained on a large-scale dataset of perovskite formulas
- Vocabulary size: 169
- Supports 44 chemical elements, counting organic cation abbreviations (MA, FA, etc.) among them
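To illustrate what "preserving element boundaries" can mean in practice, here is a minimal sketch of a longest-match formula splitter. It is not the tokenizer's actual implementation: the `SYMBOLS` list, `build_pattern`, and `split_formula` are hypothetical names chosen for this example, the symbol set is a small illustrative subset, and treating a decimal number as a single token is an assumption.

```python
import re

# Hypothetical, illustrative subset of symbols; the real vocabulary is stored in tokenizer.json.
SYMBOLS = ["DMA", "MA", "FA", "Cs", "Pb", "Sn", "Br", "Cl", "I"]


def build_pattern(symbols):
    """Compile a regex that matches known symbols, numbers, or any single leftover character."""
    # Try longer symbols first so "DMA" is matched before "MA",
    # and "Cl" is never split into "C" + "l".
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Assumption: a stoichiometric count such as "3" or "0.5" is kept as one piece.
    return re.compile(rf"{alternation}|\d+(?:\.\d+)?|.")


PATTERN = build_pattern(SYMBOLS)


def split_formula(formula):
    """Split a formula string into symbol and number pieces, left to right."""
    return PATTERN.findall(formula)


print(split_formula("MAPbI3"))  # ['MA', 'Pb', 'I', '3']
```

Sorting the symbols by length before building the alternation is what keeps multi-letter symbols and cation abbreviations intact; a plain character-level or generic subword split would not guarantee that.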
Usage
from tokenizers import Tokenizer
# Load the trained tokenizer from its JSON file
tokenizer = Tokenizer.from_file("tokenizer.json")
# Tokenize a formula; special tokens are added around the element and count tokens
result = tokenizer.encode("CsPbI3")
print(result.tokens) # ['[CLS]', 'Cs', 'Pb', 'I', '3', '[SEP]']
Elements Supported
DMA, Na, Te, Bi, Sb, Ti, Tb, Mg, Li, Cd, Yb, Ni, Nb, Mn, Cu, Tl, Zn, Br, Sr, Hg...