Perovskite Chemical Tokenizer

A specialized tokenizer for perovskite chemical formulas, trained on formulas covering 44 chemical species (elements and organic cations).

Features

  • Chemical-aware tokenization that preserves element boundaries (for example, "Cs" stays one token rather than splitting into "C" and "s")
  • Trained on a large-scale dataset of perovskite formulas
  • Vocabulary size: 169
  • Supports 44 chemical species: elements plus organic cations (MA, FA, etc.)
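To illustrate what "preserving element boundaries" means, here is a minimal sketch of a longest-match splitter in plain Python. It is not the tokenizer's actual implementation: the SPECIES list is a small hypothetical subset chosen for the example, the helper names build_pattern and split_formula are made up, and the handling of decimal stoichiometries is an assumption. The real vocabulary and rules live in tokenizer.json.

```python
import re

# Hypothetical subset of species, for illustration only.
# The real tokenizer covers 44 species defined in tokenizer.json.
SPECIES = ["DMA", "MA", "FA", "Cs", "Pb", "Sn", "Br", "Cl", "I"]


def build_pattern(species):
    """Compile a regex that matches one formula token at a time."""
    # Try longer symbols first so that "DMA" is preferred over "MA".
    ordered = sorted(species, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # A token is a known species, a decimal or integer count, or a bracket.
    # Treating "0.5" as a single token is an assumption of this sketch.
    return re.compile(rf"{alternatives}|\d+\.\d+|\d+|[()\[\]]")


def split_formula(formula, pattern):
    """Split a formula into tokens, failing loudly on unknown text."""
    tokens = []
    pos = 0
    while pos < len(formula):
        match = pattern.match(formula, pos)
        if match is None:
            raise ValueError(f"unrecognized text at position {pos}: {formula[pos:]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


PATTERN = build_pattern(SPECIES)
print(split_formula("CsPbI3", PATTERN))
print(split_formula("MAPbBr3", PATTERN))
```

Ordering the alternatives by length is the key design choice: without it, a formula such as "DMAPbI3" could be split at "MA" or fail on the leading "D" instead of yielding the single cation token "DMA".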

Usage

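from tokenizers import Tokenizer
# PerovskiteTokenizer is not used below; loading tokenizer.json
# needs only the tokenizers library.
from perovskite_tokenizer import PerovskiteTokenizer

# Load the trained tokenizer from its serialized JSON file
tokenizer = Tokenizer.from_file("tokenizer.json")

# Tokenize a formula; encode() returns an Encoding object
result = tokenizer.encode("CsPbI3")
print(result.tokens)  # ['[CLS]', 'Cs', 'Pb', 'I', '3', '[SEP]']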

Elements Supported

DMA, Na, Te, Bi, Sb, Ti, Tb, Mg, Li, Cd, Yb, Ni, Nb, Mn, Cu, Tl, Zn, Br, Sr, Hg...
