SPECTR

A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture.

Overview

SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token.

Key Features

CNN-based encoder for processing mass spectrometry peaks
Transformer decoder for sequential chemical formula generation
Support for all chemical elements from H to Og
Handles up to 300 peaks per spectrum
Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling
Training with gradient accumulation and learning rate scheduling
Integration with Weights & Biases for experiment tracking
Checkpoint saving and resumption support

Architecture

The model consists of two main components:

Encoder

4-layer CNN with batch normalization and max pooling
Processes m/z and intensity pairs
Global adaptive pooling for consistent output size
Output: encoded representation of spectral data

Decoder

Transformer decoder with multi-head attention
Token embedding with positional encoding
Causal masking for autoregressive generation
Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9)

Requirements

Python 3.8+
PyTorch 2.0+
NumPy
pandas
scikit-learn
wandb (for experiment tracking)
tqdm (optional, for progress bars)

Additional dependencies for data preparation:

requests
beautifulsoup4

Installation

Clone the repository:

git clone https://github.com/krll-corp/SPECTR.git
cd SPECTR

Install dependencies:

pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4

Data Format

The model expects data in JSONL format with the following structure:

{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}

Each line contains:

formula: Chemical formula as a string (e.g., "C6H12O6")
peaks: List of peak objects with m/z (mass-to-charge ratio) and intensity values

Usage

Training

Train the model using the train_conv.py script:

python train_conv.py

Training options:

--resume: Resume from previous run if available
--checkpoint: Path to checkpoint file (default: checkpoint_last.pt)
--save-every: Save checkpoint every N steps (default: 500)

The script expects a data file at ../mona_massbank_dataset.jsonl. Modify the data_file variable in the script to point to your dataset.

Training hyperparameters (configurable in the script):

Model dimension: 256
Attention heads: 8
Encoder/decoder layers: 8
Batch size: 64
Learning rate: 1e-4
Max training steps: 10,000
Gradient accumulation: 16 steps
Warmup steps: 3% of max steps

Evaluation

Evaluate a trained model using the eval_conv3.py script:

python eval_conv3.py --checkpoint <path_to_checkpoint> --data <path_to_data> --device cuda

Options:

--checkpoint: Path to trained model checkpoint (default: checkpoint_best.pt)
--data: Path to evaluation data file (default: massbank_dataset.jsonl)
--device: Device to use (choices: cpu, cuda, mps)
--strategy: Decoding strategy (choices: greedy, beam, top_k, top_p; default: greedy)
--beam-width: Beam width for beam search (default: 3)
--top-k: Top-k value for top-k sampling (default: 5)
--top-p: Top-p value for nucleus sampling (default: 0.9)
--limit: Limit number of samples to evaluate (optional)

Data Preparation

The data/ directory contains utilities for data collection and preparation:

crawler.py: Extract mass spectrometry data from MassBank web records
filtering.py: Process and filter MoNA and MassBank datasets into training format
analyze_massbank.py: Analyze MassBank dataset statistics

Example usage for data crawling:

cd data
python crawler.py

Model Specifications

Default configuration:

Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs)
Output: Chemical formula up to 50 tokens
Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved)
Model parameters: ~12M (base configuration)

Supported chemical elements: All elements from H (Hydrogen) to Og (Oganesson)

Special tokens:

<PAD>: Padding token
<SOS>: Start of sequence
<EOS>: End of sequence
<UNK>: Unknown token

Training Features

Automatic mixed precision training support
TF32 precision for Ampere GPUs
Learning rate warmup and cosine decay
Gradient accumulation for effective larger batch sizes
Token-level accuracy metric (excluding padding)
Validation every 1000 steps
Automatic checkpoint saving
Resume capability from saved checkpoints
Weights & Biases integration for tracking

Performance

The model is evaluated using:

Cross-entropy loss (ignoring padding tokens)
Token-level accuracy
Perplexity
Exact match accuracy (during evaluation)

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use SPECTR in your research, please cite this repository:

@software{spectr2025,
  author = {Kochkin, Kyryll},
  title = {SPECTR: Mass Spectrometry to Chemical Formula Prediction},
  year = {2025},
  url = {https://github.com/krll-corp/SPECTR}
}

Contributing

Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes.

Acknowledgments

This project uses data from:

MassBank: A public repository of mass spectra
MoNA (MassBank of North America): Mass spectral database

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

k050506koch
/

SPECTR