SPECTR

A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture.

Overview

SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token.

Key Features

  • CNN-based encoder for processing mass spectrometry peaks
  • Transformer decoder for sequential chemical formula generation
  • Support for all chemical elements from H to Og
  • Handles up to 300 peaks per spectrum
  • Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling
  • Training with gradient accumulation and learning rate scheduling
  • Integration with Weights & Biases for experiment tracking
  • Checkpoint saving and resumption support

Architecture

The model consists of two main components:

Encoder

  • 4-layer CNN with batch normalization and max pooling
  • Processes m/z and intensity pairs
  • Global adaptive pooling for consistent output size
  • Output: encoded representation of spectral data

Decoder

  • Transformer decoder with multi-head attention
  • Token embedding with positional encoding
  • Causal masking for autoregressive generation
  • Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9)
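A minimal sketch of this layout in PyTorch is shown below; the module names, the learned positional encoding, and the pooled memory length are illustrative assumptions, and the actual implementation in train_conv.py may differ in detail.

import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    """4 conv blocks over (m/z, intensity) channels, followed by global adaptive pooling."""
    def __init__(self, d_model=256):
        super().__init__()
        blocks, channels = [], [2, 64, 128, 256, d_model]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(2)]
        self.conv = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool1d(1)      # global pooling -> fixed-size memory

    def forward(self, peaks):                    # peaks: (batch, 300, 2)
        x = self.conv(peaks.transpose(1, 2))     # -> (batch, d_model, L)
        return self.pool(x).transpose(1, 2)      # -> (batch, 1, d_model)

class FormulaDecoder(nn.Module):
    """Autoregressive transformer decoder over the formula token vocabulary."""
    def __init__(self, vocab_size=135, d_model=256, nhead=8, num_layers=8, max_len=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions (the repo may use sinusoidal)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):           # tokens: (batch, T) token ids
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        h = self.decoder(self.embed(tokens) + self.pos[:, :t], memory, tgt_mask=causal)
        return self.out(h)                       # (batch, T, vocab_size) logits

# Usage: logits = FormulaDecoder()(token_ids, SpectrumEncoder()(peak_tensor))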

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • NumPy
  • pandas
  • scikit-learn
  • wandb (for experiment tracking)
  • tqdm (optional, for progress bars)

Additional dependencies for data preparation:

  • requests
  • beautifulsoup4

Installation

  1. Clone the repository:
git clone https://github.com/krll-corp/SPECTR.git
cd SPECTR
  2. Install dependencies:
pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4

Data Format

The model expects data in JSONL format with the following structure:

{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}

Each line contains:

  • formula: Chemical formula as a string (e.g., "C6H12O6")
  • peaks: List of peak objects with m/z (mass-to-charge ratio) and intensity values
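A minimal loading sketch for this format is shown below; truncating to the 300 most intense peaks and zero-padding the remainder are assumptions made here for illustration, not necessarily how the training scripts preprocess the data.

import json
import torch

MAX_PEAKS = 300

def load_record(line, max_peaks=MAX_PEAKS):
    """Parse one JSONL line into (formula string, peak tensor of shape (max_peaks, 2))."""
    rec = json.loads(line)
    peaks = sorted(rec["peaks"], key=lambda p: p["intensity"], reverse=True)[:max_peaks]
    x = torch.zeros(max_peaks, 2)                # columns: m/z, intensity; zero-padded
    for i, p in enumerate(peaks):
        x[i, 0] = p["m/z"]
        x[i, 1] = p["intensity"]
    return rec["formula"], x

with open("mona_massbank_dataset.jsonl") as f:
    formula, peaks = load_record(next(f))
print(formula, peaks.shape)                      # e.g. C6H12O6 torch.Size([300, 2])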

Usage

Training

Train the model using the train_conv.py script:

python train_conv.py

Training options:

  • --resume: Resume from previous run if available
  • --checkpoint: Path to checkpoint file (default: checkpoint_last.pt)
  • --save-every: Save checkpoint every N steps (default: 500)
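For example, to resume from the last checkpoint and save every 1000 steps:

python train_conv.py --resume --checkpoint checkpoint_last.pt --save-every 1000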

The script expects a data file at ../mona_massbank_dataset.jsonl. Modify the data_file variable in the script to point to your dataset.

Training hyperparameters (configurable in the script):

  • Model dimension: 256
  • Attention heads: 8
  • Encoder/decoder layers: 8
  • Batch size: 64
  • Learning rate: 1e-4
  • Max training steps: 10,000
  • Gradient accumulation: 16 steps
  • Warmup steps: 3% of max steps
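With these defaults, gradient accumulation gives an effective batch size of 64 × 16 = 1024 samples per optimizer step, and warmup covers 300 of the 10,000 steps. A minimal sketch of gathering the settings in one place (the names below are chosen here for illustration, not taken from the script):

from dataclasses import dataclass

@dataclass
class TrainConfig:
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 8               # depth used for both encoder and decoder
    batch_size: int = 64
    lr: float = 1e-4
    max_steps: int = 10_000
    grad_accum_steps: int = 16
    warmup_frac: float = 0.03

    @property
    def warmup_steps(self) -> int:
        return int(self.max_steps * self.warmup_frac)      # 300 with the defaults

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum_steps     # 1024 with the defaults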

Evaluation

Evaluate a trained model using the eval_conv3.py script:

python eval_conv3.py --checkpoint <path_to_checkpoint> --data <path_to_data> --device cuda

Options:

  • --checkpoint: Path to trained model checkpoint (default: checkpoint_best.pt)
  • --data: Path to evaluation data file (default: massbank_dataset.jsonl)
  • --device: Device to use (choices: cpu, cuda, mps)
  • --strategy: Decoding strategy (choices: greedy, beam, top_k, top_p; default: greedy)
  • --beam-width: Beam width for beam search (default: 3)
  • --top-k: Top-k value for top-k sampling (default: 5)
  • --top-p: Top-p value for nucleus sampling (default: 0.9)
  • --limit: Limit number of samples to evaluate (optional)
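The strategies differ only in how the next-token distribution is pruned at each decoding step. Below is a minimal, self-contained sketch of top-k and top-p (nucleus) filtering over a single step's logit vector; it illustrates the idea rather than reproducing the code in eval_conv3.py.

import torch

def filter_logits(logits, top_k=0, top_p=1.0):
    """Prune a 1-D next-token logit vector for top-k and/or top-p (nucleus) sampling."""
    logits = logits.clone()
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()         # shift so the first token past the threshold is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return logits

step_logits = torch.randn(135)                   # one decoding step over the 135-token vocabulary
probs = torch.softmax(filter_logits(step_logits, top_k=5, top_p=0.9), dim=-1)
next_token = torch.multinomial(probs, 1)         # greedy decoding would take step_logits.argmax() instead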

Data Preparation

The data/ directory contains utilities for data collection and preparation:

  • crawler.py: Extract mass spectrometry data from MassBank web records
  • filtering.py: Process and filter MoNA and MassBank datasets into training format
  • analyze_massbank.py: Analyze MassBank dataset statistics

Example usage for data crawling:

cd data
python crawler.py

Model Specifications

Default configuration:

  • Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs)
  • Output: Chemical formula of up to 50 tokens
  • Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved)
  • Model parameters: ~12M (base configuration)

Supported chemical elements: All elements from H (Hydrogen) to Og (Oganesson)

Special tokens:

  • <PAD>: Padding token
  • <SOS>: Start of sequence
  • <EOS>: End of sequence
  • <UNK>: Unknown token
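A minimal sketch of how such a vocabulary can be laid out and how a formula string maps to token ids is shown below; the element list is truncated for brevity, the 3 reserved slots are omitted, and the exact token ordering in the repository may differ.

import re

SPECIAL = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]   # ...continues through "Og" (118 symbols)
DIGITS = [str(d) for d in range(10)]

vocab = SPECIAL + ELEMENTS + DIGITS              # with all 118 elements plus 3 reserved slots: 135 tokens
tok2id = {t: i for i, t in enumerate(vocab)}

def tokenize(formula):
    """Split 'C6H12O6' into ['C', '6', 'H', '1', '2', 'O', '6']."""
    return re.findall(r"[A-Z][a-z]?|\d", formula)

ids = ([tok2id["<SOS>"]]
       + [tok2id.get(t, tok2id["<UNK>"]) for t in tokenize("C6H12O6")]
       + [tok2id["<EOS>"]])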

Training Features

  • Automatic mixed precision training support
  • TF32 precision for Ampere GPUs
  • Learning rate warmup and cosine decay
  • Gradient accumulation for effective larger batch sizes
  • Token-level accuracy metric (excluding padding)
  • Validation every 1000 steps
  • Automatic checkpoint saving
  • Resume capability from saved checkpoints
  • Weights & Biases integration for tracking
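A minimal sketch of the warmup-plus-cosine schedule and gradient accumulation listed above, using standard PyTorch utilities with a toy stand-in model; the exact warmup shape and decay floor used in train_conv.py are assumptions here.

import math
import torch

def lr_lambda(step, warmup_steps=300, max_steps=10_000):
    """Linear warmup to the base learning rate, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = torch.nn.Linear(2, 2)                    # stand-in for the SPECTR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

accum = 16                                       # gradient accumulation steps
batches = [(torch.randn(4, 2), torch.randn(4, 2)) for _ in range(64)]   # toy data
for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum
    loss.backward()                              # gradients accumulate across micro-batches
    if (i + 1) % accum == 0:                     # one optimizer step per 16 micro-batches
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()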

Performance

The model is evaluated using:

  • Cross-entropy loss (ignoring padding tokens)
  • Token-level accuracy
  • Perplexity
  • Exact match accuracy (during evaluation)
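A minimal sketch of the token-level metrics, assuming logits of shape (batch, sequence, vocab) and a known padding token id; perplexity is simply the exponential of the padding-masked cross-entropy, and exact match would additionally compare full decoded formulas.

import torch
import torch.nn.functional as F

def metrics(logits, targets, pad_id=0):
    """Cross-entropy (padding ignored), token-level accuracy, and perplexity."""
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
    mask = targets != pad_id                     # exclude padding positions from accuracy
    preds = logits.argmax(dim=-1)
    accuracy = (preds[mask] == targets[mask]).float().mean()
    perplexity = torch.exp(loss)
    return loss.item(), accuracy.item(), perplexity.item()

logits, targets = torch.randn(2, 50, 135), torch.randint(0, 135, (2, 50))   # toy batch
print(metrics(logits, targets))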

License

This project is licensed under the MIT License. See the LICENSE file for details.

Copyright (c) 2025 Kyryll Kochkin

Citation

If you use SPECTR in your research, please cite this repository:

@software{spectr2025,
  author = {Kochkin, Kyryll},
  title = {SPECTR: Mass Spectrometry to Chemical Formula Prediction},
  year = {2025},
  url = {https://github.com/krll-corp/SPECTR}
}

Contributing

Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes.

Acknowledgments

This project uses data from:

  • MassBank: A public repository of mass spectra
  • MoNA (MassBank of North America): Mass spectral database