Ahmed Moustafa
metadata
language: en
tags:
  - bioinformatics
  - microbiology
  - microbiome
  - taxonomy-classification
  - deep-learning
  - 16s-rrna
datasets:
  - systems-genomics-lab/greengenes
metrics:
  - accuracy
  - precision
  - recall
  - f1
license: mit
model-index:
  - name: DeepTaxa Hybrid CNN-BERT (April 2025)
    results:
      - task:
          type: classification
          name: Hierarchical Taxonomy Classification
        dataset:
          type: systems-genomics-lab/greengenes
          name: Greengenes (2024-09 Validation Split)
          split: validation
        metrics:
          - type: accuracy
            value: 0.9999258655200534
            name: Domain Accuracy
          - type: accuracy
            value: 0.9992339437072182
            name: Phylum Accuracy
          - type: accuracy
            value: 0.9988879828008006
            name: Class Accuracy
          - type: accuracy
            value: 0.9971581782687128
            name: Order Accuracy
          - type: accuracy
            value: 0.9950824128302074
            name: Family Accuracy
          - type: accuracy
            value: 0.9833444535053253
            name: Genus Accuracy
          - type: accuracy
            value: 0.9528751822472632
            name: Species Accuracy

DeepTaxa: Hybrid CNN-BERT Model (April 2025)

DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

Model Details

  • Architecture: HybridCNNBERTClassifier (CNN + BERT)
  • Tokenizer: zhihan1996/DNABERT-2-117M
  • Training Data: Greengenes dataset (2024-09 split)
  • Levels Predicted: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
  • Total Parameters: 72,635,154
  • Max Sequence Length: 512
  • Dropout Probability: 0.2
  • License: MIT
  • Version: April 2025
  • File: deeptaxa_april_2025.pt

Usage

Download the Model

To get started, download the pre-trained model file deeptaxa_april_2025.pt from this repository:

  • Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download deeptaxa_april_2025.pt (871 MB).
  • Command Line (wget):
    wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
    
  • Command Line (git clone):
    git clone https://huggingface.co/systems-genomics-lab/deeptaxa
    cd deeptaxa
    # The model file is now in the current directory
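The same download can be scripted. The following is a minimal sketch using only the Python standard library; it reuses the URL pattern from the wget command above. The function names `checkpoint_url` and `download_checkpoint` are hypothetical helpers for illustration and are not part of DeepTaxa.

```python
import shutil
import urllib.request
from pathlib import Path

REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def checkpoint_url(repo_id: str = REPO_ID, filename: str = FILENAME,
                   revision: str = "main") -> str:
    """Build the direct-download URL, following the wget command's pattern."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(dest_dir: str = ".") -> Path:
    """Stream the checkpoint into an existing directory.

    Hypothetical helper: returns early if the file is already present, and
    writes to a temporary ".part" file so an interrupted download is not
    mistaken for a complete checkpoint.
    """
    dest = Path(dest_dir) / FILENAME
    if dest.exists():
        return dest
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(checkpoint_url()) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all in memory
    tmp.replace(dest)
    return dest
```

Calling `download_checkpoint()` fetches the file into the current directory; nothing is downloaded until the function is called.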
    

Run Predictions

Once downloaded, use the model with the DeepTaxa CLI:

python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt

Full instructions are available in the GitHub repository (https://github.com/systems-genomics-lab/deeptaxa).
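To call the CLI from a Python script, the invocation above can be wrapped with `subprocess`. This is a sketch that uses only the `predict` subcommand and the two flags shown above; `build_predict_command` and `run_predict` are hypothetical names, and any further CLI options are documented in the GitHub repository rather than here.

```python
import subprocess
import sys
from pathlib import Path


def build_predict_command(fasta_file: str, checkpoint: str) -> list[str]:
    """Assemble the argument list for the predict command shown above."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", str(fasta_file),
        "--checkpoint", str(checkpoint),
    ]


def run_predict(fasta_file: str, checkpoint: str) -> subprocess.CompletedProcess:
    """Check that both input files exist, then run the CLI.

    Raises FileNotFoundError for a missing input and
    subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    for path in (fasta_file, checkpoint):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    return subprocess.run(build_predict_command(fasta_file, checkpoint), check=True)
```

Using `sys.executable` runs the CLI with the same interpreter (and therefore the same environment) as the calling script.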

Training Details

  • Dataset: 161,866 training and 40,467 validation sequences (roughly 80% and 20% of 202,333 sequences in total) from Greengenes (gg_2024_09_training.fna.gz, gg_2024_09_training.tsv.gz)
  • Hyperparameters:
    • Learning Rate: 0.0001
    • Batch Size: 16
    • Epochs: 10
    • Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
    • Focal Loss Gamma: 2.0
    • Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
  • Training Time: ~21 minutes (1,254 seconds) on an NVIDIA A40 GPU
  • Timestamp: Trained on 2025-04-04
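To illustrate how the focal loss gamma and the level weights listed above could interact, here is a minimal pure-Python sketch. It uses the standard focal loss form, -(1 - p)^gamma * log(p), for a single example and assumes the seven per-level losses are combined as a plain weighted sum. Both the exact loss formulation and the way levels are combined are assumptions; this card does not specify them, and the DeepTaxa source code is the authority.

```python
import math

GAMMA = 2.0
# Weights in level order: domain, phylum, class, order, family, genus, species.
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]


def focal_loss(p_true: float, gamma: float = GAMMA) -> float:
    """Standard focal loss for one example.

    p_true is the predicted probability of the correct class, in (0, 1].
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def weighted_total(level_losses, weights=LEVEL_WEIGHTS) -> float:
    """Combine per-level losses as a weighted sum (assumed combination rule)."""
    if len(level_losses) != len(weights):
        raise ValueError("expected one loss per taxonomic level")
    return sum(w * loss for w, loss in zip(weights, level_losses))
```

With gamma = 2.0, a confident correct prediction (p = 0.9) is scaled by (1 - 0.9)^2 = 0.01, whereas an uncertain one (p = 0.5) is scaled by 0.25, so harder examples dominate the loss. The increasing level weights give species-level errors five times the weight of domain-level errors under the assumed weighted sum.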

Performance

Validation metrics (on 40,467 sequences):

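| Level   | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Domain  | 99.99%   | 99.99%    | 99.99% | 99.99%   |
| Phylum  | 99.92%   | 99.92%    | 99.92% | 99.92%   |
| Class   | 99.89%   | 99.85%    | 99.89% | 99.87%   |
| Order   | 99.72%   | 99.64%    | 99.72% | 99.67%   |
| Family  | 99.51%   | 99.32%    | 99.51% | 99.40%   |
| Genus   | 98.33%   | 97.89%    | 98.33% | 98.01%   |
| Species | 95.29%   | 94.34%    | 95.29% | 94.56%   |
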
  • Training Loss: 0.283
  • Validation Loss: 0.606
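In every row of the validation metrics, recall equals accuracy. The card does not state how the per-class metrics were averaged, but this pattern is what support-weighted averaging produces: weighting each class's recall by its share of the samples always gives the overall accuracy. The sketch below demonstrates the identity on toy labels (the labels are made up for illustration and are not DeepTaxa output).

```python
from collections import Counter


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred) -> dict:
    """Recall for each class that appears in y_true."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / support[c] for c in support}


def weighted_recall(y_true, y_pred) -> float:
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    recall = per_class_recall(y_true, y_pred)
    return sum(support[c] / len(y_true) * recall[c] for c in support)


def macro_recall(y_true, y_pred) -> float:
    """Unweighted mean of per-class recall, shown for contrast."""
    recall = per_class_recall(y_true, y_pred)
    return sum(recall.values()) / len(recall)


# Toy labels, for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]
```

On these toy labels, accuracy and support-weighted recall are both 4/6, while macro recall is lower because the rare class "C" is always missed. Read this way, the table's metrics are dominated by well-represented taxa, so performance on rare taxa may be lower than the headline numbers suggest.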

Intended Use

  • Taxonomy classification in microbiome research and microbial ecology.

Limitations

  • GPU recommended (trained on NVIDIA A40).
  • Species-level performance is lower than at the other levels (95.29% accuracy, 94.34% precision), likely because of the much larger label space (10,547 classes).

Citation

If you use this model in your research, please cite:

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}

Contact

For support, open an issue on the GitHub repository (https://github.com/systems-genomics-lab/deeptaxa).

Acknowledgements