---
language: en
tags:
- bioinformatics
- microbiology
- microbiome
- taxonomy-classification
- deep-learning
- 16s-rrna
datasets:
- systems-genomics-lab/greengenes
metrics:
- accuracy
- precision
- recall
- f1
license: mit
model-index:
- name: DeepTaxa Hybrid CNN-BERT (April 2025)
  results:
  - task:
      type: classification
      name: Hierarchical Taxonomy Classification
    dataset:
      type: systems-genomics-lab/greengenes
      name: Greengenes (2024-09 Validation Split)
      split: validation
    metrics:
    - type: accuracy
      value: 0.9999258655200534
      name: Domain Accuracy
    - type: accuracy
      value: 0.9992339437072182
      name: Phylum Accuracy
    - type: accuracy
      value: 0.9988879828008006
      name: Class Accuracy
    - type: accuracy
      value: 0.9971581782687128
      name: Order Accuracy
    - type: accuracy
      value: 0.9950824128302074
      name: Family Accuracy
    - type: accuracy
      value: 0.9833444535053253
      name: Genus Accuracy
    - type: accuracy
      value: 0.9528751822472632
      name: Species Accuracy
---
DeepTaxa: Hybrid CNN-BERT Model (April 2025)
DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
Model Details
- Architecture: HybridCNNBERTClassifier (CNN + BERT)
- Tokenizer: `zhihan1996/DNABERT-2-117M`
- Training Data: Greengenes dataset (2024-09 split)
- Levels Predicted: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- Total Parameters: 72,635,154
- Max Sequence Length: 512
- Dropout Probability: 0.2
- License: MIT
- Version: April 2025
- File: `deeptaxa_april_2025.pt`
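The per-level label counts above fix the size of the prediction at each of the seven ranks. The following is a minimal sketch of that configuration as plain Python data. It assumes (the model card does not say so) that each rank has its own output of the listed size; the names `LEVEL_LABELS` and `total_outputs` are illustrative and are not part of the DeepTaxa code base.

```python
# Label counts per taxonomic level, copied from the model details above.
# LEVEL_LABELS and total_outputs are illustrative names, not DeepTaxa APIs.
LEVEL_LABELS = {
    "domain": 2,
    "phylum": 106,
    "class": 244,
    "order": 630,
    "family": 1353,
    "genus": 4798,
    "species": 10547,
}


def total_outputs(level_labels):
    """Return the total number of labels summed over all levels."""
    return sum(level_labels.values())


# Print a small summary table of labels per level.
for level, n_labels in LEVEL_LABELS.items():
    print(f"{level:<8}{n_labels:>7,}")
print(f"{'total':<8}{total_outputs(LEVEL_LABELS):>7,}")
```

The species level alone accounts for well over half of all labels, which is worth keeping in mind when reading the per-level metrics.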
Usage
Download the Model
To get started, download the pre-trained model file deeptaxa_april_2025.pt from this repository:
- Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- Command Line (wget):

      wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt

- Command Line (git clone):

      git clone https://huggingface.co/systems-genomics-lab/deeptaxa
      cd deeptaxa
      # The model file is now in the current directory
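The same download can be scripted. The sketch below uses only the Python standard library and rebuilds the `resolve/main` URL shown in the wget command. The helper names `resolve_url` and `download` are hypothetical, and the actual download call is left commented out because the file is large (871 MB).

```python
import urllib.request
from pathlib import Path

REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def resolve_url(repo_id, filename, revision="main"):
    """Build the direct-download URL, following the pattern of the wget command."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


print(resolve_url(REPO_ID, FILENAME))
# Uncomment to fetch the checkpoint (about 871 MB):
# download(resolve_url(REPO_ID, FILENAME), FILENAME)
```

Skipping the download when the file already exists keeps repeated runs of a pipeline from fetching the checkpoint again.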
Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:

    python -m deeptaxa.cli predict \
      --fasta-file /path/to/sequences.fna.gz \
      --checkpoint deeptaxa_april_2025.pt
Full instructions are available in the GitHub repository: https://github.com/systems-genomics-lab/deeptaxa
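Before running predictions, it can help to confirm that the input file parses as FASTA and to see how many sequences it holds. The sketch below is a standalone check built on the standard library; it is not part of the DeepTaxa CLI, and `read_fasta` and `summarize` are hypothetical helper names.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def summarize(path):
    """Return the sequence count and the shortest and longest sequence lengths."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {
        "count": len(lengths),
        "min": min(lengths, default=0),
        "max": max(lengths, default=0),
    }


# Example (path is a placeholder):
# print(summarize("/path/to/sequences.fna.gz"))
```

The reported lengths are in nucleotides. The model's maximum sequence length of 512 is a separate setting and is not checked here.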
Training Details
- Dataset: 161,866 training sequences and 40,467 validation sequences from Greengenes (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- Hyperparameters:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- Training Time: ~21 minutes (1,254 seconds) on an NVIDIA A40 GPU
- Timestamp: Trained on 2025-04-04
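To illustrate how the focal loss gamma and the level weights listed above might interact, here is a minimal pure-Python sketch. It uses the standard focal loss form, `-(1 - p)^gamma * log(p)`, where `p` is the predicted probability of the true class. How DeepTaxa actually combines the per-level losses is not stated in this card; the weighted sum in `weighted_total` is an assumption, and the probabilities are made up for the example.

```python
import math

GAMMA = 2.0
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]  # domain ... species


def focal_loss(p_true, gamma=GAMMA):
    """Focal loss for one example, given the probability of the true class.

    Requires 0 < p_true <= 1. With gamma = 0 this reduces to cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def weighted_total(p_true_per_level, weights=LEVEL_WEIGHTS, gamma=GAMMA):
    """Weighted sum of per-level focal losses (combination rule is assumed)."""
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# Hypothetical true-class probabilities for one sequence, domain to species.
probs = [0.999, 0.99, 0.99, 0.98, 0.97, 0.90, 0.70]
for weight, p in zip(LEVEL_WEIGHTS, probs):
    print(f"weight={weight:<4} p={p:<6} focal={focal_loss(p):.6f}")
print(f"total={weighted_total(probs):.6f}")
```

With gamma at 2.0, confidently correct predictions contribute almost nothing, so the total is dominated by the harder, more heavily weighted lower ranks.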
Performance
Validation metrics (on 40,467 sequences):
| Level | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Domain | 99.99% | 99.99% | 99.99% | 99.99% |
| Phylum | 99.92% | 99.92% | 99.92% | 99.92% |
| Class | 99.89% | 99.85% | 99.89% | 99.87% |
| Order | 99.72% | 99.64% | 99.72% | 99.67% |
| Family | 99.51% | 99.32% | 99.51% | 99.40% |
| Genus | 98.33% | 97.89% | 98.33% | 98.01% |
| Species | 95.29% | 94.34% | 95.29% | 94.56% |
- Training Loss: 0.283
- Validation Loss: 0.606
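In the table above, recall equals accuracy at every level. That pattern is what support-weighted averaging over classes produces, because weighted recall is mathematically identical to accuracy. The card does not state which averaging mode was used, so the sketch below is an illustration of weighted averaging under that assumption, written in plain Python with toy labels.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n_label in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / n_label
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_label / n       # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, not DeepTaxa output.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
print(weighted_metrics(y_true, y_pred))
```

Weighted precision and F1 can still differ from accuracy, which matches the table: those two columns diverge from the accuracy column from the class level downward.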
Intended Use
- Taxonomy classification in microbiome research and microbial ecology.
Limitations
- GPU recommended (trained on NVIDIA A40).
- Performance is lowest at the species level (95.29% accuracy, 94.34% precision), where the label space is largest (10,547 classes).
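The species-level limitation is easier to judge as a count of misclassified sequences. The sketch below converts the full-precision validation accuracies from this card's metadata into approximate error counts over the 40,467 validation sequences. The helper name `misclassified` is illustrative.

```python
N_VALIDATION = 40_467

# Full-precision validation accuracies from the model-index metadata.
ACCURACY = {
    "Domain": 0.9999258655200534,
    "Phylum": 0.9992339437072182,
    "Class": 0.9988879828008006,
    "Order": 0.9971581782687128,
    "Family": 0.9950824128302074,
    "Genus": 0.9833444535053253,
    "Species": 0.9528751822472632,
}


def misclassified(accuracy, n=N_VALIDATION):
    """Number of wrong predictions implied by an accuracy over n sequences."""
    return round(n * (1.0 - accuracy))


ERRORS = {level: misclassified(acc) for level, acc in ACCURACY.items()}
for level, count in ERRORS.items():
    print(f"{level:<8}{count:>6}")
```

Rounding to the nearest integer is used because each accuracy is a ratio of whole counts, so the product should land on an integer up to floating-point error.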
Citation
If you use this model in your research, please cite:
    @software{DeepTaxa,
      author = {{Systems Genomics Lab}},
      title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
      year = {2025},
      publisher = {GitHub},
      url = {https://github.com/systems-genomics-lab/deeptaxa},
    }
Contact
Open an issue on GitHub for support.
Acknowledgements
- Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
- Hugging Face for providing a platform to host datasets and models.
- The High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.