---
license: apache-2.0
tags:
- domain-generation-algorithm
- cybersecurity
- domain-classification
- security
- malware-detection
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
---

# ModernBERT DGA Detector

This model classifies domain names as either legitimate or generated by a Domain Generation Algorithm (DGA).

## Model Description

- **Model Type:** BERT-based sequence classification
- **Task:** Binary classification (legitimate vs. DGA domains)
- **Base Model:** ModernBERT-base
- **Training Data:** Domain names dataset
- **Authors:** Reynier Leyva La O, Carlos A. Catania

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")

# Example prediction
def predict_domain(domain):
    # Tokenize a single domain, truncating to the model's 64-token limit
    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert logits to class probabilities (index 0 = LEGITIMATE, index 1 = DGA)
    predictions = torch.softmax(outputs.logits, dim=-1)
    legit_prob = predictions[0][0].item()
    dga_prob = predictions[0][1].item()
    return {
        "prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
        "confidence": max(legit_prob, dga_prob),
    }

# Test examples
domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain in domains:
    result = predict_domain(domain)
    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")
```

## Model Architecture

The model is based on ModernBERT and fine-tuned for domain classification:

- Input: Domain names (text)
- Output: Binary classification (0=LEGITIMATE, 1=DGA)
- Max sequence length: 64 tokens

## Training Details

This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:

- Base model: answerdotai/ModernBERT-base
- Framework: Transformers/PyTorch
- Task: Binary sequence classification

## Performance

Reported evaluation metrics:

- Accuracy: 0.9658 ± 0.0153
- Precision: 0.9704 ± 0.0253
- Recall: 0.9582 ± 0.0147
- F1-Score: 0.9579 ± 0.0167
- FPR: 0.0267 ± 0.0233
- TPR: 0.9582 ± 0.0147
- Query time: 0.1226 ± 0.0253, measured on CPU (no GPU required)

## Use Cases

- **Cybersecurity**: Detect malicious domains generated by malware
- **Network Security**: Filter potentially harmful domains
- **Threat Intelligence**: Analyze domain patterns in security feeds

## Limitations

- The model is trained specifically for domain classification
- Performance may vary on domains from different TLDs or languages
- Regular retraining may be needed as DGA techniques evolve
- Model performance depends on the quality and diversity of the training data

## Citation

If you use this model in your research or applications, please cite it appropriately.

## Related Models

Check out the author's other security models:

- [Llama3_8B-DGA-Detector](https://huggingface.co/Reynier/Llama3_8B-DGA-Detector)
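
## Metric Definitions (Sketch)

The quantities listed under Performance can all be derived from a binary confusion matrix. The card does not describe the evaluation protocol, so the following is only a minimal sketch that assumes DGA (label 1) is the positive class, which is consistent with recall and TPR being reported with identical values. The `binary_metrics` helper and the toy labels are hypothetical and are not the authors' evaluation code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float  # same quantity as the true positive rate (TPR)
    f1: float
    fpr: float     # false positive rate


def binary_metrics(y_true, y_pred):
    """Compute metrics from 0/1 labels, treating 1 (DGA) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return BinaryMetrics(
        accuracy=safe_div(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        f1=safe_div(2 * precision * recall, precision + recall),
        fpr=safe_div(fp, fp + tn),
    )


# Toy labels invented for illustration only (not the model's evaluation data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice, `y_pred` would come from mapping the `prediction` field returned by `predict_domain` to 0 (LEGITIMATE) or 1 (DGA) over a labeled set of domains.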