---
language: en
license: mit
base_model: google-t5/t5-base
tags:
- log-translation
- security-logs
- cloud-logs
- siem
- seq2seq
- peft
- lora
- t5
datasets:
- custom
pipeline_tag: translation
model-index:
- name: native-log-translator
  results:
  - task:
      type: translation
    metrics:
    - type: accuracy
      value: 78
      name: Test Accuracy (14 cases)
---

# 🔐 Native Log Translator

> Maps heterogeneous cloud and OS logs → unified normalized schema using seq2seq generation.

Fine-tuned from `google-t5/t5-base` using **LoRA (PEFT)** on a curated multi-provider security log dataset. Trained on Kaggle T4 x2.

---

## 📊 Evaluation Results

Tested on 14 cases (9 seen during training + 5 unseen, to check generalisation):

| Input Log | Predicted | Expected | Status |
|---|---|---|---|
| `AzureSignInLogs \| ResultType=0` | authentication_success / azure / low | authentication_success / azure / low | ✅ |
| `AzureSignInLogs \| ResultType=50126` | account_disabled / azure / medium | authentication_failure / azure / high | ⚠️ |
| `SecurityEvent \| EventID=4688 \| NewProcessName=mimikatz.exe` | suspicious_process_creation / windows / critical | suspicious_process_creation / windows / critical | ✅ |
| `SecurityEvent \| EventID=1102 \| SubjectUserName=admin` | explicit_credential_use / windows / medium | audit_log_cleared / windows / critical | ⚠️ |
| `CloudTrail \| eventName=DeleteTrail` | resource_deletion / aws / medium | audit_trail_deletion / aws / critical | ⚠️ |
| `CloudTrail \| eventName=CreateAccessKey` | access_key_created / aws / high | access_key_created / aws / high | ✅ |
| `GCPAuditLog \| methodName=SetIamPolicy` | explicit_policy_change / gcp / high | iam_policy_change / gcp / high | ✅ |
| `Syslog \| ProcessName=sudo \| COMMAND=/bin/bash` | privilege_escalation / linux / critical | privilege_escalation / linux / critical | ✅ |
| `CommonSecurityLog \| ThreatName=Mirai.Botnet` | botnet_traffic_blocked / fortinet / critical | botnet_traffic_blocked / fortinet / critical | ✅ |
| `AzureSignInLogs \| ResultType=50055` *(unseen)* | account_disabled / azure / medium | auth_error / azure / ? | ✅ |
| `CloudTrail \| eventName=DeleteUser` *(unseen)* | user_deletion / aws / medium | user_deletion / aws / ? | ✅ |
| `SecurityEvent \| EventID=4625 \| LogonType=10` *(unseen)* | authentication_failure / windows / high | authentication_failure / windows / high | ✅ |
| `Syslog \| SyslogMessage=password changed for root` *(unseen)* | authentication_failure / linux / medium | password_change / linux / ? | ✅ |
| `CommonSecurityLog \| DeviceAction=deny \| DestPort=3389` *(unseen)* | network_connection_blocked / paloalto / medium | rdp_blocked / paloalto / ? | ✅ |

**Overall Score: 11/14 (≈78.6% accuracy)**

Status is not a strict exact match: some ✅ rows differ from the expected `event_type` (for example `explicit_policy_change` vs `iam_policy_change`), and `?` marks an expected risk level that was left unspecified.

---

## 🚀 Quick Start

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer
from peft import PeftModel

MODEL_REPO = "Swapnanil09/native-log-translator"
BASE_MODEL = "google-t5/t5-base"

# Load the tokenizer from the adapter repo and the base model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO, use_fast=True)
base = T5ForConditionalGeneration.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)

# Attach the LoRA adapter weights to the base model.
model = PeftModel.from_pretrained(base, MODEL_REPO)
model.eval()


def translate_log(log):
    """Translate one raw log line into the normalized schema string."""
    inputs = tokenizer(
        log, return_tensors="pt", max_length=128, truncation=True
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=64,
            num_beams=5,
            early_stopping=True,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)


print(translate_log("AzureSignInLogs | ResultType=0"))
# event_type: authentication_success
# provider: azure
# risk_level: low

print(translate_log("SecurityEvent | EventID=4688 | NewProcessName=mimikatz.exe"))
# event_type: suspicious_process_creation
# provider: windows
# risk_level: critical
```

---

## 📋 Output Schema

| Field | Values |
|---|---|
| `event_type` | `authentication_success` · `authentication_failure` · `privilege_escalation` · `resource_deletion` · `suspicious_process_creation` · `audit_log_cleared` · `iam_policy_change` · `access_key_created` · `botnet_traffic_blocked` · `user_deletion` ... |
| `provider` | `azure` · `aws` · `gcp` · `windows` · `linux` · `paloalto` · `cisco` · `fortinet` |
| `risk_level` | `low` · `medium` · `high` · `critical` |

---

## 📦 Supported Log Sources

| Provider | Log Types |
|---|---|
| **Azure** | SignInLogs · Activity · NSGFlowLogs · KeyVault |
| **AWS** | CloudTrail |
| **GCP** | Audit Logs |
| **Windows** | Security Events 4624 · 4625 · 4648 · 4657 · 4663 · 4688 · 4698 · 4720 · 4732 · 4740 · 1102 |
| **Linux** | Syslog auth · kern · cron |
| **Network** | Palo Alto · Cisco · Fortinet via CommonSecurityLog |

---

## ⚙️ Training Details

| Setting | Value |
|---|---|
| Base model | `google-t5/t5-base` |
| Architecture | Encoder-decoder (native seq2seq) |
| Method | LoRA (PEFT) |
| Task type | SEQ_2_SEQ_LM |
| LoRA rank / alpha | 32 / 64 |
| Target modules | q · k · v · o |
| Epochs | 40 |
| Effective batch size | 16 (4 per device × 2 GPUs × 2 gradient accumulation steps) |
| Learning rate | 3e-4 with warmup + weight decay |
| Decoding strategy | Beam search (beams=5) |
| Hardware | Kaggle T4 x2 |
| Trainable parameters | ~1.5% of total |

---

## 🔍 Known Gaps & Improvements

| Gap | Affected Cases | Fix |
|---|---|---|
| `ResultType=50126` maps to `account_disabled` instead of `authentication_failure` | Azure SignIn error codes | Add more Azure ResultType variants to training data |
| `EventID=1102` maps to `explicit_credential_use` instead of `audit_log_cleared` | Windows audit events | Add more Windows EventID examples |
| `DeleteTrail` maps to `resource_deletion` instead of `audit_trail_deletion` | AWS CloudTrail specific ops | Add CloudTrail-specific deletion variants |

---

## ⚠️ Limitations

- Trained on ~120 curated + augmented examples; fine-tune on your own corpus for production use
- Risk level calibration improves with more labelled examples per provider
- Validate schema output before ingesting into automated SIEM pipelines
- Not a drop-in replacement for rule-based parsers without a validation layer
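
The validation layer mentioned above could look roughly like the sketch below. It is a minimal illustration, not part of this repo: the helper names `parse_prediction` and `validate_prediction` are hypothetical, and it assumes the model emits `key: value` pairs separated by spaces or newlines, as in the Quick Start comments. Only `provider` and `risk_level` are checked against closed sets, since the `event_type` list in the Output Schema is open-ended.

```python
import re

# Allowed values copied from the Output Schema table.
PROVIDERS = {"azure", "aws", "gcp", "windows", "linux", "paloalto", "cisco", "fortinet"}
RISK_LEVELS = {"low", "medium", "high", "critical"}
REQUIRED_FIELDS = ("event_type", "provider", "risk_level")

# Assumption: output looks like "event_type: x provider: y risk_level: z",
# with fields separated by spaces or newlines.
FIELD_RE = re.compile(r"(event_type|provider|risk_level)\s*:\s*([A-Za-z0-9_]+)")


def parse_prediction(text):
    """Extract the schema fields found in raw model output into a dict."""
    return {key: value.lower() for key, value in FIELD_RE.findall(text)}


def validate_prediction(text):
    """Return (record, errors); an empty error list means the record passed."""
    record = parse_prediction(text)
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "provider" in record and record["provider"] not in PROVIDERS:
        errors.append(f"unknown provider: {record['provider']}")
    if "risk_level" in record and record["risk_level"] not in RISK_LEVELS:
        errors.append(f"unknown risk_level: {record['risk_level']}")
    return record, errors
```

In a pipeline, a record would only be forwarded to the SIEM when `errors` is empty; anything else can be routed to a fallback parser or a review queue.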