---
language: en
license: mit
base_model: google-t5/t5-base
tags:
  - log-translation
  - security-logs
  - cloud-logs
  - siem
  - seq2seq
  - peft
  - lora
  - t5
datasets:
  - custom
pipeline_tag: translation
model-index:
  - name: native-log-translator
    results:
      - task:
          type: translation
        metrics:
          - type: accuracy
            value: 78.6
            name: Test Accuracy (14 cases)
---

# 🔐 Native Log Translator

> Maps heterogeneous cloud, OS, and network security logs to a unified, normalized schema using seq2seq generation.

Fine-tuned from `google-t5/t5-base` using **LoRA (PEFT)** on a curated multi-provider
security log dataset. Trained on Kaggle T4 x2.

---

## 📊 Evaluation Results

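Tested on 14 cases: 9 seen during training and 5 unseen cases that test generalisation. Some rows marked ✅ differ from the expected label in wording (for example `explicit_policy_change` vs `iam_policy_change`), so the Status column is not a strict exact-match check.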

| Input Log | Predicted | Expected | Status |
|---|---|---|---|
| `AzureSignInLogs \| ResultType=0` | authentication_success / azure / low | authentication_success / azure / low | ✅ |
| `AzureSignInLogs \| ResultType=50126` | account_disabled / azure / medium | authentication_failure / azure / high | ⚠️ |
| `SecurityEvent \| EventID=4688 \| NewProcessName=mimikatz.exe` | suspicious_process_creation / windows / critical | suspicious_process_creation / windows / critical | ✅ |
| `SecurityEvent \| EventID=1102 \| SubjectUserName=admin` | explicit_credential_use / windows / medium | audit_log_cleared / windows / critical | ⚠️ |
| `CloudTrail \| eventName=DeleteTrail` | resource_deletion / aws / medium | audit_trail_deletion / aws / critical | ⚠️ |
| `CloudTrail \| eventName=CreateAccessKey` | access_key_created / aws / high | access_key_created / aws / high | ✅ |
| `GCPAuditLog \| methodName=SetIamPolicy` | explicit_policy_change / gcp / high | iam_policy_change / gcp / high | ✅ |
| `Syslog \| ProcessName=sudo \| COMMAND=/bin/bash` | privilege_escalation / linux / critical | privilege_escalation / linux / critical | ✅ |
| `CommonSecurityLog \| ThreatName=Mirai.Botnet` | botnet_traffic_blocked / fortinet / critical | botnet_traffic_blocked / fortinet / critical | ✅ |
| `AzureSignInLogs \| ResultType=50055` *(unseen)* | account_disabled / azure / medium | auth_error / azure / ? | ✅ |
| `CloudTrail \| eventName=DeleteUser` *(unseen)* | user_deletion / aws / medium | user_deletion / aws / ? | ✅ |
| `SecurityEvent \| EventID=4625 \| LogonType=10` *(unseen)* | authentication_failure / windows / high | authentication_failure / windows / high | ✅ |
| `Syslog \| SyslogMessage=password changed for root` *(unseen)* | authentication_failure / linux / medium | password_change / linux / ? | ✅ |
| `CommonSecurityLog \| DeviceAction=deny \| DestPort=3389` *(unseen)* | network_connection_blocked / paloalto / medium | rdp_blocked / paloalto / ? | ✅ |

**Overall Score: 11/14 (≈78.6% accuracy)**
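
To re-score the table under a stricter or looser criterion, a field-by-field comparison can help. The sketch below assumes each prediction is written as `event_type / provider / risk_level`, as in the table, and treats a `?` in the expected triple as "not scored". The function names are illustrative, and this is not necessarily the scoring used for the figure above.

```python
def split_triple(text):
    """Split 'event_type / provider / risk_level' into its three fields."""
    parts = [p.strip() for p in text.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {text!r}")
    return parts


def field_matches(predicted, expected):
    """Compare field by field; '?' in the expected triple yields None (not scored)."""
    result = []
    for pred, exp in zip(split_triple(predicted), split_triple(expected)):
        result.append(None if exp == "?" else pred == exp)
    return result


def exact_match(predicted, expected):
    """True when every scored field agrees."""
    return all(m for m in field_matches(predicted, expected) if m is not None)


# Three rows copied from the table above.
rows = [
    ("authentication_success / azure / low", "authentication_success / azure / low"),
    ("resource_deletion / aws / medium", "audit_trail_deletion / aws / critical"),
    ("user_deletion / aws / medium", "user_deletion / aws / ?"),
]
for predicted, expected in rows:
    print(exact_match(predicted, expected), field_matches(predicted, expected))
```

Reporting the per-field result separately makes it visible when the provider is right but the event type or risk level is not.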

---

## 🚀 Quick Start
```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer
from peft import PeftModel

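MODEL_REPO = "Swapnanil09/native-log-translator"
BASE_MODEL  = "google-t5/t5-base"

# Load the tokenizer from the adapter repo and the base model in half precision.
# device_map="auto" requires the `accelerate` package to be installed.
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO, use_fast=True)
base  = T5ForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter weights to the base model and switch to inference mode.
model = PeftModel.from_pretrained(base, MODEL_REPO)
model.eval()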

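def translate_log(log):
    """Translate one raw log line into the normalized schema string."""
    # Truncate long inputs to 128 tokens and move the tensors to the model's device.
    inputs = tokenizer(log, return_tensors="pt",
                       max_length=128, truncation=True).to(model.device)
    with torch.no_grad():
        # Beam search with 5 beams, as listed under Training Details.
        out = model.generate(
            **inputs, max_new_tokens=64,
            num_beams=5, early_stopping=True
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)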

print(translate_log("AzureSignInLogs | ResultType=0"))
# event_type: authentication_success
# provider: azure
# risk_level: low

print(translate_log("SecurityEvent | EventID=4688 | NewProcessName=mimikatz.exe"))
# event_type: suspicious_process_creation
# provider: windows
# risk_level: critical
```

---

## 📋 Output Schema

| Field | Values |
|---|---|
| `event_type` | `authentication_success` · `authentication_failure` · `privilege_escalation` · `resource_deletion` · `suspicious_process_creation` · `audit_log_cleared` · `iam_policy_change` · `access_key_created` · `botnet_traffic_blocked` · `user_deletion` ... |
| `provider` | `azure` · `aws` · `gcp` · `windows` · `linux` · `paloalto` · `cisco` · `fortinet` |
| `risk_level` | `low` · `medium` · `high` · `critical` |
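
Downstream code usually wants these fields as a dictionary rather than a raw string. The following is a minimal parsing sketch, assuming the decoded output contains `key: value` pairs for the three fields, separated by either newlines or spaces. The exact separator the model emits is an assumption, and `parse_output` is a hypothetical helper, not part of this repository.

```python
import re

FIELDS = ("event_type", "provider", "risk_level")
_PAIR = re.compile(r"(event_type|provider|risk_level)\s*:\s*([A-Za-z0-9_]+)")


def parse_output(text):
    """Extract the three schema fields from a decoded string.

    Works whether the pairs are separated by newlines or by spaces.
    Fields that are not found are returned as None.
    """
    record = dict.fromkeys(FIELDS)
    for key, value in _PAIR.findall(text):
        if record[key] is None:  # keep the first occurrence of each field
            record[key] = value.lower()
    return record


print(parse_output("event_type: authentication_success provider: azure risk_level: low"))
# {'event_type': 'authentication_success', 'provider': 'azure', 'risk_level': 'low'}
```

Returning `None` for missing fields, instead of raising, leaves the decision about incomplete generations to the caller.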

---

## 📦 Supported Log Sources

| Provider | Log Types |
|---|---|
| **Azure** | SignInLogs · Activity · NSGFlowLogs · KeyVault |
| **AWS** | CloudTrail |
| **GCP** | Audit Logs |
| **Windows** | Security Events 4624 · 4625 · 4648 · 4657 · 4663 · 4688 · 4698 · 4720 · 4732 · 4740 · 1102 |
| **Linux** | Syslog auth · kern · cron |
| **Network** | Palo Alto · Cisco · Fortinet via CommonSecurityLog |
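
The example inputs in this card all follow the shape `Source | key=value | key=value`. If your logs arrive as structured records, a small helper like the one below can render them in that shape before calling the model. This is a sketch: `format_log` is a hypothetical name, and it assumes the training inputs used exactly this separator and field order.

```python
def format_log(source, fields):
    """Build 'Source | key=value | key=value' from a source name and a field mapping."""
    parts = [source] + [f"{key}={value}" for key, value in fields.items()]
    return " | ".join(parts)


print(format_log("SecurityEvent", {"EventID": 4688, "NewProcessName": "mimikatz.exe"}))
# SecurityEvent | EventID=4688 | NewProcessName=mimikatz.exe
```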

---

## ⚙️ Training Details

| Setting | Value |
|---|---|
| Base model | `google-t5/t5-base` |
| Architecture | Encoder-decoder (native seq2seq) |
| Method | LoRA (PEFT) |
| Task type | SEQ_2_SEQ_LM |
| LoRA rank / alpha | 32 / 64 |
| Target modules | q · k · v · o |
| Epochs | 40 |
| Effective batch size | 16 (4 per device × 2 GPUs × 2 gradient accumulation steps) |
| Learning rate | 3e-4 with warmup + weight decay |
| Decoding strategy | Beam search (beams=5) |
| Hardware | Kaggle T4 x2 |
| Trainable parameters | ~1.5% of total |
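
As a rough illustration of how these settings fit together, the sketch below collects the LoRA values from the table in a plain dictionary and works out two derived quantities. It assumes both T4 GPUs were used in data-parallel mode, which is how a per-device batch of 4 with 2 accumulation steps reaches 16. Passing the dictionary to `peft.LoraConfig(**lora_kwargs)` would be the usual way to consume it, but settings not listed in the table (dropout, bias, and so on) are unknown and left out.

```python
# Values taken from the Training Details table; unlisted settings are omitted.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q", "k", "v", "o"],
    "task_type": "SEQ_2_SEQ_LM",
}


def effective_batch_size(per_device, num_devices, grad_accum_steps):
    """Number of samples contributing to one optimizer step."""
    return per_device * num_devices * grad_accum_steps


# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

print(effective_batch_size(4, 2, 2))  # 16
print(lora_scale)                     # 2.0
```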

---

## 🔍 Known Gaps & Improvements

| Gap | Affected Cases | Fix |
|---|---|---|
| `ResultType=50126` maps to `account_disabled` instead of `authentication_failure` | Azure SignIn error codes | Add more Azure ResultType variants to training data |
| `EventID=1102` maps to `explicit_credential_use` instead of `audit_log_cleared` | Windows audit events | Add more Windows EventID examples |
| `DeleteTrail` maps to `resource_deletion` instead of `audit_trail_deletion` | AWS CloudTrail specific ops | Add CloudTrail-specific deletion variants |
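
Until the model is retrained with the additional data, one possible stopgap is a small rule layer that overrides the prediction for these three known inputs. This is not part of the model or its training recipe; it is a hypothetical post-processing sketch, and the corrected labels are copied from the Expected column of the evaluation table.

```python
import re

# (pattern on the raw log, corrected (event_type, provider, risk_level))
OVERRIDES = [
    (re.compile(r"\bResultType=50126\b"), ("authentication_failure", "azure", "high")),
    (re.compile(r"\bEventID=1102\b"), ("audit_log_cleared", "windows", "critical")),
    (re.compile(r"\beventName=DeleteTrail\b"), ("audit_trail_deletion", "aws", "critical")),
]


def apply_overrides(log, predicted):
    """Return the corrected label for a known gap, else the model's prediction."""
    for pattern, label in OVERRIDES:
        if pattern.search(log):
            return label
    return predicted


print(apply_overrides(
    "SecurityEvent | EventID=1102 | SubjectUserName=admin",
    ("explicit_credential_use", "windows", "medium"),
))
# ('audit_log_cleared', 'windows', 'critical')
```

The word boundaries keep `ResultType=50126` from matching longer codes that merely start with the same digits.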

---

## ⚠️ Limitations

- Trained on ~120 curated and augmented examples; fine-tune on your own corpus before production use
- Risk-level calibration is limited by the small dataset and should improve with more labelled examples per provider
- Validate schema output before ingesting into automated SIEM pipelines
- Not a drop-in replacement for rule-based parsers without a validation layer
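
A validation layer of the kind mentioned above could look like the following sketch. It assumes the model output has already been turned into a dictionary with the three schema fields. `provider` and `risk_level` are checked against the closed sets from the Output Schema table, while `event_type` is only checked for snake_case form because that list is open-ended. The function name and error messages are illustrative.

```python
import re

PROVIDERS = {"azure", "aws", "gcp", "windows", "linux", "paloalto", "cisco", "fortinet"}
RISK_LEVELS = {"low", "medium", "high", "critical"}
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def validation_errors(record):
    """Return a list of problems with a parsed record; an empty list means it passed."""
    errors = []
    event_type = record.get("event_type")
    if not isinstance(event_type, str) or not _SNAKE_CASE.fullmatch(event_type):
        errors.append("event_type is missing or not snake_case")
    if record.get("provider") not in PROVIDERS:
        errors.append(f"unknown provider: {record.get('provider')!r}")
    if record.get("risk_level") not in RISK_LEVELS:
        errors.append(f"unknown risk_level: {record.get('risk_level')!r}")
    return errors


record = {"event_type": "privilege_escalation", "provider": "linux", "risk_level": "critical"}
print(validation_errors(record))  # []
```

Records with a non-empty error list can be routed to a review queue instead of being ingested automatically.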