# Pascal-TriheadNet: Joint Detection & Segmentation
Single-stage unified perception model for Pascal VOC: Detection, Semantic, and Instance Segmentation in one forward pass.
Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.
View Full Code & Documentation on GitHub

## Key Highlights
- Detection mAP@50: 75.6%
- Semantic mIoU: 87.3%
- Instance Mask mAP@50: 65.7%
- Architecture: One Backbone, One Neck, Three Heads (ViT + FPN)
## Model Checkpoints
Two versions of the model are provided:
| File | Description | Size |
|---|---|---|
| `checkpoint_epoch_50.pth` | Best-performing FP32 model | 826 MB |
| `checkpoint_epoch_50_quantized.pth` | Optimized INT8 quantized model | 136 MB |
Training context: the model was fine-tuned on an NVIDIA L4 GPU in Google Colab.
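As a rough sketch of how a checkpoint like this is typically loaded and inspected in PyTorch; the stored key names and the model class name below are assumptions, not taken from the repository:

```python
import torch

# Load the FP32 checkpoint on CPU and inspect what it stores. The key name
# "model_state_dict" follows a common PyTorch convention and is an
# assumption here, not confirmed by the repository.
ckpt = torch.load("checkpoint_epoch_50.pth", map_location="cpu")
print(list(ckpt.keys()))

# Restoring the weights requires the model class from the repository:
# model = PascalTriheadNet()                      # hypothetical class name
# model.load_state_dict(ckpt["model_state_dict"])
# model.eval()
```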
## Performance Metrics
Evaluated on the Pascal VOC 2012 validation set:
| Task | Metric | Score |
|---|---|---|
| Detection | mAP (0.5:0.95) | 46.7% |
| Detection | mAP@50 | 75.6% |
| Semantic | mIoU | 87.3% |
| Instance | Mask mAP (0.5:0.95) | 35.8% |
| Instance | Mask mAP@50 | 65.7% |
For detailed per-class analysis and ablation studies, please refer to the GitHub repository.
## Model Overview
The architecture utilizes a Vision Transformer (ViT-Base) backbone pretrained on ImageNet.
- Backbone: `vit_base_patch16_224` with the last 6 blocks fine-tuned.
- Neck: A Simple Feature Pyramid (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output (see the sketch after this list).
- Heads:
- Detection: FCOS-style anchor-free detector.
- Semantic: Panoptic FPN-style segmentation head.
- Instance: Mask R-CNN-style head using RoI Align.
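For illustration, here is a minimal PyTorch sketch of a ViTDet-style Simple Feature Pyramid, assuming a stride-16 ViT feature map and 256 output channels. The channel widths and layer choices are illustrative assumptions, not the repository's verified implementation:

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Sketch of a ViTDet-style Simple Feature Pyramid.

    Builds P2-P5 from the single stride-16 feature map of a ViT backbone.
    Layer choices here are illustrative, not the repository's code.
    """

    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        # P2 (stride 4): upsample 4x with two stride-2 transposed convolutions.
        self.p2 = nn.Sequential(
            nn.ConvTranspose2d(in_dim, in_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(in_dim // 2, out_dim, kernel_size=2, stride=2),
        )
        # P3 (stride 8): upsample 2x.
        self.p3 = nn.ConvTranspose2d(in_dim, out_dim, kernel_size=2, stride=2)
        # P4 (stride 16): keep the ViT resolution, project channels only.
        self.p4 = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        # P5 (stride 32): downsample 2x.
        self.p5 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(in_dim, out_dim, kernel_size=1))

    def forward(self, x: torch.Tensor) -> dict:
        # x: (B, in_dim, H/16, W/16), i.e. ViT patch tokens reshaped to a 2-D grid.
        return {"p2": self.p2(x), "p3": self.p3(x), "p4": self.p4(x), "p5": self.p5(x)}


# A 224x224 input with patch size 16 yields a 14x14 token grid.
feats = SimpleFeaturePyramid()(torch.randn(1, 768, 14, 14))
print({name: tuple(f.shape) for name, f in feats.items()})
```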
## Training Configuration
- Epochs: 50
- Batch Size: 32
- Optimizer: AdamW (Base LR: 2e-4)
- Loss: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).
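A minimal sketch of how such a weighted multi-task objective can be combined, with placeholder weights since this card does not state the actual values:

```python
import torch

def total_loss(losses: dict, weights: dict = None) -> torch.Tensor:
    """Weighted sum of the per-head losses listed above.

    `losses` maps loss names to scalar tensors produced by the three heads;
    the default weights are placeholders, as the actual weighting is not
    stated in this card.
    """
    weights = weights or {"focal": 1.0, "ce_dice": 1.0, "giou": 1.0}
    return sum(weights[name] * losses[name] for name in weights)

# Optimizer as configured above: AdamW with a base learning rate of 2e-4.
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```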
## Model Details
- Developed by: Sivasubiramaniam Subbiah
- Model type: Multi-task Vision Model
- Languages/Frameworks: Python, PyTorch
- License: MIT
- Finetuned from: `vit_base_patch16_224` (ViT-Base, pretrained on ImageNet)