Pascal-TriheadNet: Joint Detection & Segmentation

Single-stage unified perception model for Pascal VOC: object detection, semantic segmentation, and instance segmentation in one forward pass.

Pascal-TriheadNet is a multi-task learning model that jointly solves three computer vision tasks using a unified Vision Transformer backbone with three specialized task heads. Validated on Pascal VOC 2012, it achieves strong performance across all tasks while maintaining efficient inference.

🔗 View Full Code & Documentation on GitHub

🚀 Key Highlights

  • Detection mAP@50: 75.6%
  • Semantic mIoU: 87.3%
  • Instance Mask mAP@50: 65.7%
  • Architecture: One Backbone, One Neck, Three Heads (ViT + FPN)

📥 Model Checkpoints

Two versions of the model are provided:

| File | Description | Size |
|------|-------------|------|
| `checkpoint_epoch_50.pth` | Best-performing FP32 model | 826 MB |
| `checkpoint_epoch_50_quantized.pth` | Optimized INT8 quantized model | 136 MB |

Training context: the model was fine-tuned on an NVIDIA L4 GPU in Google Colab.
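
Both checkpoints load with standard PyTorch tooling. A minimal sketch, assuming the file is (or wraps) a plain `state_dict`; the `PascalTriheadNet` constructor, the `model_state_dict` key, and the dynamic-quantization route for the INT8 file are hypothetical, not confirmed by the repository:

```python
import torch
import torch.nn as nn

# Load on CPU; move to GPU after the weights are restored.
checkpoint = torch.load("checkpoint_epoch_50.pth", map_location="cpu")

# Training checkpoints often bundle optimizer state alongside the weights;
# fall back to treating the file as a bare state_dict otherwise.
state_dict = checkpoint.get("model_state_dict", checkpoint)

# model = PascalTriheadNet()          # hypothetical constructor
# model.load_state_dict(state_dict)
# model.eval()

# One common way to produce an INT8 variant (whether the released
# quantized checkpoint was built like this is an assumption):
# quantized = torch.ao.quantization.quantize_dynamic(
#     model, {nn.Linear}, dtype=torch.qint8
# )
```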

📊 Performance Metrics

Evaluated on the Pascal VOC 2012 validation set:

| Task | Metric | Score |
|------|--------|-------|
| Detection | mAP (0.5:0.95) | 46.7% |
| Detection | mAP@50 | 75.6% |
| Semantic | mIoU | 87.3% |
| Instance | Mask mAP (0.5:0.95) | 35.8% |
| Instance | Mask mAP@50 | 65.7% |

For detailed per-class analysis and ablation studies, please refer to the GitHub Repository.
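
The exact evaluation script is in the repository; the sketch below shows how such numbers are commonly computed with `torchmetrics` (the package choice, the class count of 21 = 20 VOC classes + background, and the dummy tensors are assumptions):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision
from torchmetrics.classification import MulticlassJaccardIndex

map_metric = MeanAveragePrecision()                   # COCO-style 0.5:0.95 IoU sweep
miou_metric = MulticlassJaccardIndex(num_classes=21)  # 20 VOC classes + background

# Dummy detection prediction/target, boxes in xyxy format.
preds = [{"boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[12.0, 12.0, 48.0, 48.0]]),
            "labels": torch.tensor([1])}]
map_metric.update(preds, targets)
print(map_metric.compute()["map_50"])                 # mAP@50

# Dummy semantic maps: per-pixel class indices.
print(miou_metric(torch.randint(0, 21, (1, 224, 224)),
                  torch.randint(0, 21, (1, 224, 224))))  # mIoU
```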

πŸ— Model Overview

The architecture utilizes a Vision Transformer (ViT-Base) backbone pretrained on ImageNet.

  1. Backbone: vit_base_patch16_224 with the last 6 blocks fine-tuned.
  2. Neck: A Simple Feature Pyramid (ViTDet-style) that creates multi-scale feature maps (P2-P5) from the single-scale ViT output.
  3. Heads:
    • Detection: FCOS-style anchor-free detector.
    • Semantic: Panoptic FPN-style segmentation head.
    • Instance: Mask R-CNN-style head using RoI Align.
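
A minimal, hypothetical skeleton of this layout, assuming the `timm` ViT-Base backbone. The `TriheadNet` name, the single-level neck (the full model builds P2-P5), and the one-convolution stand-in heads are all simplifications for illustration:

```python
import timm
import torch
import torch.nn as nn

class TriheadNet(nn.Module):
    def __init__(self, num_classes: int = 21):
        super().__init__()
        # Single-scale ViT backbone (set pretrained=True for ImageNet weights).
        self.backbone = timm.create_model("vit_base_patch16_224",
                                          pretrained=False, num_classes=0)
        # Freeze the first 6 of 12 transformer blocks; only the last 6 train.
        for blk in self.backbone.blocks[:6]:
            for p in blk.parameters():
                p.requires_grad = False
        # Stand-in for the ViTDet-style simple feature pyramid (one level only).
        self.neck = nn.Conv2d(self.backbone.embed_dim, 256, kernel_size=1)
        # Lightweight stand-ins for the three task heads.
        self.det_head = nn.Conv2d(256, num_classes + 4, 3, padding=1)  # cls + box
        self.sem_head = nn.Conv2d(256, num_classes, 3, padding=1)
        self.inst_head = nn.Conv2d(256, num_classes, 3, padding=1)

    def forward(self, x):
        tokens = self.backbone.forward_features(x)  # (B, 197, 768) at 224x224
        # Drop the CLS token and fold the patch tokens back into a 14x14 map.
        feat = tokens[:, 1:].transpose(1, 2).reshape(x.shape[0], -1, 14, 14)
        p = self.neck(feat)
        # One forward pass yields all three task outputs.
        return {"detection": self.det_head(p),
                "semantic": self.sem_head(p),
                "instance": self.inst_head(p)}

outputs = TriheadNet()(torch.randn(1, 3, 224, 224))
print({k: tuple(v.shape) for k, v in outputs.items()})
```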

⚙️ Training Configuration

  • Epochs: 50
  • Batch Size: 32
  • Optimizer: AdamW (Base LR: 2e-4)
  • Loss: Weighted sum of Focal Loss (Det), Cross-Entropy/Dice (Sem/Inst), and GIoU (Box).
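
A sketch of how the stated optimizer and weighted objective fit together; the per-task loss weights shown are placeholders, not the tuned values:

```python
import torch
import torch.nn as nn

# Illustrative per-task weights for the combined objective.
LOSS_WEIGHTS = {"det_focal": 1.0, "det_giou": 1.0, "sem_ce": 1.0, "inst_dice": 1.0}

def total_loss(per_task_losses):
    # Weighted sum of Focal (detection cls), GIoU (boxes), and
    # Cross-Entropy/Dice (semantic/instance masks).
    return sum(LOSS_WEIGHTS[k] * v for k, v in per_task_losses.items())

model = nn.Linear(8, 8)  # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # stated base LR
```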

Model Details

  • Developed by: Sivasubiramaniam Subbiah
  • Model type: Multi-task Vision Model
  • Language(s)/Framework: Python (PyTorch)
  • License: MIT
  • Fine-tuned from: Vision Transformer (ViT-Base)