DINOv3 → YOLO11 Distilled OCR Detector
This repository contains a YOLO11-based OCR object detector distilled from a DINOv3 ViT-B/16 teacher using LightlyTrain.
The goal: produce a lightweight but high-recall text box detector suitable for OCR, ID scanning, document parsing, and multi-language text extraction.
Model Summary
- Teacher: dinov3/vitb16
- Student: YOLO11s (custom convolutional backbone)
- Method: LightlyTrain distillation (features-only MSE loss)
- Data: 1,200 unlabeled resume-like document crops + synthetic webpage/document images
- Use-case: OCR region detection (not recognition)
- Export Format: Ultralytics .pt
- File: exported_models/exported_last.pt
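To make "features-only MSE loss" concrete: the student is trained to match the teacher's feature values, with no labels involved. The snippet below is a minimal pure-Python sketch of that idea on toy vectors. It is not LightlyTrain's implementation; `feature_mse` and the sample values are hypothetical, and real training works on batched tensors whose shapes have to be made compatible first.

```python
from typing import Sequence


def feature_mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two flattened feature vectors of equal length."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same length")
    if len(student) == 0:
        raise ValueError("feature vectors must be non-empty")
    # Average the squared per-element differences.
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Toy 4-dimensional features (made-up values, for illustration only).
teacher_feats = [0.5, -1.0, 2.0, 0.0]
student_feats = [0.0, -1.0, 1.0, 0.0]

loss = feature_mse(student_feats, teacher_feats)
print(loss)  # 0.3125
```

The loss reaches zero only when the student reproduces the teacher's features exactly, which is why no annotated boxes are needed at this stage.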
Intended Use
This model is trained to detect text regions inside real-world documents:
- CVs / resumes
- ID cards
- Business documents
- Screenshots
- Webpage fragments
- PDF pages (converted to images)
It does not perform OCR itself; recognition should be done with a second-stage model (Tesseract, TrOCR, Nougat, PaddleOCR, VietOCR, etc.).
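Handing a detection to a second-stage recognizer usually means cropping the region out of the page first. The sketch below shows one way to turn a float box into a padded, clamped integer crop rectangle. The helper name `to_crop_region` and the 4-pixel default padding are assumptions for illustration, not part of this repository.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def to_crop_region(box: Box, img_w: int, img_h: int, pad: int = 4) -> Tuple[int, int, int, int]:
    """Expand a box by `pad` pixels and clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    # Round outward so no part of the detected text is cut off.
    left = max(0, math.floor(x1) - pad)
    top = max(0, math.floor(y1) - pad)
    right = min(img_w, math.ceil(x2) + pad)
    bottom = min(img_h, math.ceil(y2) + pad)
    return left, top, right, bottom


# Example: a box near the top-right corner of a 100x40 image.
region = to_crop_region((10.6, 2.2, 98.3, 30.0), img_w=100, img_h=40)
print(region)  # (6, 0, 100, 34)
```

The resulting tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed there directly before running the recognizer on the crop.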
Example Usage
Python (Ultralytics)
from ultralytics import YOLO
model = YOLO("exported_last.pt")
results = model("/content/example.jpg")
results[0].show() # visualize text boxes
Extract Bounding Boxes
boxes = results[0].boxes.xyxy.cpu().numpy()  # (N, 4) array of x1, y1, x2, y2
confs = results[0].boxes.conf.cpu().numpy()  # (N,) confidence scores
for xyxy, conf in zip(boxes, confs):
    print(xyxy, conf)
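Detections come back in no particular reading order, so before recognition it often helps to drop weak boxes and sort the rest top-to-bottom, left-to-right. The following is a rough sketch on plain tuples; `reading_order`, the 0.25 confidence cutoff, and the 10-pixel row tolerance are illustrative assumptions that would need tuning per document type.

```python
from typing import List, Tuple

Detection = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, conf)


def reading_order(dets: List[Detection], min_conf: float = 0.25,
                  row_tol: float = 10.0) -> List[Detection]:
    """Drop low-confidence boxes, then sort top-to-bottom and left-to-right."""
    kept = [d for d in dets if d[4] >= min_conf]
    # Bucket boxes into rows by their top edge so small vertical jitter
    # does not reorder words that sit on the same text line.
    return sorted(kept, key=lambda d: (d[1] // row_tol, d[0]))


dets = [
    (120.0, 12.0, 200.0, 30.0, 0.91),  # first row, right-hand box
    (10.0, 14.0, 100.0, 31.0, 0.88),   # first row, left-hand box
    (10.0, 52.0, 150.0, 70.0, 0.10),   # below the cutoff, dropped
    (15.0, 95.0, 90.0, 110.0, 0.75),   # lower row
]

ordered = reading_order(dets)
print([d[0] for d in ordered])  # [10.0, 120.0, 15.0]
```

Fixed-size row buckets are a simplification: two boxes on the same line can land in different buckets if their top edges straddle a bucket boundary, so skewed scans or mixed font sizes may need a smarter line-grouping step.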
Distillation
import lightly_train

lightly_train.train(
    out="dinov3_yolo11_distilled",          # output directory for checkpoints and logs
    data="/content/unlabeled_idl_images",   # folder of unlabeled training images
    model="yolo11s",                        # student
    method="distillation",
    method_args={
        "teacher": "dinov3/vitb16",
        "teacher_weights": "/content/dinov3_vitb16_pretrain.pth",
    },
    epochs=2,
    batch_size=4,
    precision="16-mixed",
)