CLIP_aievals: AI-Generated Image Detector
This model is a CLIP-based classifier fine-tuned to detect AI-generated images across a wide range of generative models. It is trained on a mixture of real-image datasets (FFHQ, COCO, ImageNet, AFHQ, etc.) and synthetic images produced by diffusion models, GANs, and hybrid architectures.
Overview
CLIP_aievals is designed for robust AI-vs-real detection, combining a CLIP Vision Transformer backbone with a lightweight classification head. It is intended to generalize to unseen generative sources and to run in large-scale evaluation pipelines.
This repository contains the model weights (clip_vith14_argus.pt) and supporting configuration files used for inference.
Model Architecture
Backbone
- CLIP ViT-H/14 vision encoder
- Pretrained on LAION-2B
- Frozen or partially unfrozen depending on training configuration
Classifier Head
Two-layer MLP:
- Input: CLIP image embedding (1024-d)
- Hidden layer: 512 units with GELU activation
- Output layer: 1 unit with a sigmoid, producing the probability that the image is AI-generated
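The head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: the class name `DetectorHead`, the `predict_proba` helper, and the placement of dropout after the activation are assumptions, and the 0.1 dropout rate is taken from the regularization settings listed in this card. Random tensors stand in for real CLIP embeddings.

```python
import torch
from torch import nn


class DetectorHead(nn.Module):
    """Two-layer MLP over a CLIP image embedding (hypothetical name)."""

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),  # assumption: dropout sits after the activation
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # Return raw logits of shape (batch,). Keeping logits separate from the
        # sigmoid suits BCE-with-logits training and post-hoc calibration.
        return self.net(embedding).squeeze(-1)

    @torch.no_grad()
    def predict_proba(self, embedding: torch.Tensor) -> torch.Tensor:
        # Probability that each image is AI-generated (label 1).
        return torch.sigmoid(self.forward(embedding))


head = DetectorHead().eval()  # eval() disables dropout
probs = head.predict_proba(torch.randn(4, 1024))  # random stand-in embeddings
print(probs.shape)
```

Returning logits from `forward` and applying the sigmoid only at prediction time is a design choice of this sketch; it keeps the logits available for temperature calibration.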
Regularization and Calibration
- Dropout: 0.1
- Weight decay: 1e-4
- Temperature calibration performed post-hoc using validation logits
- Optional threshold tuning based on evaluation metrics or unknown-source analysis
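As an illustration of the last two items, the sketch below fits a single temperature on validation logits and then picks a decision threshold for a target false-positive rate. The card does not say how either step is implemented, so everything here is an assumption: the function names are hypothetical, the temperature is found by a simple grid search over the validation negative log-likelihood, and the toy logits and probabilities are made up for the example.

```python
import math


def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T) against 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # Stable softplus(s) = log(1 + exp(s)); BCE with logits = softplus(s) - y * s
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))
        total += softplus - y * s
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, steps=400):
    """Pick the temperature with the lowest validation NLL from a log-spaced grid."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda t: nll(logits, labels, t))


def threshold_for_fpr(real_probs, target_fpr):
    """Threshold t such that flagging prob > t keeps FPR on real images <= target."""
    ranked = sorted(real_probs, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we can tolerate
    if allowed >= len(ranked):
        return 0.0
    return ranked[allowed]


# Toy validation set: every logit is confident (|z| = 4) but only 6 of 8 are right,
# so the fitted temperature should come out above 1 (softening the probabilities).
val_logits = [4.0, -4.0, 4.0, -4.0, 4.0, -4.0, 4.0, -4.0]
val_labels = [1, 0, 1, 0, 0, 1, 1, 0]
temperature = fit_temperature(val_logits, val_labels)

# Toy calibrated scores for real images only.
real_probs = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.95]
threshold = threshold_for_fpr(real_probs, target_fpr=0.2)
print(temperature, threshold)
```

A calibrated probability is then `sigmoid(logit / temperature)`, and an image is flagged when that probability exceeds `threshold`. A gradient-based fit of the temperature would work equally well; the grid search just keeps the sketch dependency-free.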
Training Objective
- Binary cross-entropy
- Oversampling and class-balancing for multi-source synthetic datasets
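One common way to realize oversampling and class balancing across many sources is weighted sampling, sketched below. The exact scheme used for this model is not documented here, so treat this as an assumption: the function `balanced_weights` is hypothetical, and it gives each class half of the sampling mass, split evenly across the sources within that class. The source names come from the dataset list in this card; the counts are toy values.

```python
import random
from collections import Counter


def balanced_weights(samples):
    """samples: list of (source, label) pairs, with label 0 = real and 1 = fake.

    Each class receives an equal share of the total sampling mass, and that
    share is divided evenly across the sources belonging to the class.
    """
    per_source = Counter(samples)  # images per (source, label)
    sources_per_class = Counter(label for (_, label) in per_source)
    n_classes = len(sources_per_class)
    return [
        1.0 / (n_classes * sources_per_class[label] * per_source[(source, label)])
        for source, label in samples
    ]


# Toy, deliberately imbalanced pool.
samples = (
    [("FFHQ", 0)] * 6
    + [("COCO", 0)] * 2
    + [("StyleGAN3", 1)] * 3
    + [("Glide", 1)] * 1
)
weights = balanced_weights(samples)

# Draw one oversampled mini-batch; small sources such as Glide are drawn more
# often per image than large ones.
rng = random.Random(0)
batch = rng.choices(samples, weights=weights, k=8)
print(Counter(batch))
```

In a PyTorch training loop the same weights could be passed to `torch.utils.data.WeightedRandomSampler`.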
Datasets
The training pipeline uses a mixture of curated datasets:
Real Data
- FFHQ (70k)
- COCO (160k)
- ImageNet (90k+)
- AFHQ v1/v2 (cats, dogs, wildlife)
- DIV2K
- OpenImages
Fake Data
- Stable Diffusion (v1.x, v2.x)
- Latent Diffusion Models
- StyleGAN3
- CIPS
- BigGAN
- GANformer
- CycleGAN (horse2zebra, monet2photo)
- DDPM and DDGAN
- Face Synthetics
- Glide
- Generative Inpainting (partial and full)
Labels are binary: 0 = real, 1 = fake.
Performance Summary
Evaluated on 850k+ mixed-source images:
- ROC-AUC: 0.764
- PR-AUC (AI class): 0.612
- Global FPR (real images): 0.0073
- Accuracy: 0.693
- Precision (AI): 0.853
- Recall (AI): 0.086
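Performance is dataset-dependent: the model is highly confident on many synthetic sources but has lower recall on advanced diffusion models with strong photorealism. At the reported operating point the detector is conservative: it flags fewer than 1% of real images (FPR 0.0073) but detects only about 9% of AI-generated ones (recall 0.086).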
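The reported precision, recall, and FPR together determine the class balance of the evaluation set, which gives a quick consistency check on the accuracy figure. The arithmetic below uses only the numbers listed above; the implied AI fraction and the F1 score are derived values, not figures reported for this model.

```python
precision, recall, fpr = 0.853, 0.086, 0.0073  # values from the summary above

# precision = TP / (TP + FP), with TP = recall * P and FP = fpr * N,
# so the AI-to-real ratio P / N follows from the three numbers.
ratio = precision * fpr / (recall * (1.0 - precision))
ai_fraction = ratio / (1.0 + ratio)

# accuracy = (TP + TN) / (P + N)
accuracy = ai_fraction * recall + (1.0 - ai_fraction) * (1.0 - fpr)

# Harmonic mean of precision and recall for the AI class.
f1 = 2 * precision * recall / (precision + recall)

print(f"implied AI fraction: {ai_fraction:.3f}")
print(f"implied accuracy:    {accuracy:.3f}")
print(f"F1 (AI class):       {f1:.3f}")
```

The implied accuracy agrees with the reported 0.693, with roughly one third of the evaluation images being AI-generated. The low F1 reflects the low recall at this operating point.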
Intended Use
Primary
- Detect whether an image is AI-generated
- Large-scale offline evaluation of generative models
- Data filtering for dataset curation
- Quality and authenticity control in multimedia pipelines
Secondary
- Research on generative model detection
- Cross-model robustness evaluation
Not Intended For
- Legal or forensic verification
- High-stakes decision systems
- Per-pixel or localized artifact detection
Limitations
Lower recall on highly realistic diffusion models.
The model can produce false positives on:
- Overprocessed images
- Heavily JPEG-compressed images
- Images with artistic filters
Not calibrated for forensic authenticity analysis.
How to Use
In Python
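    from PIL import Image
    import torch

    from src.model import AIImageDetector

    # Build the detector; fall back to CPU when no GPU is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AIImageDetector(
        clip_model_name="ViT-H-14",
        device=device,
        dropout=0.1,
    )

    # Load the fine-tuned weights shipped with this repository.
    state_dict = torch.load("clip_vith14_argus.pt", map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()  # disable dropout for inference

    # Convert to RGB so grayscale, palette, and RGBA files are handled uniformly.
    img = Image.open("your_image.jpg").convert("RGB")
    prob = model.predict(img)  # returns probability of AI generation
    print(prob)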
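For the dataset-curation use case, scores for many files can be split into kept and flagged sets. The helper below is a hypothetical sketch: `split_by_score` is not part of this repository, the 0.5 threshold is a placeholder rather than a documented default, and a dictionary of made-up scores stands in for the model so the example runs without weights or image files.

```python
from typing import Callable, Iterable


def split_by_score(
    items: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # placeholder; tune on validation data
):
    """Partition items into (kept, flagged) lists by AI probability."""
    kept, flagged = [], []
    for item in items:
        prob = score_fn(item)
        (flagged if prob > threshold else kept).append((item, prob))
    return kept, flagged


# Stand-in scores so the sketch runs without the model or any images.
fake_scores = {"a.jpg": 0.02, "b.jpg": 0.91, "c.jpg": 0.47}
kept, flagged = split_by_score(fake_scores, fake_scores.get)
print([name for name, _ in flagged])  # prints ['b.jpg']
```

With the real model, `score_fn` could be something like `lambda path: float(model.predict(Image.open(path).convert("RGB")))`, assuming `predict` returns a single scalar probability per image.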