+---
+library_name: onnxruntime
+tags:
+  - deepfake-detection
+  - onnx
+  - image-classification
+  - computer-vision
+  - convnext
+  - swin-transformer
+license: mit
+---
+# GenConViT ED (Encoder-Decoder) - ONNX
+ONNX conversion of the **Encoder-Decoder (ED)** network from [Deepfake Video Detection Using Generative Convolutional Vision Transformer](https://www.mdpi.com/2076-3417/15/12/6622) (GenConViT).
+Converted from the official PyTorch weights released by the authors at [erprogs/GenConViT](https://github.com/erprogs/GenConViT).
+## Model Description
+GenConViT is a hybrid architecture for deepfake video detection that combines:
+- A **CNN Encoder-Decoder** that learns to reconstruct the input face image
+- A **ConvNeXt-Tiny + Swin Transformer-Tiny** backbone (via hybrid patch embedding) that extracts features from both the reconstructed and original images
+- A classification head that concatenates both feature vectors and outputs a binary REAL/FAKE prediction
+The ED variant is one of two independent networks in the full GenConViT framework (the other being a VAE variant). It processes an input face image through two parallel paths using shared backbone weights, producing a 2-class logit output.
+### Architecture
+```
+Input (B, 3, 224, 224)
+  |
+  +---> Encoder (5x Conv2d+ReLU+MaxPool) ---> (B, 256, 7, 7)
+  |       |
+  |       v
+  |     Decoder (5x ConvTranspose2d+ReLU) ---> Reconstructed Image (B, 3, 224, 224)
+  |       |
+  |       v
+  |     ConvNeXt+Swin Backbone ---> 1000-dim features (from reconstruction)
+  |
+  +---> ConvNeXt+Swin Backbone ---> 1000-dim features (from original)
+          |
+          v
+        Concatenate ---> 2000-dim
+          |
+          v
+        FC(2000, 500) + GELU + FC(500, 2) ---> Output logits (B, 2)
+```
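The shapes in the diagram can be checked with a little arithmetic. The sketch below is not the model code; it only traces tensor shapes, and it assumes that each encoder stage halves the spatial size (a 2x2 max-pool with stride 2) and that each decoder stage doubles it, which is consistent with the `(B, 256, 7, 7)` and `(B, 3, 224, 224)` shapes shown above. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(batch: int = 1, size: int = 224) -> dict[str, tuple[int, ...]]:
    """Follow tensor shapes through the ED network as drawn in the diagram."""
    spatial = size
    for _ in range(5):      # five encoder stages; assumed to halve H and W each time
        spatial //= 2       # 224 -> 112 -> 56 -> 28 -> 14 -> 7
    encoded = (batch, 256, spatial, spatial)

    for _ in range(5):      # five decoder stages; assumed to double H and W each time
        spatial *= 2        # 7 -> 14 -> 28 -> 56 -> 112 -> 224
    reconstructed = (batch, 3, spatial, spatial)

    feat_dim = 1000         # backbone output per branch, from the diagram
    return {
        "encoded": encoded,
        "reconstructed": reconstructed,
        "fused": (batch, 2 * feat_dim),   # concatenation of both branches
        "hidden": (batch, 500),           # after FC(2000, 500) + GELU
        "logits": (batch, 2),             # after FC(500, 2)
    }


if __name__ == "__main__":
    for name, shape in trace_shapes(batch=4).items():
        print(f"{name:>13}: {shape}")
```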
+### Key Details
+| Property | Value |
+|---|---|
+| Input | RGB image, 224x224, ImageNet-normalized |
+| Output | 2 logits: `[fake_score, real_score]` |
+| Parameters | ~59.5M unique (FP32) |
+| ONNX Opset | 18 |
+| File Size | ~117 MB |
+| Dynamic Batch | Yes |
+| Backbone | ConvNeXt-Tiny |
+| Embedder | Swin Transformer-Tiny (as hybrid patch embedding) |
+## Conversion Fidelity
+Numerical comparison against the original PyTorch model across 100 random inputs:
+| Metric | Value |
+|---|---|
+| Max Absolute Error | 1.15e-05 |
+| Mean Absolute Error | 4.31e-06 |
+| Mean Relative Error | 0.015% |
+| Classification Agreement | 100% |
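+Outputs match the original PyTorch model to within FP32 rounding error (maximum absolute error of about 1e-5 on the logits), and the predicted class agreed on all 100 inputs.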
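For readers who want to repeat this kind of comparison, here is a minimal sketch of how such metrics could be computed from two arrays of logits. It is not the script used to produce the table: the function name `fidelity_report` is hypothetical, and the relative-error definition (absolute error divided by the magnitude of the reference value) is an assumption, since the card does not state which definition was used.

```python
import numpy as np


def fidelity_report(ref: np.ndarray, test: np.ndarray, eps: float = 1e-12) -> dict[str, float]:
    """Compare reference logits (e.g. PyTorch) with test logits (e.g. ONNX).

    Both arrays are expected to have shape (N, 2).
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    abs_err = np.abs(ref - test)
    # Assumed definition: error relative to the reference magnitude.
    rel_err = abs_err / (np.abs(ref) + eps)
    # Two outputs agree when they pick the same class.
    agree = np.argmax(ref, axis=1) == np.argmax(test, axis=1)
    return {
        "max_abs_error": float(abs_err.max()),
        "mean_abs_error": float(abs_err.mean()),
        "mean_rel_error_pct": float(100.0 * rel_err.mean()),
        "classification_agreement_pct": float(100.0 * agree.mean()),
    }


if __name__ == "__main__":
    # Synthetic stand-ins for real model outputs, for illustration only.
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(100, 2))
    converted = reference + rng.normal(scale=1e-6, size=(100, 2))
    for key, value in fidelity_report(reference, converted).items():
        print(f"{key}: {value:.3e}")
```

In a real check, `reference` would hold the PyTorch logits and `converted` the ONNX Runtime logits for the same inputs.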
+## Usage
+### Installation
+```bash
+pip install onnxruntime numpy pillow
+```
+### Inference on a Single Image
+```python
+import numpy as np
+import onnxruntime as ort
+from PIL import Image
+# ImageNet normalization constants
+MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
+STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
+def preprocess(image_path: str) -> np.ndarray:
+    """Load and preprocess a face image for GenConViT."""
+    img = Image.open(image_path).convert("RGB").resize((224, 224))
+    # HWC uint8 -> CHW float32 [0, 1] -> ImageNet normalized
+    arr = np.asarray(img, dtype=np.float32) / 255.0
+    arr = np.transpose(arr, (2, 0, 1))[np.newaxis]  # (1, 3, 224, 224)
+    return (arr - MEAN) / STD
+def predict(session: ort.InferenceSession, image_path: str) -> tuple[str, float]:
+    """Run prediction on a single face image. Returns (label, confidence)."""
+    input_tensor = preprocess(image_path)
+    logits = session.run(None, {"input": input_tensor})[0]  # (1, 2)
+    scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
+    mean_scores = scores.mean(axis=0)
+    pred_class = int(np.argmax(mean_scores))
+    label = "FAKE" if pred_class == 0 else "REAL"
+    confidence = float(mean_scores[pred_class])
+    return label, confidence
+# Load model
+session = ort.InferenceSession("genconvit_ed_inference.onnx")
+# Predict
+label, confidence = predict(session, "face.jpg")
+print(f"{label} (confidence: {confidence:.4f})")
+```
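The example above hardcodes the input name `"input"`. If that name does not match your copy of the file, the session itself can be queried. The helper below is a small sketch (the name `describe_io` is hypothetical) built on `InferenceSession.get_inputs()` and `get_outputs()`, whose entries expose `name`, `shape`, and `type`.

```python
def describe_io(session) -> dict[str, list[tuple]]:
    """List (name, shape, type) for every input and output of an ONNX Runtime session."""
    return {
        "inputs": [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()],
        "outputs": [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()],
    }


# Example usage (requires the model file on disk):
# import onnxruntime as ort
# session = ort.InferenceSession("genconvit_ed_inference.onnx")
# io = describe_io(session)
# input_name = io["inputs"][0][0]   # use this instead of a hardcoded string
# logits = session.run(None, {input_name: input_tensor})[0]
```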
+### Inference on Video Frames
+```python
+import numpy as np
+import onnxruntime as ort
+from PIL import Image
+MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
+STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
+def preprocess_frames(face_images: list[np.ndarray]) -> np.ndarray:
+    """Preprocess a list of cropped face images (HWC uint8 numpy arrays)."""
+    batch = np.stack([
+        np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))
+        for img in face_images
+    ])  # (N, 3, 224, 224)
+    return (batch - MEAN) / STD
+def predict_video(session: ort.InferenceSession, face_frames: list[np.ndarray]) -> tuple[str, float]:
+    """
+    Predict on a list of face crops extracted from video frames.
+    Each face_frame should be a 224x224 RGB uint8 numpy array.
+    """
+    input_tensor = preprocess_frames(face_frames)
+    # Run inference frame by frame (or batched if memory allows)
+    all_scores = []
+    for i in range(len(input_tensor)):
+        logits = session.run(None, {"input": input_tensor[i:i+1]})[0]
+        scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
+        all_scores.append(scores[0])
+    all_scores = np.stack(all_scores)  # (N, 2)
+    mean_scores = all_scores.mean(axis=0)
+    pred_class = int(np.argmax(mean_scores))
+    label = "FAKE" if pred_class == 0 else "REAL"
+    confidence = float(mean_scores[pred_class])
+    return label, confidence
+# Example usage:
+# session = ort.InferenceSession("genconvit_ed_inference.onnx")
+# face_crops = [...]  # list of 224x224 RGB numpy arrays from face detection
+# label, confidence = predict_video(session, face_crops)
+```
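Because the model supports a dynamic batch dimension, the per-frame loop can also be run in chunks. The following is a sketch of that batched variant under the same scoring convention as above; `predict_batched`, the `run` callable, and the default `batch_size` of 16 are illustrative choices, not part of the original code.

```python
from typing import Callable

import numpy as np


def predict_batched(
    run: Callable[[np.ndarray], np.ndarray],
    input_tensor: np.ndarray,
    batch_size: int = 16,
) -> tuple[str, float]:
    """Score preprocessed frames of shape (N, 3, 224, 224) in chunks of batch_size.

    `run` maps a (b, 3, 224, 224) array to (b, 2) logits.
    """
    if len(input_tensor) == 0:
        raise ValueError("no face crops to classify")
    chunks = []
    for start in range(0, len(input_tensor), batch_size):
        logits = run(input_tensor[start:start + batch_size])  # (b, 2)
        chunks.append(1.0 / (1.0 + np.exp(-logits)))          # sigmoid, as above
    mean_scores = np.concatenate(chunks).mean(axis=0)          # average over all frames
    pred_class = int(np.argmax(mean_scores))
    label = "FAKE" if pred_class == 0 else "REAL"
    return label, float(mean_scores[pred_class])


# Example usage with ONNX Runtime:
# run = lambda x: session.run(None, {"input": x})[0]
# label, confidence = predict_batched(run, preprocess_frames(face_crops))
```

Passing the model call in as `run` keeps the aggregation logic independent of ONNX Runtime, so it can be exercised with a stand-in function.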
+## Preprocessing Requirements
+The model expects **cropped face images**, not raw frames. You must run face detection before inference:
+1. Extract frames from video
+2. Detect and crop faces (e.g., using `face_recognition`, `dlib`, `mediapipe`, or any face detector)
+3. Resize each face crop to **224x224** RGB
+4. Scale pixel values to `[0, 1]`, normalize with ImageNet stats (`mean=[0.485, 0.456, 0.406]`, `std=[0.229, 0.224, 0.225]`), and arrange the result as a float32 `(N, 3, 224, 224)` tensor
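Step 2 depends on the detector you choose, but cutting the crop out of the frame usually looks similar. The sketch below assumes the detector returns a pixel box as `(left, top, right, bottom)`, which is not true of every library (some use a different order), and it pads the box by a margin. The 20% default margin and the name `crop_face` are illustrative and not taken from the original pipeline.

```python
import numpy as np


def crop_face(
    frame: np.ndarray,
    box: tuple[int, int, int, int],
    margin: float = 0.2,
) -> np.ndarray:
    """Cut a margin-padded face region out of an HWC frame.

    box is (left, top, right, bottom) in pixels (assumed convention).
    """
    height, width = frame.shape[:2]
    left, top, right, bottom = box
    pad_x = int(round((right - left) * margin))
    pad_y = int(round((bottom - top) * margin))
    # Clip the padded box to the frame so slicing never wraps around.
    x0, y0 = max(left - pad_x, 0), max(top - pad_y, 0)
    x1, y1 = min(right + pad_x, width), min(bottom + pad_y, height)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box lies outside the frame")
    return frame[y0:y1, x0:x1]


if __name__ == "__main__":
    frame = np.zeros((100, 200, 3), dtype=np.uint8)
    crop = crop_face(frame, (50, 20, 150, 80))
    print(crop.shape)
```

The returned crop can then be resized with `Image.fromarray(crop).resize((224, 224))` before normalization.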
+## Output Interpretation
+The model outputs 2 raw logits: `[score_0, score_1]`.
+After applying sigmoid and averaging across frames:
+- `argmax == 0` (score_0 > score_1) -> **FAKE**
+- `argmax == 1` (score_1 > score_0) -> **REAL**
+This follows the original GenConViT convention, in which class 0 = FAKE and class 1 = REAL. The original code maps the predicted index to a label through `prediction ^ 1`; the mapping above already accounts for that flip.
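As a worked illustration of the mapping, the sketch below turns raw logits into a label and checks that indexing directly by `argmax` gives the same answer as an XOR flip. The lookup order `("REAL", "FAKE")` used for the XOR path is an assumption about how the original code arranges its labels; only the `prediction ^ 1` expression is stated in this card. `interpret` and `LABELS` are hypothetical names.

```python
import numpy as np

LABELS = ("FAKE", "REAL")           # index 0 = FAKE, index 1 = REAL, as stated above
XOR_STYLE_LABELS = ("REAL", "FAKE")  # assumed lookup order used together with `pred ^ 1`


def interpret(logits: np.ndarray) -> tuple[str, float]:
    """Map logits of shape (2,) or (N, 2) to (label, confidence)."""
    logits = np.atleast_2d(np.asarray(logits, dtype=np.float64))
    scores = 1.0 / (1.0 + np.exp(-logits))   # element-wise sigmoid
    mean_scores = scores.mean(axis=0)        # average over frames
    pred = int(np.argmax(mean_scores))
    # Both lookups give the same label for either class index.
    assert XOR_STYLE_LABELS[pred ^ 1] == LABELS[pred]
    return LABELS[pred], float(mean_scores[pred])


if __name__ == "__main__":
    # Two frames; index 0 has the larger mean score, so the result is FAKE.
    print(interpret(np.array([[1.2, -0.8], [0.3, 0.1]])))
```

Because sigmoid is applied to each logit independently, the two scores need not sum to 1, so the confidence value is best read as a relative score and not as a calibrated probability.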
+## Training Data and Performance
+The original model was trained and evaluated on:
+| Dataset | Accuracy | AUC |
+|---|---|---|
+| DFDC | - | - |
+| FaceForensics++ | - | - |
+| Celeb-DF v2 | - | - |
+| DeepfakeTIMIT | - | - |
+The paper reports an average of **95.8% accuracy** and **99.3% AUC** across these datasets for the full GenConViT ensemble (ED and VAE networks combined). Results for the ED network alone may differ.
+## Citation
+```bibtex
+@article{wodajo2023genconvit,
+    title={Deepfake Video Detection Using Generative Convolutional Vision Transformer},
+    author={Wodajo, Deressa and Mareen, Hannes and Lambert, Peter and Atnafu, Solomon and Akhtar, Zahid and Van Wallendael, Glenn},
+    journal={Applied Sciences},
+    volume={15},
+    number={12},
+    pages={6622},
+    year={2025},
+    publisher={MDPI},
+    doi={10.3390/app15126622}
+}
+```
+## Acknowledgements
+- Original model and training by [Deressa Wodajo et al.](https://github.com/erprogs/GenConViT)
+- Conversion to ONNX by [Pranjal Pravesh](https://github.com/pranjal-pravesh)