pranjal-pravesh committed on
Commit b8479c1 · verified · 1 Parent(s): 301ff08

Update README.md

Files changed (1)
  1. README.md +222 -3
README.md CHANGED
@@ -1,3 +1,222 @@
- ---
- license: mit
- ---
+ ---
+ library_name: onnxruntime
+ tags:
+ - deepfake-detection
+ - onnx
+ - image-classification
+ - computer-vision
+ - convnext
+ - swin-transformer
+ license: mit
+ ---
+
+ # GenConViT ED (Encoder-Decoder) - ONNX
+
+ ONNX conversion of the **Encoder-Decoder (ED)** network from [GenConViT: Generative Convolutional Vision Transformer for Deepfake Video Detection](https://www.mdpi.com/2076-3417/15/12/6622).
+
+ Converted from the official PyTorch weights released by the authors at [erprogs/GenConViT](https://github.com/erprogs/GenConViT).
+
+ ## Model Description
+
+ GenConViT is a hybrid architecture for deepfake video detection that combines:
+
+ - A **CNN Encoder-Decoder** that learns to reconstruct the input face image
+ - A **ConvNeXt-Tiny + Swin Transformer-Tiny** backbone (via hybrid patch embedding) that extracts features from both the reconstructed and original images
+ - A classification head that concatenates both feature vectors and outputs a binary REAL/FAKE prediction
+
+ The ED variant is one of two independent networks in the full GenConViT framework (the other being a VAE variant). It processes an input face image through two parallel paths using shared backbone weights, producing a 2-class logit output.
+
+ ### Architecture
+
+ ```
+ Input (B, 3, 224, 224)
+   |
+   +---> Encoder (5x Conv2d+ReLU+MaxPool) ---> (B, 256, 7, 7)
+   |                                              |
+   |                                              v
+   |     Decoder (5x ConvTranspose2d+ReLU) ---> Reconstructed Image (B, 3, 224, 224)
+   |                                              |
+   |                                              v
+   |     ConvNeXt+Swin Backbone ---> 1000-dim features (from reconstruction)
+   |
+   +---> ConvNeXt+Swin Backbone ---> 1000-dim features (from original)
+   |
+   v
+ Concatenate ---> 2000-dim
+   |
+   v
+ FC(2000, 500) + GELU + FC(500, 2) ---> Output logits (B, 2)
+ ```
+
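The shapes in the diagram can be checked with simple arithmetic. The sketch below is illustrative only: it assumes each of the five max-pool stages halves the spatial size (2x2 window, stride 2), which is consistent with 224 shrinking to 7 but is not stated in the README. The function names are hypothetical.

```python
def encoder_spatial_size(size: int = 224, num_pools: int = 5) -> int:
    """Spatial side length after `num_pools` halving stages.

    Assumption: every MaxPool halves height and width (stride 2).
    """
    for _ in range(num_pools):
        size //= 2
    return size


def trace_shapes(batch: int = 1) -> dict[str, tuple[int, ...]]:
    """Tensor shapes along both paths, using the sizes from the diagram."""
    side = encoder_spatial_size()   # 224 / 2**5
    feat = 1000                     # backbone feature size per path
    return {
        "input": (batch, 3, 224, 224),
        "encoded": (batch, 256, side, side),
        "reconstructed": (batch, 3, 224, 224),
        "features_reconstruction": (batch, feat),
        "features_original": (batch, feat),
        "concatenated": (batch, 2 * feat),  # input to FC(2000, 500)
        "logits": (batch, 2),
    }


if __name__ == "__main__":
    for name, shape in trace_shapes(batch=4).items():
        print(f"{name:>24}: {shape}")
```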
+ ### Key Details
+
+ | Property | Value |
+ |---|---|
+ | Input | RGB image, 224x224, ImageNet-normalized |
+ | Output | 2 logits: `[real_score, fake_score]` |
+ | Parameters | ~59.5M unique (FP32) |
+ | ONNX Opset | 18 |
+ | File Size | ~117 MB |
+ | Dynamic Batch | Yes |
+ | Backbone | ConvNeXt-Tiny |
+ | Embedder | Swin Transformer-Tiny (as hybrid patch embedding) |
+
+ ## Conversion Fidelity
+
+ Numerical comparison against the original PyTorch model across 100 random inputs:
+
+ | Metric | Value |
+ |---|---|
+ | Max Absolute Error | 1.15e-05 |
+ | Mean Absolute Error | 4.31e-06 |
+ | Mean Relative Error | 0.015% |
+ | Classification Agreement | 100% |
+
+ Within floating-point tolerance, the ONNX model reproduces the original PyTorch outputs, with identical predicted classes on all 100 inputs.
+
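For readers who want to run a similar comparison, here is a minimal sketch of how such metrics could be computed from two arrays of logits. The README does not give the exact formulas, so the definitions here (element-wise errors, relative error against the reference magnitude, argmax agreement) are assumptions, and `fidelity_metrics` is a hypothetical helper.

```python
import numpy as np


def fidelity_metrics(ref: np.ndarray, test: np.ndarray, eps: float = 1e-12) -> dict[str, float]:
    """Compare reference logits `ref` with converted-model logits `test`.

    Both arrays are expected to have shape (N, 2). The metric definitions
    are assumptions, not taken from the README.
    """
    abs_err = np.abs(ref - test)
    rel_err = abs_err / (np.abs(ref) + eps)  # eps guards against division by zero
    agree = np.argmax(ref, axis=1) == np.argmax(test, axis=1)
    return {
        "max_abs_error": float(abs_err.max()),
        "mean_abs_error": float(abs_err.mean()),
        "mean_rel_error_pct": float(rel_err.mean() * 100.0),
        "classification_agreement_pct": float(agree.mean() * 100.0),
    }
```

In practice `ref` would hold the PyTorch outputs and `test` the ONNX Runtime outputs for the same inputs.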
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install onnxruntime numpy pillow
+ ```
+
+ ### Inference on a Single Image
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from PIL import Image
+
+ # ImageNet normalization constants
+ MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
+ STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
+
+ def preprocess(image_path: str) -> np.ndarray:
+     """Load and preprocess a face image for GenConViT."""
+     img = Image.open(image_path).convert("RGB").resize((224, 224))
+     # HWC uint8 -> CHW float32 [0, 1] -> ImageNet normalized
+     arr = np.asarray(img, dtype=np.float32) / 255.0
+     arr = np.transpose(arr, (2, 0, 1))[np.newaxis]  # (1, 3, 224, 224)
+     return (arr - MEAN) / STD
+
+ def predict(session: ort.InferenceSession, image_path: str) -> tuple[str, float]:
+     """Run prediction on a single face image. Returns (label, confidence)."""
+     input_tensor = preprocess(image_path)
+     logits = session.run(None, {"input": input_tensor})[0]  # (1, 2)
+
+     scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
+     mean_scores = scores.mean(axis=0)
+
+     pred_class = int(np.argmax(mean_scores))
+     label = "FAKE" if pred_class == 0 else "REAL"
+     confidence = float(mean_scores[pred_class])
+     return label, confidence
+
+ # Load model
+ session = ort.InferenceSession("genconvit_ed_inference.onnx")
+
+ # Predict
+ label, confidence = predict(session, "face.jpg")
+ print(f"{label} (confidence: {confidence:.4f})")
+ ```
+
+ ### Inference on Video Frames
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from PIL import Image
+
+ MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
+ STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
+
+ def preprocess_frames(face_images: list[np.ndarray]) -> np.ndarray:
+     """Preprocess a list of cropped face images (HWC uint8 numpy arrays)."""
+     batch = np.stack([
+         np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))
+         for img in face_images
+     ])  # (N, 3, 224, 224)
+     return (batch - MEAN) / STD
+
+ def predict_video(session: ort.InferenceSession, face_frames: list[np.ndarray]) -> tuple[str, float]:
+     """
+     Predict on a list of face crops extracted from video frames.
+     Each face_frame should be a 224x224 RGB uint8 numpy array.
+     """
+     input_tensor = preprocess_frames(face_frames)
+
+     # Run inference frame by frame (or batched if memory allows)
+     all_scores = []
+     for i in range(len(input_tensor)):
+         logits = session.run(None, {"input": input_tensor[i:i+1]})[0]
+         scores = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
+         all_scores.append(scores[0])
+
+     all_scores = np.stack(all_scores)  # (N, 2)
+     mean_scores = all_scores.mean(axis=0)
+
+     pred_class = int(np.argmax(mean_scores))
+     label = "FAKE" if pred_class == 0 else "REAL"
+     confidence = float(mean_scores[pred_class])
+     return label, confidence
+
+ # Example usage:
+ # session = ort.InferenceSession("genconvit_ed_inference.onnx")
+ # face_crops = [...]  # list of 224x224 RGB numpy arrays from face detection
+ # label, confidence = predict_video(session, face_crops)
+ ```
+
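The code comment "or batched if memory allows" suggests a middle ground between one frame per call and one call for the whole video. A minimal sketch of chunked inference follows. `run_in_chunks`, `run_fn`, and the default `chunk_size` are hypothetical names and values chosen for illustration; `run_fn` stands in for whatever callable executes the model on a batch.

```python
from typing import Callable

import numpy as np


def run_in_chunks(
    run_fn: Callable[[np.ndarray], np.ndarray],
    batch: np.ndarray,
    chunk_size: int = 16,
) -> np.ndarray:
    """Apply `run_fn` to `batch` in slices of at most `chunk_size` items.

    `run_fn` takes an array of shape (n, ...) and returns an array whose
    first dimension is n; the per-chunk outputs are concatenated in order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(batch) == 0:
        raise ValueError("batch is empty")
    outputs = [
        run_fn(batch[start:start + chunk_size])
        for start in range(0, len(batch), chunk_size)
    ]
    return np.concatenate(outputs, axis=0)
```

With the session from the examples in this README, `run_fn` could be `lambda x: session.run(None, {"input": x})[0]`, relying on the dynamic batch dimension listed under Key Details.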
+ ## Preprocessing Requirements
+
+ The model expects **cropped face images**, not raw frames. You must run face detection before inference:
+
+ 1. Extract frames from video
+ 2. Detect and crop faces (e.g., using `face_recognition`, `dlib`, `mediapipe`, or any face detector)
+ 3. Resize each face crop to **224x224** RGB
+ 4. Normalize with ImageNet stats: `mean=[0.485, 0.456, 0.406]`, `std=[0.229, 0.224, 0.225]`
+
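Step 2 leaves open how a detector's bounding box becomes a crop. One possible approach, sketched below, pads the box by a margin and clamps it to the frame before cropping. The margin value and the `(left, top, right, bottom)` box layout are assumptions for illustration; the README does not say which margin, if any, the original pipeline uses, and `expand_box` is a hypothetical helper.

```python
def expand_box(
    box: tuple[int, int, int, int],
    frame_width: int,
    frame_height: int,
    margin: float = 0.2,
) -> tuple[int, int, int, int]:
    """Pad a (left, top, right, bottom) face box and clamp it to the frame.

    `margin` is the fraction of the box width/height added on each side
    (an assumed value, not taken from the README).
    """
    left, top, right, bottom = box
    pad_w = round((right - left) * margin)
    pad_h = round((bottom - top) * margin)
    return (
        max(0, left - pad_w),
        max(0, top - pad_h),
        min(frame_width, right + pad_w),
        min(frame_height, bottom + pad_h),
    )
```

The returned tuple can be passed to Pillow's `Image.crop`, followed by `resize((224, 224))` as in step 3.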
+ ## Output Interpretation
+
+ The model outputs 2 raw logits: `[score_0, score_1]`.
+
+ After applying sigmoid and averaging across frames:
+ - `argmax == 0` (score_0 > score_1) -> **FAKE**
+ - `argmax == 1` (score_1 > score_0) -> **REAL**
+
+ This matches the original GenConViT convention, in which index 0 maps to FAKE and index 1 maps to REAL. The original code reaches that mapping by flipping the predicted index with `prediction ^ 1` before its label lookup; the logic above already accounts for the flip, so no further inversion is needed.
+
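The decision rule above can be isolated from the model call and tested on plain arrays. This is a minimal sketch; `decide` and `LABELS` are hypothetical names, and the sigmoid is written in its tanh form only to avoid overflow warnings for large-magnitude logits.

```python
import numpy as np

# Index-to-label mapping described above: 0 -> FAKE, 1 -> REAL
LABELS = ("FAKE", "REAL")


def decide(logits: np.ndarray) -> tuple[str, float]:
    """Turn per-frame logits of shape (N, 2) or (2,) into (label, confidence)."""
    logits = np.atleast_2d(np.asarray(logits, dtype=np.float64))
    # sigmoid(x) == 0.5 * (1 + tanh(x / 2)), numerically stable for any x
    scores = 0.5 * (1.0 + np.tanh(0.5 * logits))
    mean_scores = scores.mean(axis=0)  # average over frames
    idx = int(np.argmax(mean_scores))
    return LABELS[idx], float(mean_scores[idx])
```

Because each logit passes through its own sigmoid rather than a softmax, the two scores need not sum to 1, so the returned confidence is best read as a per-class score rather than a probability over both classes.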
+ ## Training Data and Performance
+
+ The original model was trained and evaluated on:
+
+ | Dataset | Accuracy | AUC |
+ |---|---|---|
+ | DFDC | - | - |
+ | FaceForensics++ | - | - |
+ | Celeb-DF v2 | - | - |
+ | DeepfakeTIMIT | - | - |
+
+ Average across datasets: **95.8% accuracy**, **99.3% AUC** (as reported in the paper for the full GenConViT ensemble). Individual ED network results may differ.
+
+ ## Citation
+
+ ```bibtex
+ @article{wodajo2023genconvit,
+   title={Deepfake Video Detection Using Generative Convolutional Vision Transformer},
+   author={Wodajo, Deressa and Mareen, Hannes and Lambert, Peter and Atnafu, Solomon and Akhtar, Zahid and Van Wallendael, Glenn},
+   journal={Applied Sciences},
+   volume={15},
+   number={12},
+   pages={6622},
+   year={2025},
+   publisher={MDPI},
+   doi={10.3390/app15126622}
+ }
+ ```
+
+ ## Acknowledgements
+
+ - Original model and training by [Deressa Wodajo et al.](https://github.com/erprogs/GenConViT)
+ - Conversion to ONNX by [Pranjal Pravesh](https://github.com/pranjal-pravesh)