Multi-Modal Deepfake Detection System
7-Model Heterogeneous Ensemble · PyTorch + TensorFlow
Overview
A heterogeneous ensemble combining one PyTorch model and six TensorFlow models behind a unified loader: Pinpoint (audio-visual transformer: ResNet18 visual extractor + Conv1D/GRU MFCC audio extractor + 8-head gated cross-attention + transformer encoder; weight 1.3) for lip-sync and temporal coherence; EfficientNet-B4 (1.2) for compression and frequency-domain artifacts; ResNet-50 v1 and v2 (1.0 each) for deep facial features and spatial patterns; VGG-16 v1 and v2 (0.9 each) for texture, edge, and gradient cues; and InceptionV3 (1.1) for multi-scale semantic consistency.

Configurable ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`) let callers trade latency for accuracy. A FastAPI inference service extracts 64 frames at 2 FPS and a 16 kHz audio track, runs the selected models, and combines their scores with weighted averaging plus an inter-model std-dev confidence label. The frontend renders per-frame scores, attention heatmaps, mel spectrograms, and a model-consensus view. A separate credits microservice and an Nginx-fronted website complete a three-container Docker Compose stack, deployable to AWS via CloudFormation with optional S3 history and Cognito OIDC auth.
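The weighted-average-plus-std-dev verdict can be sketched in a few lines. The per-model weights below come from the text; the confidence cut points, label names, and function name are illustrative assumptions, not the shipped implementation.

```python
import statistics

# Per-model weights as described in the overview.
MODEL_WEIGHTS = {
    "pinpoint": 1.3, "efficientnet_b4": 1.2, "inception_v3": 1.1,
    "resnet50_v1": 1.0, "resnet50_v2": 1.0,
    "vgg16_v1": 0.9, "vgg16_v2": 0.9,
}

# (max inter-model std-dev, label) -- hypothetical cut points for the
# 5-level confidence label; anything above the last cut is "very low".
CONFIDENCE_LEVELS = [
    (0.05, "very high"), (0.10, "high"), (0.20, "moderate"), (0.30, "low"),
]

def ensemble_verdict(scores):
    """Weighted average of per-model fake probabilities, plus a confidence
    label derived from how much the models disagree (std-dev)."""
    total_w = sum(MODEL_WEIGHTS[m] for m in scores)
    avg = sum(MODEL_WEIGHTS[m] * s for m, s in scores.items()) / total_w
    std = statistics.pstdev(scores.values())
    label = next((lbl for cut, lbl in CONFIDENCE_LEVELS if std <= cut),
                 "very low")
    return avg, label
```

Low disagreement yields a high-confidence label even at a borderline score, which is the point of reporting spread alongside the mean.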
The Problem
Single-modality deepfake detectors over-fit to the generator they were trained on and collapse the moment a new synthesis pipeline appears. Real misuse — political clips, fraud calls, fabricated evidence — almost always pairs manipulated video with re-timed or synthesized audio, so an image-only classifier is half-blind. The system also had to ship as something a non-ML reviewer could actually use: a calibrated verdict, the frames that drove it, and an audit trail.
The Approach
Seven specialized models run behind one ensemble loader. The PyTorch Pinpoint network encodes 64 video frames through a ResNet18 visual extractor, encodes 16 kHz audio as MFCCs through a Conv1D + GRU stack, and fuses the two streams via an 8-head gated cross-attention block (a sigmoid gate over an audio→video attention map) followed by a transformer encoder (3 layers, 256-dim, 8 heads). Six TensorFlow CNNs (EfficientNet-B4, ResNet-50 v1/v2, VGG-16 v1/v2, and InceptionV3) cover spatial, texture, gradient, and multi-scale artifacts at their native input resolutions (224 or 299 px). Per-model weights (Pinpoint 1.3, EfficientNet-B4 1.2, InceptionV3 1.1, ResNets 1.0, VGGs 0.9) drive a weighted average; the standard deviation across model scores maps to a 5-level confidence label. Configurable ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`) let callers trade latency for accuracy or fall back gracefully when audio is missing. The system splits into three Docker services: Detection Engine (FastAPI + PyTorch + TensorFlow), Website (Nginx + vanilla-JS Canvas visualizations), and Credits (FastAPI + SQLite). It deploys to AWS via CloudFormation with optional S3 history and Cognito OIDC.
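The gated cross-attention fusion can be sketched in PyTorch. The 256-dim / 8-head sizes follow the text; the class name, gate parameterization, and residual add are illustrative assumptions about one reasonable way to wire a sigmoid-gated audio→video attention block.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Video queries attend over audio keys/values; a learned sigmoid
    gate decides, per position and channel, how much of the attended
    audio context to admit into the visual stream."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim * 2, dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T_v, dim) frame embeddings; audio: (B, T_a, dim) MFCC embeddings
        attended, _ = self.attn(query=video, key=audio, value=audio)
        # Gate conditioned on both the visual token and its audio context.
        g = torch.sigmoid(self.gate(torch.cat([video, attended], dim=-1)))
        return video + g * attended  # gated residual fusion
```

The output keeps the video sequence length, so it can feed straight into the downstream transformer encoder.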
Results
72.3% ensemble accuracy on FaceForensics++ (c23), outperforming each single model and exceeding published single-model baselines on the same split (XceptionNet 65.3%, EfficientNet-B4 68.9%, Multi-Attentional 70.1%, ViT 69.5%). The Pinpoint audio-visual head meaningfully closes the gap on lip-sync attacks like Face2Face and Wav2Lip; the `visual_only` ensemble degrades gracefully on muted clips. End-to-end inference is ~35 s for a 60 s clip on CPU and ~10× faster on a 6 GB GPU. Published in IJSRNSC 2024 and operable end-to-end via Docker Compose or a one-click CloudFormation deploy.
Process & Timeline
- Phase 1: TensorFlow CNN bench. Fine-tuned EfficientNet-B4, ResNet-50 (v1/v2), VGG-16 (v1/v2), and InceptionV3 on FaceForensics++ c23 with framework-specific input pipelines and per-model focus areas (texture, edges, multi-scale, frequency).
- Phase 2: Pinpoint transformer (PyTorch). Built a ResNet18 visual extractor and an MFCC Conv1D/GRU audio extractor, fused through an 8-head gated cross-attention block and a 3-layer transformer encoder for lip-sync and audio-visual coherence.
- Phase 3: Heterogeneous ensemble loader. Wrote a single loader that mixes PyTorch and TensorFlow detectors, applies per-model weights, exposes named ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`), and emits a std-dev-based confidence label.
- Phase 4: Inference service & UI. FastAPI engine extracting 64 frames @ 2 FPS plus 16 kHz audio; vanilla-JS Canvas frontend rendering per-frame scores, heatmaps, mel spectrograms, and model-consensus views.
- Phase 5: Productionization & cloud. Three-container Docker Compose stack (engine, website, credits), CloudFormation template for AWS, S3-backed history, and optional Cognito OIDC authentication.
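The frame-sampling step from Phase 4 (up to 64 frames at 2 FPS) reduces to a small index computation. This is a pure-function sketch; the function name and loop structure are illustrative, and actual decoding would go through OpenCV or ffmpeg.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 2.0,
                         max_frames: int = 64) -> list[int]:
    """Frame indices to decode when resampling a clip from native_fps
    down to target_fps, capped at max_frames. For a 60 s clip at 30 fps
    this yields exactly 64 indices, one every 15 source frames."""
    step = native_fps / target_fps  # source frames per sampled frame
    indices, t = [], 0.0
    while int(t) < total_frames and len(indices) < max_frames:
        indices.append(int(t))
        t += step
    return indices
```

Capping at `max_frames` keeps the Pinpoint input a fixed 64-frame window regardless of clip length, so longer videos simply cover less wall-clock time per window.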
Like what you see?
I'm always open to collaborations on AI, robotics, edge computing, or embedded systems.