AI Safety & Media Forensics

Multi-Modal Deepfake Detection System

7-Model Heterogeneous Ensemble · PyTorch + TensorFlow

Status: Completed · AI Safety & Media Forensics
Ensemble Accuracy: 72.3%
Models in Ensemble: 7 (1 PyTorch + 6 TF)
Frames / Clip: 64 @ 2 FPS
Inference (60 s clip): ~35 s CPU

Overview

A heterogeneous ensemble combines one PyTorch model and six TensorFlow models behind a unified loader:

- Pinpoint (weight 1.3): audio-visual transformer with a ResNet18 visual extractor, a Conv1D/GRU MFCC audio extractor, 8-head gated cross-attention, and a transformer encoder, covering lip-sync and temporal coherence.
- EfficientNet-B4 (weight 1.2): compression and frequency-domain artifacts.
- ResNet-50 v1 and v2 (weight 1.0 each): deep facial features and spatial patterns.
- VGG-16 v1 and v2 (weight 0.9 each): texture, edge, and gradient cues.
- InceptionV3 (weight 1.1): multi-scale semantic consistency.

Configurable ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`) let callers trade latency for accuracy. A FastAPI inference service extracts 64 frames at 2 FPS and a 16 kHz audio track, runs the selected models, and combines their scores with weighted averaging plus an inter-model standard-deviation confidence label. The frontend renders per-frame scores, attention heatmaps, mel spectrograms, and a model-consensus view. A separate credits microservice and an Nginx-fronted website complete a three-container Docker Compose stack, deployable to AWS via CloudFormation with optional S3 history and Cognito OIDC auth.
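
As a concrete sketch of that scoring step, the snippet below weight-averages per-model fake probabilities and maps inter-model disagreement to a confidence label. The weights mirror the list above; the model keys, the `combine` helper, and the std-dev cutoffs are illustrative, since the project fixes five confidence levels but this write-up does not state the thresholds.

```python
import numpy as np

# Per-model weights from the ensemble description above.
WEIGHTS = {"pinpoint": 1.3, "efficientnet_b4": 1.2, "inception_v3": 1.1,
           "resnet50_v1": 1.0, "resnet50_v2": 1.0,
           "vgg16_v1": 0.9, "vgg16_v2": 0.9}

# Illustrative cutoffs: anything noisier than the last band falls through
# to "very low". The real thresholds are not given in this write-up.
CONFIDENCE_BANDS = [(0.05, "very high"), (0.10, "high"),
                    (0.18, "medium"), (0.28, "low")]

def combine(scores: dict) -> dict:
    """Weighted average of per-model fake probabilities; the standard
    deviation across models becomes the 5-level confidence label."""
    names = list(scores)
    probs = np.array([scores[n] for n in names])
    weights = np.array([WEIGHTS[n] for n in names])
    spread = float(probs.std())
    label = next((lab for cut, lab in CONFIDENCE_BANDS if spread <= cut),
                 "very low")
    return {"score": float(np.average(probs, weights=weights)),
            "std": spread, "confidence": label}
```

The intent is that a tight cluster of scores yields a high-confidence verdict even at a middling fake probability, while wide disagreement flags the verdict for human review.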

The Problem

Single-modality deepfake detectors overfit to the generator they were trained on and collapse the moment a new synthesis pipeline appears. Real misuse (political clips, fraud calls, fabricated evidence) almost always pairs manipulated video with re-timed or synthesized audio, so an image-only classifier is half-blind. The system also had to ship as something a non-ML reviewer could actually use: a calibrated verdict, the frames that drove it, and an audit trail.

The Approach

Seven specialized models run behind one ensemble loader. The PyTorch Pinpoint network encodes 64 video frames through a ResNet18 visual extractor, encodes 16 kHz audio as MFCCs through a Conv1D + GRU stack, and fuses both streams via an 8-head gated cross-attention block (a sigmoid gate over an audio→video attention map) followed by a transformer encoder (3 layers, 256-dim, 8 heads). Six TensorFlow CNNs (EfficientNet-B4, ResNet-50 v1/v2, VGG-16 v1/v2, and InceptionV3) cover spatial, texture, gradient, and multi-scale artifacts at their native input resolutions (224 or 299). Per-model weights (Pinpoint 1.3, EfficientNet-B4 1.2, InceptionV3 1.1, ResNets 1.0, VGGs 0.9) drive a weighted average, and the standard deviation across model scores maps to a 5-level confidence label. Configurable ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`) let callers trade latency for accuracy or fall back gracefully when audio is missing. The system splits into three Docker services: Detection Engine (FastAPI + PyTorch + TensorFlow), Website (Nginx + vanilla-JS Canvas visualizations), and Credits (FastAPI + SQLite). It deploys to AWS via CloudFormation with optional S3 history and Cognito OIDC.
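
A minimal PyTorch sketch of that fusion block, assuming the stated 256-dim, 8-head setup: video tokens query audio tokens, and a sigmoid gate decides per token how much attended audio context to mix back into the visual stream. The gate wiring (concatenate, project, sigmoid) and the residual + LayerNorm placement are assumptions, not the verbatim Pinpoint code.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Audio→video cross-attention with a sigmoid gate on the attended
    context, per the configuration described above."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T_v, dim) frame embeddings; audio: (B, T_a, dim) audio tokens
        attended, _ = self.attn(query=video, key=audio, value=audio)
        g = self.gate(torch.cat([video, attended], dim=-1))  # per-token gate in (0, 1)
        return self.norm(video + g * attended)  # gated residual fusion
```

The fused sequence then feeds the 3-layer transformer encoder for the temporal-coherence verdict.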

Results

The ensemble reaches 72.3% accuracy on FaceForensics++ (c23), outperforming each constituent model and exceeding published single-model baselines on the same split (XceptionNet 65.3%, EfficientNet-B4 68.9%, Multi-Attentional 70.1%, ViT 69.5%). The Pinpoint audio-visual head meaningfully closes the gap on lip-sync attacks like Face2Face and Wav2Lip, and the `visual_only` ensemble degrades gracefully on muted clips. End-to-end inference takes ~35 s for a 60 s clip on CPU and is ~10× faster on a 6 GB GPU. The work was published in IJSRNSC 2024 and is operable end-to-end via Docker Compose or a one-click CloudFormation deploy.

Process & Timeline

1. Phase 1: TensorFlow CNN bench

   Fine-tuned EfficientNet-B4, ResNet-50 (v1/v2), VGG-16 (v1/v2), and InceptionV3 on FaceForensics++ c23 with framework-specific input pipelines and per-model focus areas (texture, edges, multi-scale, frequency); the per-backbone pipelines are sketched after this list.

2. Phase 2: Pinpoint transformer (PyTorch)

   Built a ResNet18 visual extractor + MFCC Conv1D/GRU audio extractor fused through an 8-head gated cross-attention block and a 3-layer transformer encoder for lip-sync and audio-visual coherence; the audio branch is sketched after this list.

3. Phase 3: Heterogeneous ensemble loader

   Wrote a single loader that mixes PyTorch and TensorFlow detectors, applies per-model weights, exposes named ensemble groups (`default`, `fast`, `single`, `maximum_accuracy`, `visual_only`), and emits a std-dev-based confidence label; see the loader sketch after this list.

4. Phase 4: Inference service & UI

   FastAPI engine extracting 64 frames @ 2 FPS plus 16 kHz audio; vanilla-JS Canvas frontend rendering per-frame scores, heatmaps, mel spectrograms, and model-consensus views. The extraction step is sketched after this list.

5. Phase 5: Productionization & cloud

   Three-container Docker Compose stack (engine, website, credits), a CloudFormation template for AWS, S3-backed history, and optional Cognito OIDC authentication.
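
First, a sketch of Phase 1's framework-specific input pipelines: each Keras backbone is fed at its native resolution with its own family's `preprocess_input` (caffe-style channel means for VGG/ResNet-v1, [-1, 1] scaling for Inception/ResNet-v2). The registry and helper are illustrative, and the sizes follow the 224/299 convention stated above.

```python
import tensorflow as tf
from tensorflow.keras.applications import (
    efficientnet, inception_v3, resnet, resnet_v2, vgg16,
)

# Hypothetical registry: native input size + framework-specific
# preprocessing per backbone (sizes per the 224/299 convention above).
PIPELINES = {
    "efficientnet_b4": (224, efficientnet.preprocess_input),
    "resnet50_v1":     (224, resnet.preprocess_input),
    "resnet50_v2":     (224, resnet_v2.preprocess_input),
    "vgg16":           (224, vgg16.preprocess_input),
    "inception_v3":    (299, inception_v3.preprocess_input),
}

def make_batch(frames, model_name: str) -> tf.Tensor:
    """Resize a stack of RGB frames (N, H, W, 3) to the backbone's native
    resolution and apply that family's own preprocess_input."""
    size, preprocess = PIPELINES[model_name]
    x = tf.image.resize(tf.convert_to_tensor(frames, tf.float32), (size, size))
    return preprocess(x)
```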
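
Next, the Phase 2 audio branch, sketched with torchaudio: MFCCs from 16 kHz audio run through a small Conv1D stack and a GRU, yielding the token sequence that the gated cross-attention (sketched under The Approach) consumes. Channel widths, kernel sizes, and the 40-coefficient MFCC are illustrative rather than the exact Pinpoint configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioExtractor(nn.Module):
    """MFCC -> Conv1D -> GRU audio branch from Phase 2; layer sizes are
    illustrative, chosen so the output matches the 256-dim fusion tokens."""

    def __init__(self, n_mfcc: int = 40, hidden: int = 256):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=16_000, n_mfcc=n_mfcc)
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (B, samples), mono 16 kHz audio
        feats = self.conv(self.mfcc(waveform))       # (B, hidden, frames)
        tokens, _ = self.gru(feats.transpose(1, 2))  # (B, frames, hidden)
        return tokens  # audio tokens for the cross-attention block
```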
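
Phase 3's group dispatch, as a minimal sketch. The write-up names the five groups but not their rosters, so the memberships below are assumptions (Pinpoint excluded from `visual_only` since it is the audio-visual model); every registry entry is a thin adapter exposing a common `predict(clip)` → fake-probability interface over either framework.

```python
import numpy as np

WEIGHTS = {"pinpoint": 1.3, "efficientnet_b4": 1.2, "inception_v3": 1.1,
           "resnet50_v1": 1.0, "resnet50_v2": 1.0,
           "vgg16_v1": 0.9, "vgg16_v2": 0.9}

# Group rosters are illustrative; only the group names come from the project.
ENSEMBLE_GROUPS = {
    "maximum_accuracy": list(WEIGHTS),
    "visual_only": [n for n in WEIGHTS if n != "pinpoint"],
    "fast": ["efficientnet_b4", "inception_v3"],
    "single": ["efficientnet_b4"],
}
ENSEMBLE_GROUPS["default"] = ENSEMBLE_GROUPS["maximum_accuracy"]

class EnsembleLoader:
    """Unified front for PyTorch and TensorFlow detectors. Each entry in
    `registry` is an adapter exposing predict(clip) -> float in [0, 1]."""

    def __init__(self, registry: dict):
        self.registry = registry

    def run(self, clip, group: str = "default") -> dict:
        names = [n for n in ENSEMBLE_GROUPS[group] if n in self.registry]
        scores = np.array([self.registry[n].predict(clip) for n in names])
        weights = np.array([WEIGHTS[n] for n in names])
        return {"score": float(np.average(scores, weights=weights)),
                "std": float(scores.std()),
                "per_model": dict(zip(names, scores.tolist()))}
```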
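
Finally, the Phase 4 extraction step, sketched with OpenCV and ffmpeg. Both tools and the helper names are assumptions about the implementation; the 64-frame cap, 2 FPS sampling, and 16 kHz mono audio come from the write-up.

```python
import subprocess
import tempfile

import cv2
import numpy as np

def sample_frames(video_path: str, fps: float = 2.0, max_frames: int = 64) -> np.ndarray:
    """Grab up to 64 RGB frames at ~2 FPS, matching the engine's sampling."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps  # fall back if FPS unknown
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, 0, 0, 3))

def extract_audio(video_path: str, sr: int = 16_000) -> str:
    """Dump a mono 16 kHz WAV track with ffmpeg for the audio branch."""
    wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1",
                    "-ar", str(sr), wav], check=True, capture_output=True)
    return wav
```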

Like what you see?

I'm always open to collaborations on AI, robotics, edge computing, or embedded systems.