MisIntel

Multi-modal Misinformation & Deepfake Detection Platform

A Next.js 14 + LangChain/LangGraph platform that analyzes text, images, audio, and video through specialized AI agents to detect both misinformation (false claims) and AI-generated content (deepfakes).


1. Problem Statement

What societal issue are we addressing?

The rapid rise of AI-generated misinformation threatens informed public discourse. As generative AI becomes more sophisticated, distinguishing authentic content from synthetic or manipulated media has become increasingly difficult for the average user. This crisis manifests in:

  • Deepfake videos of political figures making fabricated statements
  • AI-generated images used to spread false narratives during elections and crises
  • Voice cloning enabling fraud and impersonation at scale
  • Synthetic news articles flooding social media feeds with fabricated claims
  • Out-of-context media repurposed to mislead audiences

Why does it matter?

| Impact Area | Consequence |
|---|---|
| Democracy | Manipulated media undermines elections and public trust in institutions |
| Public Health | Medical misinformation leads to vaccine hesitancy and dangerous "cures" |
| Financial Markets | Fake news can manipulate stock prices and cause economic harm |
| Personal Safety | Deepfakes enable harassment, fraud, and non-consensual content |
| Journalism | Authentic reporting becomes indistinguishable from fabrication |

The key insight: content can be AI-generated yet factually true, or authentic media can be used to spread false claims. MisIntel scores both axes independently:

| Detection Type | What It Detects | Output |
|---|---|---|
| Misinformation | False claims, out-of-context content, manipulated facts | `isMisinfo`, `misinfoConfidence` |
| Deepfake/AI | AI-generated images/audio/video, voice cloning, synthetic media | `isAiGenerated`, `aiConfidence` |
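That independent scoring can be captured in a single result shape. The field names follow the table above; the interface and sample object are illustrative, not the project's actual types:

```typescript
// The two axes are scored independently: content can be AI-generated yet
// true, or authentic media can carry a false claim.
interface DetectionResult {
  isAiGenerated: boolean;
  aiConfidence: number;      // 0-100
  isMisinfo: boolean;
  misinfoConfidence: number; // 0-100
}

// Example: a real (non-AI) photo captioned with a false claim.
const realPhotoFalseClaim: DetectionResult = {
  isAiGenerated: false,
  aiConfidence: 88,
  isMisinfo: true,
  misinfoConfidence: 91,
};
```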

2. System Design

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                           INPUT LAYER                                │
│  Text | URL | Image | Video | Audio                                 │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      PREPROCESSING LAYER                             │
│  • Extract content (OCR, transcription, web scraping)               │
│  • Extract metadata (EXIF, timestamps, source info)                 │
│  • Normalize format                                                  │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    VECTOR DATABASE CHECK (QDRANT)                    │
│  • Search similar content embeddings                                 │
│  • Check fact-check cache                                            │
│  • Check AI pattern cache                                            │
│  • Check source credibility cache                                    │
│                                                                       │
│  IF HIGH SIMILARITY (90%+) → Return cached result                   │
│  IF LOW SIMILARITY → Continue to agents                              │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      SPECIALIZED AGENT SYSTEM                        │
│  (All agents run in parallel)                                        │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  MISINFORMATION DETECTION AGENTS                             │    │
│  │  • Fact-Check Agent → Google Fact Check API                  │    │
│  │  • Source Credibility Agent → Domain/author analysis         │    │
│  │  • Safety Agent → URL security check                         │    │
│  │  • Cross-Reference Agent → Multiple source comparison        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  AI GENERATION DETECTION AGENTS                              │    │
│  │  • Image Deepfake Agent → HuggingFace ViT + Gemini Vision    │    │
│  │  • Video Deepfake Agent → Gemini Files API + temporal check  │    │
│  │  • Audio Deepfake Agent → Gemini audio forensics             │    │
│  │  • Source Verification Agent → Reverse search                │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                       │
│  Each agent returns:                                                  │
│  • Verdict (true/false/uncertain)                                    │
│  • Confidence (0-100) — used as voting weight                       │
│  • Evidence list                                                     │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        VOTING SYSTEM                                 │
│                                                                       │
│  • Collect all agent verdicts                                        │
│  • Weighted voting (confidence = weight)                             │
│  • Calculate consensus score                                         │
│                                                                       │
│  IF CONSENSUS ≥ 70% → Return result                                  │
│  IF CONSENSUS < 70% → Trigger adversarial debate (TODO)             │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      RESULT SYNTHESIS                                │
│                                                                       │
│  • Combine voting results + debate outcome                           │
│  • Calculate final scores:                                           │
│    - Misinformation Score (0-100)                                   │
│    - AI Generation Score (0-100)                                    │
│  • Generate evidence report                                          │
│  • Store in Vector DB for future cache                              │
└─────────────────────────────────────────────────────────────────────┘
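The voting step above can be sketched as plain logic. Names like `AgentVerdict`, `tallyVotes`, and `needsDebate` are illustrative, not the actual implementation:

```typescript
// Hypothetical shape mirroring what each agent returns.
interface AgentVerdict {
  verdict: 'true' | 'false' | 'uncertain';
  confidence: number; // 0-100, used as the voting weight
}

// Weighted voting: each agent's confidence is its weight; the consensus
// score is the winning verdict's share of the total weight.
function tallyVotes(verdicts: AgentVerdict[]): { verdict: string; consensus: number } {
  const weights: Record<string, number> = { true: 0, false: 0, uncertain: 0 };
  let total = 0;
  for (const v of verdicts) {
    weights[v.verdict] += v.confidence;
    total += v.confidence;
  }
  const [verdict, weight] = Object.entries(weights).sort((a, b) => b[1] - a[1])[0];
  const consensus = total > 0 ? (weight / total) * 100 : 0;
  return { verdict, consensus };
}

// Consensus >= 70% returns the result; below that, escalate to debate.
function needsDebate(consensus: number): boolean {
  return consensus < 70;
}
```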

LangGraph Workflow

The system is built on LangGraph for stateful agent orchestration:

┌─────────────┐    ┌──────────────┐    ┌────────────┐    ┌────────┐    ┌──────────┐
│ Preprocess  │───▶│ Vector DB    │───▶│ Run Agents │───▶│ Voting │───▶│Synthesize│
│ (embedding) │    │ (90%+ cache) │    │ (parallel) │    │(<70%?) │    │ (store)  │
└─────────────┘    └──────────────┘    └────────────┘    └────┬───┘    └──────────┘
                          │                                   │
                          ▼                                   ▼
                   [Use Cached Result]              [Adversarial Debate]
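The five-node flow can be sketched with stubbed node functions. In the real system these are wired as LangGraph nodes with stateful edges; every name and stub body below is illustrative:

```typescript
// Minimal pipeline state passed between nodes (illustrative shape).
interface PipelineState {
  content: string;
  embedding?: number[];
  cacheHit?: boolean;
  verdict?: string;
}

type PipelineNode = (s: PipelineState) => Promise<PipelineState>;

// Stubbed nodes in the order shown in the workflow diagram.
const preprocess: PipelineNode = async (s) => ({ ...s, embedding: [0.1, 0.2] });
const vectorDbCheck: PipelineNode = async (s) => ({ ...s, cacheHit: false }); // 92% threshold in the real system
const runAgents: PipelineNode = async (s) => ({ ...s, verdict: 'uncertain' });
const voting: PipelineNode = async (s) => s;
const synthesize: PipelineNode = async (s) => ({ ...s, verdict: s.verdict ?? 'uncertain' });

// Conditional edge: a cache hit short-circuits straight to the cached result.
async function runPipeline(content: string): Promise<PipelineState> {
  let state: PipelineState = { content };
  state = await preprocess(state);
  state = await vectorDbCheck(state);
  if (state.cacheHit) return state; // use cached result
  state = await runAgents(state);
  state = await voting(state);
  return synthesize(state);
}
```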

Why Qdrant is Critical

Qdrant is the backbone of our caching and similarity search system:

| Use Case | How Qdrant Helps |
|---|---|
| Avoid Re-analysis | Content with 92%+ similarity returns the cached verdict instantly |
| Claim Deduplication | Same misinformation claim → reuse fact-check result |
| Pattern Matching | Find similar deepfake patterns across analyzed media |
| Source Credibility | Cache domain/author reputation scores |
| Scalability | Sub-second search across millions of embeddings |

Collections Used:

```typescript
const COLLECTIONS = {
  FACT_CHECKS: 'fact_checks',                // Cached claim verdicts
  AI_PATTERNS: 'ai_patterns',                // Known deepfake signatures
  SOURCE_CREDIBILITY: 'source_credibility',  // Domain reputation
  IMAGES: 'images',                          // Analyzed image results
  VIDEOS: 'videos',                          // Analyzed video results
  AUDIO: 'audio',                            // Analyzed audio results
  AGENT_RESULTS: 'agent_results',            // Full agent verdict cache
};
```

Similarity Threshold: 92% cosine similarity triggers cache hit (tuned to balance speed vs accuracy)


3. Multimodal Strategy

Data Types Supported

| Input Type | Preprocessing | Detection Focus |
|---|---|---|
| Text | Claim extraction, entity recognition | Misinformation only |
| URL | Web scraping, content extraction | Misinformation + source credibility |
| Image | OCR, EXIF extraction, base64 encoding | Deepfake + misinformation (smart routing) |
| Audio | Waveform analysis, transcription | Deepfake (voice cloning) + misinformation (claims) |
| Video | Frame extraction, audio separation | Deepfake + misinformation (parallel analysis) |

How Embeddings Are Created

Text Embeddings (768 dimensions):

```typescript
// Using HuggingFace sentence-transformers
const EMBEDDING_MODEL = 'sentence-transformers/all-mpnet-base-v2';

async function generateEmbedding(text: string): Promise<number[]> {
  const truncatedText = text.slice(0, 2000); // stay under the model's input limit
  const result = await hfClient.featureExtraction({
    model: EMBEDDING_MODEL,
    inputs: truncatedText,
  });
  return result as number[]; // 768-dimensional vector
}
```

Image Embeddings:

  • With description: Use semantic embedding of extracted text/scene description
  • Without: SHA-256 hash-based embedding for exact match detection
  • Future: CLIP embeddings for true visual similarity

Audio/Video Embeddings:

  • Content hash (SHA-256) converted to 768-dim vector
  • Cached with full analysis results for fast retrieval
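A hash-based embedding of this kind can be sketched with Node's `crypto` module. The expansion scheme below (cycling the 32 digest bytes out to 768 dimensions) is an assumption, not the project's exact method:

```typescript
import { createHash } from 'node:crypto';

// Expand a SHA-256 digest (32 bytes) into a 768-dimensional vector by
// cycling over the bytes and normalizing each to [0, 1]. Identical files
// always map to the identical vector, so exact re-uploads hit the cache;
// unlike a semantic embedding, near-duplicates do NOT land nearby.
function hashEmbedding(buffer: Buffer, dims = 768): number[] {
  const digest = createHash('sha256').update(buffer).digest();
  const vector: number[] = new Array(dims);
  for (let i = 0; i < dims; i++) {
    vector[i] = digest[i % digest.length] / 255;
  }
  return vector;
}
```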

How Embeddings Are Queried

```typescript
// Search with cosine similarity
const results = await qdrantClient.search(collectionName, {
  vector: embedding,
  limit: 5,
  score_threshold: 0.92, // 92% similarity threshold
});

// If match found → return cached verdict
// If no match → run full agent analysis → store result
```
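The hit/miss rule reduces to cosine similarity against the 0.92 threshold. Qdrant computes this server-side; the self-contained sketch below (with illustrative helper names) shows the same math locally:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const SCORE_THRESHOLD = 0.92;

// Mirror of the cache rule: at or above 0.92, reuse the cached verdict.
function isCacheHit(query: number[], stored: number[]): boolean {
  return cosineSimilarity(query, stored) >= SCORE_THRESHOLD;
}
```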

4. Search / Memory / Recommendation Logic

How Retrieval Works

┌──────────────────────────────────────────────────────────────────┐
│                      RETRIEVAL FLOW                               │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  1. EMBED INPUT                                                   │
│     └─→ Generate 768-dim vector from content                     │
│                                                                   │
│  2. SEARCH QDRANT                                                 │
│     └─→ Query relevant collection(s) with embedding              │
│     └─→ Return top-5 matches above 0.92 threshold                │
│                                                                   │
│  3. EVALUATE MATCHES                                              │
│     ├─→ Score ≥ 0.92: Cache HIT → Return stored verdict          │
│     └─→ Score < 0.92: Cache MISS → Continue to agents            │
│                                                                   │
│  4. STORE NEW RESULTS                                             │
│     └─→ After agent analysis, upsert with payload:               │
│         { verdict, confidence, evidence, timestamp, inputHash }  │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

How Memory is Stored, Updated, and Reused

Storage Schema:

```typescript
// Qdrant point structure
{
  id: number,           // Hash-based or timestamp
  vector: number[],     // 768-dim semantic embedding
  payload: {
    verdict: 'true' | 'false' | 'uncertain',
    confidence: number,           // 0-100
    isAiGenerated: boolean,
    aiConfidence: number,
    isMisinfo: boolean,
    misinfoConfidence: number,
    evidence: string[],
    sources: string[],
    inputHash: string,            // For exact match
    timestamp: string,
    agentResults: AgentVerdict[], // Full breakdown
  }
}
```

Update Strategy:

  • Upsert: Same content hash → updates existing record
  • TTL: No automatic expiration (fact-checks remain valid)
  • Manual refresh: Re-analysis overrides cached result
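The upsert-on-same-hash behavior falls out of deriving the point id from the content hash: identical content maps to the same id, so a Qdrant upsert overwrites rather than duplicates. A sketch with an illustrative `pointIdFor` helper (the real id scheme may differ):

```typescript
import { createHash } from 'node:crypto';

// Derive a stable numeric point id from the content's SHA-256 hash.
// Re-analyzing identical content yields the same id, so an upsert
// updates the existing record instead of creating a duplicate.
function pointIdFor(content: string): number {
  const digest = createHash('sha256').update(content).digest();
  // Use the first 6 bytes to stay within JavaScript's safe-integer range.
  return digest.readUIntBE(0, 6);
}
```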

Reuse Patterns:

| Scenario | Action |
|---|---|
| Identical content | Instant return (100% match) |
| Near-duplicate (>92%) | Return cached with "similar content" flag |
| Rephrased claim | Semantic match finds original verdict |
| New content | Full analysis → store for future |

Caching Performance

| Metric | Value |
|---|---|
| Cache hit latency | <100 ms |
| Full analysis latency | 3-8 seconds |
| Cache hit rate (production) | ~40% for viral content |
| Storage efficiency | ~1 KB per analyzed item |

5. Limitations & Ethics

Known Failure Modes

| Limitation | Description | Mitigation |
|---|---|---|
| Novel deepfake techniques | New AI models may evade detection | Regular model updates, ensemble approach |
| Low-quality input | Heavily compressed media reduces accuracy | Warn users when quality is too low |
| Adversarial attacks | Crafted inputs designed to fool detectors | Multiple independent tools, voting system |
| Language coverage | Best accuracy for English content | Translation layer for other languages |
| Satire/parody | May flag intentional fiction as "misinfo" | Context signals, disclaimer detection |
| Breaking news | No fact-checks available yet | Low confidence score, "unverified" verdict |
| Partial deepfakes | Only face swapped in otherwise real video | Per-element analysis (face, voice, background) |

Bias, Privacy, and Safety Considerations

Bias Risks:

| Bias Type | Concern | Mitigation |
|---|---|---|
| Training data bias | HuggingFace models trained on Western media | Diverse model ensemble, confidence thresholds |
| Political bias | Fact-check sources may have editorial slant | Multiple cross-referenced sources |
| False positive harm | Incorrectly flagging authentic content | High threshold (>70% confidence), uncertainty option |
| Automation bias | Users may over-trust AI verdicts | Always show evidence, encourage critical thinking |

Privacy Considerations:

  • ✅ No user data stored beyond analysis cache
  • ✅ Content embeddings are not designed to be reversible to the original media
  • ✅ No PII extraction or storage
  • ⚠️ Uploaded media temporarily processed (deleted after analysis)
  • ⚠️ Qdrant caches analyzed content (configurable retention)

Safety Measures:

| Measure | Implementation |
|---|---|
| Rate limiting | Prevent abuse of API endpoints |
| Safe browsing check | Flag malicious URLs before analysis |
| Content moderation | Gemini refuses to analyze illegal content |
| Transparency | Full evidence trail for every verdict |
| Appeal mechanism | Users can request manual review (future) |

Ethical Principles:

  1. Transparency over opacity — Show why a verdict was reached
  2. Uncertainty is valid — "Uncertain" is better than a wrong answer
  3. Human in the loop — AI assists, humans decide
  4. No censorship — Detect and inform, don't block or remove
  5. Open methodology — Document how detection works

Quick Start

```bash
# Install dependencies
npm install

# Start Qdrant (required for caching)
docker run -p 6333:6333 qdrant/qdrant

# Set environment variables
cp .env.example .env.local
# Add: FACT_GEMINI_API_KEY, HUGGINGFACE_API_KEY, etc.

# Run development server
npm run dev
```

Input Type Flows

Text → Preprocessing → Vector DB (check similar claims) 
→ [Fact-Check Agent] → Voting → Result

Image → Classify (faces? text?) → Route to focus
→ [Deepfake Tools | Misinfo Tools | Both] → Voting → Result

Video → Upload to Gemini Files API → Parallel analysis
→ [Deepfake Analysis + Content Analysis] → Combine → Result

Audio → Gemini audio analysis (parallel)
→ [Deepfake Detection + Transcript Fact-Check] → Combine → Result
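The image routing step ("faces? text?") can be sketched as a small classifier-to-agents map. The flag names and agent labels below are illustrative:

```typescript
// Features detected during image preprocessing (illustrative flags).
interface ImageFeatures {
  hasFaces: boolean; // candidate for face-swap deepfakes
  hasText: boolean;  // candidate for false-claim analysis via OCR
}

// Route to deepfake tools, misinformation tools, or both.
function routeImage(features: ImageFeatures): string[] {
  const agents: string[] = [];
  if (features.hasFaces) agents.push('deepfake');
  if (features.hasText) agents.push('misinfo');
  // With no strong signal, conservatively run both agent sets.
  return agents.length > 0 ? agents : ['deepfake', 'misinfo'];
}
```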

Tech Stack

| Component | Technology |
|---|---|
| Framework | Next.js 14 (App Router) |
| AI Orchestration | LangChain + LangGraph |
| Vector Database | Qdrant |
| LLM | Google Gemini 2.5 Flash |
| ML Models | HuggingFace Inference API |
| Embeddings | sentence-transformers/all-mpnet-base-v2 |
| Deployment | Docker + Cloud Run |

License

MIT
