Ushnesha/PhishNChips

PhishNChips - Distributed Phishing Intelligence Platform

MIT License · Python 3.11 · Elasticsearch 8.11

A geo-distributed phishing intelligence network demonstrating distributed database concepts using Elasticsearch and Kibana. The system enables real-time threat detection and global dissemination of phishing intelligence through browser extensions and a distributed backend.

🚀 Quick Start

# Start the cluster
make quick-start

# Or manually:
docker-compose up -d
python3 scripts/setup_elasticsearch.py
python3 scripts/generate_data.py --count 1000 --send --balanced

Access Points:

📋 Project Overview

Features

  • Distributed 3-Node Elasticsearch Cluster (US, EU, Asia regions)
  • FastAPI Backend with real-time ingestion and querying
  • Browser Extension for phishing detection and reporting
  • Kibana Dashboards for real-time threat visualization
  • Automated Testing for fault tolerance and scalability
  • Synthetic Data Generator for testing with 100K-1M records
  • Comprehensive Documentation and deployment guides

Distributed Database Concepts Demonstrated

  • Sharding: 2 primary shards per index, distributed across nodes
  • Replication: RF=2, automatic failover on node failure
  • Distributed Queries: cross-region aggregations and searches
  • Fault Tolerance: survives a single node failure with <60s recovery
  • Horizontal Scalability: nodes can be added dynamically, with automatic rebalancing
  • Consistency Models: quorum-based writes, eventual-consistency reads
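As a hedged sketch of how these concepts translate into concrete index settings (the helper name is illustrative; the shard and replica counts come from the list above, and `index.write.wait_for_active_shards` is one Elasticsearch setting for requesting quorum-style write safety):

```python
def make_index_settings(primary_shards: int = 2, replicas: int = 2) -> dict:
    """Build the settings body for a regional phishing index.

    Two primary shards spread data across nodes (sharding); two replicas
    (RF=2) keep copies of each shard on other nodes (replication/failover).
    """
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
            # Quorum-style write safety: require multiple active shard copies.
            "index.write.wait_for_active_shards": "2",
        }
    }

settings = make_index_settings()
```

A body like this would be passed when creating each regional index, e.g. via the Elasticsearch Python client's `indices.create` call.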

📚 Documentation

🏗️ Architecture

Browser Extensions (US, EU, ASIA)
        ↓ HTTP POST
    FastAPI Service
        ↓ Index
Elasticsearch Cluster (3 nodes, cross-region replication)
        ↓ Query
Kibana Dashboard

See ARCHITECTURE.md for detailed architecture diagrams and explanations.

🛠️ Technology Stack

  • Backend: Python 3.11, FastAPI
  • Database: Elasticsearch 8.11.0 (distributed cluster)
  • Visualization: Kibana 8.11.0
  • Frontend: Chrome Extension (JavaScript)
  • Containerization: Docker & Docker Compose
  • Testing: Python scripts with concurrent load testing

📦 Project Structure

Proj_app/
├── backend/                  # FastAPI application
│   ├── main.py              # API endpoints
│   ├── config.py            # Configuration
│   ├── Dockerfile           # API container
│   └── requirements.txt     # Python dependencies
├── browser-extension/        # Chrome extension
│   ├── manifest.json        # Extension config
│   ├── background.js        # Detection logic
│   ├── popup.html/js        # User interface
│   └── content.js           # Page analysis
├── scripts/                  # Utility scripts
│   ├── generate_data.py     # Synthetic data generator
│   ├── test_fault_tolerance.py  # Fault tolerance tests
│   ├── test_scalability.py  # Scalability tests
│   └── setup_elasticsearch.py   # ES initialization
├── config/                   # Configuration files
│   ├── elasticsearch_template.json  # Index template
│   └── kibana_dashboards.ndjson     # Dashboard config
├── docker-compose.yml        # Container orchestration
├── Makefile                  # Command shortcuts
└── Documentation files       # Setup, architecture, deployment

🧪 Testing & Evaluation

Fault Tolerance Testing

Test cluster resilience under node failures:

make test-fault
# Tests each node failure, measures recovery time and data availability

Scalability Testing

Measure performance improvements with horizontal scaling:

make test-scale
# Tests throughput and latency with different cluster configurations

Data Generation

Generate synthetic phishing data:

# Generate 10,000 balanced reports
python3 scripts/generate_data.py --count 10000 --send --balanced

# Generate region-specific data
python3 scripts/generate_data.py --count 1000 --region US --send

📊 Performance Metrics

Based on testing with a 3-node cluster (8GB RAM, 4 CPUs per node):

  • Write throughput: ~800 reports/sec (concurrent)
  • Write latency (P95): 25ms
  • Query latency (P95): 40ms
  • Recovery time: <60s after node failure
  • Data capacity: tested with 1M records
  • Scalability: +30% throughput with a 4th node

🌟 Key Achievements

  • ✅ End-to-end distributed phishing intelligence platform
  • ✅ Demonstrates 6+ distributed database concepts
  • ✅ Handles 100K-1M records with low latency
  • ✅ Automatic failover with zero data loss
  • ✅ Real-time threat visualization
  • ✅ Production-ready architecture with monitoring

🔧 Development Commands

# Start all services
make start

# Setup Elasticsearch indices
make setup-es

# Generate test data
make generate-data

# Run fault tolerance tests
make test-fault

# View logs
make logs

# Stop services
make stop

# Clean up everything
make clean

🌐 Browser Extension

Install the PhishNChips extension to detect and report phishing sites:

  1. Generate icons: make icons
  2. Open Chrome → chrome://extensions/
  3. Enable "Developer mode"
  4. Click "Load unpacked" → Select browser-extension/ folder
  5. Visit any website to see the risk analysis

📝 API Endpoints

  • GET /: Health check
  • POST /report: Submit a phishing report
  • GET /threats: Query recent threats
  • GET /hotspots: Regional threat statistics
  • GET /cluster/health: Cluster health metrics
  • GET /stats: System statistics

Full API documentation: http://localhost:8000/docs
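As a minimal sketch of how a client might address these endpoints (the `PhishNChipsClient` class is hypothetical; only URL construction is shown, and real calls would use an HTTP library such as `requests` against a running cluster):

```python
from urllib.parse import urljoin

class PhishNChipsClient:
    """Tiny client sketch for the endpoints listed above."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        # Normalize so urljoin treats the base as a directory.
        self.base_url = base_url.rstrip("/") + "/"

    def endpoint(self, path: str) -> str:
        """Resolve an endpoint path against the API base URL."""
        return urljoin(self.base_url, path.lstrip("/"))

client = PhishNChipsClient()
report_url = client.endpoint("/report")    # POST target for new reports
threats_url = client.endpoint("/threats")  # GET recent threats
```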

🔐 Security Note

The current implementation is for development/demonstration only. For production:

  • Enable X-Pack Security with TLS
  • Implement API authentication
  • Use HTTPS everywhere
  • Enable audit logging
  • See DEPLOYMENT.md for security hardening

📖 Academic Context

This project demonstrates distributed database systems concepts including:

  • Partitioning & Sharding - Data distribution across nodes
  • Replication - Fault tolerance and availability
  • Consistency Models - CAP theorem tradeoffs
  • Distributed Queries - Cross-node aggregations
  • Fault Tolerance - Automatic recovery
  • Horizontal Scalability - Linear performance improvements

🤝 Contributing

This is an academic project for demonstration purposes. Suggestions and improvements welcome!

📄 License

MIT License - See LICENSE file for details

👥 Authors

Created as part of Distributed Database Systems course project.

🙏 Acknowledgments

  • Elasticsearch documentation and community
  • FastAPI framework
  • Chrome extension development guides
  • Distributed systems research papers

Original Project Proposal

1. Introduction

1.1. Background

In the contemporary digital age, phishing remains one of the most damaging cyber threats, with hundreds or thousands of new malicious websites and domains created every day. Most traditional detection systems depend on a centralized database that periodically updates threat signatures. This centralized setup leads to update delays, creates single points of failure, and leaves users in different geographies exposed to new phishing attacks unique to their locations. Distributed database systems offer an alternative: data is stored, replicated, and queried across multiple nodes and regions. Using technologies like Elasticsearch and Kibana, real-time browser security data can be ingested, analyzed, and visualized in a fault-tolerant and scalable manner [1].

PhishNChips envisions a distributed phishing intelligence network in which browsers in different regions act as sensors, sending notices of suspicious URLs to Elasticsearch clusters in their location. The clusters share and replicate threat intelligence globally, protecting users who would not otherwise be aware of a newly created threat. Through the PhishNChips project, we demonstrate how replication, partitioning, fault tolerance, and distributed querying can be leveraged in a real-world cybersecurity use case.

1.2. Problem Statement

Phishing campaigns evolve faster than traditional threat detection mechanisms. Browser extensions exist that can identify a malicious website on the local machine, but that information often stays in a silo or propagates globally only after a delay [2] [3] [4]. PhishNChips addresses this challenge by creating a distributed, real-time phishing intelligence platform built on Elasticsearch [1] [5]. Each node holds a repository of phishing reports from its local region and replicates those data sets globally, creating geo-redundant awareness. The system allows for scalable ingestion, fault-tolerant data replication, and live visualization through dashboards created in Kibana.

1.3. Objectives

  • To architect and deploy a geo-distributed phishing intelligence platform built on Elasticsearch’s multi-cluster capabilities.
  • To enable an end-to-end data pipeline, with a real-time threat reporting browser client and backend API for querying and data ingestion [1] [3] [4].
  • To ensure data consistency, high availability, and scalability in a distributed system, with special emphasis on cross-cluster replication and failover automation using millions of records.
  • To measure the system's performance, dependability, and latency in disseminating data under different conditions, such as simulated regional node failure and network partition.
  • To create a real-time visualization dashboard with Kibana presenting a global, composite picture of the phishing threats in an illustration of the power of distributed querying [1].

Deliverables:

  • Browser Extension & Data Collection
  • Distributed Elasticsearch Cluster Setup
  • API and Backend Services
  • Kibana Dashboards and Evaluation

2. System Architecture and Design

Fig. 1: High-level architecture of the system using Elasticsearch and Kibana

2.1 System Design

Data Collection: To identify suspicious URLs, browser extensions leverage well-defined heuristic or machine-learning (ML) scoring approaches (e.g., unusual domain patterns, or login forms present on a non-HTTPS page). For now, we confine the scope of the project to URL collection and threat detection.

Each browser extension sends a JSON report to a remote endpoint that looks something like the following:

{
  "url": "http://example-login.net",
  "risk_score": 0.92,
  "region": "US",
  "timestamp": "2025-10-20T14:00:00Z"
}
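A small validator for this report shape can be sketched as follows (the helper and the 0-to-1 risk-score range are illustrative assumptions; the field names and region codes come from this document):

```python
REQUIRED_FIELDS = {"url", "risk_score", "region", "timestamp"}
VALID_REGIONS = {"US", "EU", "ASIA"}  # regions named elsewhere in this document

def validate_report(report: dict) -> bool:
    """Check that a report matches the JSON shape shown above."""
    if not REQUIRED_FIELDS.issubset(report):
        return False
    # Assumed convention: risk_score is a probability-like value in [0, 1].
    if not 0.0 <= report["risk_score"] <= 1.0:
        return False
    return report["region"] in VALID_REGIONS

example = {
    "url": "http://example-login.net",
    "risk_score": 0.92,
    "region": "US",
    "timestamp": "2025-10-20T14:00:00Z",
}
```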

Data Ingestion Layer: A Python FastAPI service receives incoming reports and indexes them into its local Elasticsearch node. Each regional node (i.e., us-east, eu-central, asia-south) will store reports in region-tagged indices (i.e., phish-us, phish-eu, phish-asia).

Elasticsearch Cluster

  • Multi-node cluster with sharding and replication [7].
  • An index template specifies mappings for url, risk_score, region, and timestamp.
  • Automatic replication makes sure that the threat data from one region is available across nodes in other regions.
  • Elasticsearch's distributed query engine makes real-time global searches possible on all nodes.
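The index template mentioned above could be sketched as a composable-template body in the ES 8 style (the field names come from the proposal; the `phish-*` pattern matches the regional index names, while the specific type choices are assumptions):

```python
def make_phish_template() -> dict:
    """Sketch of an index template for the phishing report indices.

    keyword: exact-match fields (URLs, region codes); float and date
    cover the risk score and report timestamp respectively.
    """
    return {
        "index_patterns": ["phish-*"],  # matches phish-us, phish-eu, phish-asia
        "template": {
            "mappings": {
                "properties": {
                    "url": {"type": "keyword"},
                    "risk_score": {"type": "float"},
                    "region": {"type": "keyword"},
                    "timestamp": {"type": "date"},
                }
            }
        },
    }

template = make_phish_template()
```

A body like this would be registered once via `PUT _index_template/phish` so every regional index picks up the same mappings.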

Kibana Visualization: Kibana dashboards display:

  • A real-time indicator of phishing alerts
  • Geographic heatmaps of collected URLs
  • Time-series graphs of phishing occurrences over time
  • Brand impersonation detection and analysis

Replication and Fault Tolerance

  • Document shards are replicated across nodes and there is an automatic reallocation of shards to ensure continuity of operations if a node fails to respond.
  • The cluster provides near-real-time read/write consistency using quorum-based strategies [1].
  • The cluster will recover automatically after the node restarts.

API Service: The backend provides:

  • POST /report (phishing report submission)
  • GET /threats (recent threats retrieval)
  • GET /hotspots (aggregate of stats based on region)

2.2 Implementation Plan

Our implementation strategy is based on an up-to-date, containerized technology stack that allows us to realistically simulate a geo-distributed setup on local hardware [5]. We plan to do our development in logical phases, starting with the backend infrastructure and incrementally building out the data collection, visualization, and evaluation layers.

Programming Languages, Databases, and Tools

  • Backend and API: Our core API service will be implemented in Python 3.11, utilizing the high-performance FastAPI framework to implement the ingestion and query endpoints. This enables rapid development and asynchronous request handling, ideal for a real-time data pipeline [1].
  • Distributed Database: Elasticsearch will be our primary distributed database system. We will leverage its native support for sharding, replication, and distributed querying to store and manage the phishing report data [5].
  • Visualization and Analytics: Kibana will be our primary tool for creating interactive dashboards and visualizing the real-time threat intelligence data stored in Elasticsearch.
  • Containerization and Simulation: Docker Compose is also a central component of our solution. It will be used to define and run a multi-node, multi-region Elasticsearch cluster on a single host, providing a handy and manageable simulation of a real-world distributed environment.
  • Frontend Data Source: A simple Browser Extension with standard web technologies (JavaScript, HTML, CSS) will be developed to act as the client-side data sensor.
  • Phishing Detection Model: We plan to include a threat detection model within the extension, which might be implemented using a lightweight machine-learning model such as DistilBERT or a simpler heuristic-based approach [2].

Development Approach

The project will be implemented in a series of integrated steps:

  1. Distributed Cluster Setup: Much of the project’s effort goes into setting up the distributed environment. Using Docker Compose, we will roll out and configure a three-node Elasticsearch cluster. The setup will simulate regional nodes, with indices specified to segregate data from different geographic regions (e.g., phish-us, phish-eu, phish-asia).
  2. API Layer Development: Once the cluster is online, we will build the API layer in FastAPI. The service will be the single entry point for data ingestion, with one endpoint to ingest reports from the browser [3] [4]. The API will validate incoming data and take advantage of Elasticsearch's bulk indexing capability for efficient, high-throughput insertion.
  3. ML Model Deployment and Ingest Pipeline Configuration: We will load a pre-trained phishing detection model (e.g., a heuristic or an NLP model such as DistilBERT) directly into Elasticsearch [2]. An ingest pipeline will be configured to run inference with this model: when the API inserts a new URL into the cluster, the pipeline runs automatically, generates a risk_score, and appends it to the document before indexing.
  4. Client-Side Data Collection: In parallel, a simple Chrome extension will be created [2] [3] [4]. It will have a single function: to flag suspicious URLs (initially through simulation) and forward them as well-formed JSON to the API ingestion endpoint.
  5. Demonstration of Distributed Features: As data flows into the cluster, we will focus on testing and exhibiting its inherent distributed characteristics. Regional node failure will be emulated by shutting down a Docker container; in addition to Elasticsearch's internal metrics, our API will be used to verify the cluster's auto-failover and self-healing.
  6. Dashboard Configuration and Evaluation: Finally, we will connect Kibana to our cluster and install live visualization dashboards. In the final evaluation stage, we will capture major performance measurements such as replication lag, query latency, and overall fault tolerance across the test failures.

2.3 Data Strategy

Data Generation: Synthetic phishing data will be generated to simulate phishing detections across multiple geographic regions.

Nature of Data

  • Volume: 100k-1M simulated phishing reports.
  • Primary fields: url, risk_score, region, timestamp.
  • Indexing strategy: reports will be indexed by region for improved regional and cross-regional searching.
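The region-to-index routing implied by this strategy can be sketched as a small helper (the function name is illustrative; the index names appear throughout this document):

```python
REGION_INDEX = {"US": "phish-us", "EU": "phish-eu", "ASIA": "phish-asia"}

def index_for(region: str) -> str:
    """Route a phishing report to its regional index.

    Region codes are normalized to upper case; unknown regions are
    rejected rather than silently written to a default index.
    """
    try:
        return REGION_INDEX[region.upper()]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}")
```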

Data Privacy: All data will be synthetic and anonymized. No personally identifiable information (PII) will be collected or stored.

3. Methodology

Technique 1 - Sharding and Replication in Elasticsearch

  • Scope: To partition phishing reports by region and replicate them across nodes in the cluster.
  • Approach: We will have independent, region-based indices such as phish-us, phish-eu, and phish-asia. Each index will be configured with one primary shard and at least one replica shard. We will then induce a regional node failure to observe how the data remains available.
  • Metrics: We will periodically measure replication lag between nodes, query latency for typical operations, and shard recovery time after a failure. [6]
  • Desired Outcome: To demonstrate distributed data redundancy and fault tolerance.

Technique 2 - Distributed Querying and Aggregation

  • Scope: To run complex queries across multiple nodes and geographies to create global insights.
  • Approach:
    • With Kibana, we will run cross-index queries (e.g., all phishing reports with risk_score > 0.8 in the last hour).
    • We also intend to visualize time-series aggregations and geospatial heatmaps.
  • Metrics: The main metrics will be query latency (p50 and p95 response times) and overall aggregation efficiency. We will also measure dashboard update latency and visualization response time on the UI side to ensure a seamless user experience.
  • Desired Outcome: To show that our distributed architecture can provide actionable real-time insights with minimal latency.
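The example cross-index query above can be sketched in Elasticsearch's query DSL; as a hedged illustration (the function name is an assumption, and the `now-1h` date-math and terms aggregation are standard DSL constructs), a body like this could be sent to `phish-*`:

```python
def high_risk_last_hour(threshold: float = 0.8) -> dict:
    """Query DSL body: reports above a risk threshold in the last hour,
    with a per-region breakdown (a cross-node aggregation)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"risk_score": {"gt": threshold}}},
                    {"range": {"timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        # Terms aggregation: count matching reports per region.
        "aggs": {"by_region": {"terms": {"field": "region"}}},
    }

query = high_risk_last_hour()
```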

Technique 3 - Fault Tolerance Testing

  • Scope: Systematically analyze the recovery behavior and resilience of the system when a sudden node failure occurs.
  • Approach:
    • We will intentionally take one of the nodes in our Elasticsearch cluster offline to create a failure event.
    • During the downtime, we will keep submitting fresh reports to verify that no data is lost, and we will observe the cluster's automatic self-recovery as it continues serving from replica shards.
  • Metrics: We will measure the total cluster recovery time, the overall system uptime percentage, and the error rate for requests made during the failure event. We will also estimate the shard reallocation efficiency and check for zero data loss. [6]
  • Desired Outcome: To validate the system's resilience and its automated self-healing capabilities in a distributed setup.
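A helper for computing the metrics named above from recorded request outcomes might look like this (the function and its input shape are illustrative; it takes (timestamp, success) pairs logged while the node is down and approximates the recovery window as the span of failed requests):

```python
def failure_metrics(results: list[tuple[float, bool]]) -> dict:
    """Compute error rate and an approximate recovery window.

    results: (timestamp_seconds, request_succeeded) pairs collected
    during a fault-tolerance test run.
    """
    total = len(results)
    failure_times = [t for t, ok in results if not ok]
    error_rate = len(failure_times) / total if total else 0.0
    # Approximate recovery time: span from first to last failed request.
    recovery = (max(failure_times) - min(failure_times)) if failure_times else 0.0
    return {"error_rate": error_rate, "recovery_time_s": recovery}
```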

Technique 4 - Horizontal Scalability Evaluation

  • Scope: To show that adding more nodes to the cluster increases overall system performance.
  • Approach: To showcase this, we will add a fourth node to our existing 3-node cluster using Docker Compose. The cluster will automatically rebalance its shards across the four nodes. Once rebalancing completes, we will re-run our ingestion pipeline and tests to measure the performance improvement.
  • Metrics: We will measure the effect of horizontal scaling as an increase in indexing throughput (reports/sec) and a reduction in query latency [6].
  • Desired Outcome: Through this technique, we want to show that our system is capable of horizontal scaling in the face of unexpected increases in data inflow and user load, showcasing the advantages of distributed database systems.

4.2 Expected Outcomes

  1. Real-Time Threat Detection and Global Dissemination
    • Achieve an average ingestion latency of less than 200ms for phishing reports submitted via the browser extensions.
    • Distribute new threats worldwide within seconds through cross-region Elasticsearch replication.
    • Enable users and dashboards to query new phishing data immediately, demonstrating awareness of emerging attacks.
    • Verify that when a phishing URL is identified in one region, it becomes immediately visible to all other regions without manual updates.
  2. Scalable, Distributed Intelligence Network
    • Support up to 5000 phishing reports per second through the FastAPI ingestion layer.
    • Scale easily by adding nodes or regions to the Elasticsearch infrastructure without downtime or major reconfiguration.
    • Illustrate balanced data distribution and query efficiency by partitioning ES indices region-wise (e.g., phish-us, phish-eu, phish-asia).
    • Demonstrate sharding and automatic rebalancing during node addition, failure, and recovery [7].
  3. Fault Tolerance and Continuous Operations
    • To provide uninterrupted operations during node or cluster outages, using an existing Elasticsearch installation with built-in replication and quorum-based consistency.
    • Show zero data loss during simulated failures with a replication factor of 2 (RF=2).
    • Demonstrate automatic recovery in under 60s when failed nodes rejoin the cluster.
    • Demonstrate failover and recovery behavior using metrics and logs from Kibana.
  4. Consistency of Data and Reliable Queries
    • Ensure accurate and up-to-date phishing data across all replicas and regions.
    • Show eventual consistency in distributed replication without sacrificing read availability.
    • Execute cross-region search and aggregation queries that demonstrate correctness and responsiveness.
    • Track query latency (p50, p95) and keep dashboards current even while replication is in progress.
  5. Observability and Threat Analysis
    • Provide rich, real-time Kibana dashboards including:
      • Timelines of incoming phishing alerts and detection times.
      • Geospatial heatmaps of phishing-affected locations by region.
      • Time-series charts showing phishing trends and spikes.
      • Brand impersonation analytics based on domain names and content patterns.
    • Expose cluster health metrics covering indexing rate, replication lag, and query latency.
    • Visualize performance and stability graphs showing metrics before, during, and after each node failure.
  6. Illustration of Distributed Databases Principles
    • We will deliver a working prototype that demonstrates distributed database functionality end-to-end, highlighting:
      • Partitioning, replication, fault tolerance, and efficient distributed querying.
      • A discussion of latency, consistency, and availability under a variety of configurations or settings.
      • Application of this concept in a real-world cybersecurity scenario.
