Ushnesha/PhishNChips

PhishNChips - Distributed Phishing Intelligence Platform

MIT License · Python 3.11 · Elasticsearch 8.11

A geo-distributed phishing intelligence network demonstrating distributed database concepts using Elasticsearch and Kibana. The system enables real-time threat detection and global dissemination of phishing intelligence through browser extensions and a distributed backend.

🚀 Quick Start

# Start the cluster
make quick-start

# Or manually:
docker-compose up -d
python3 scripts/setup_elasticsearch.py
python3 scripts/generate_data.py --count 1000 --send --balanced

Access Points:

📋 Project Overview

Features

  • Distributed 3-Node Elasticsearch Cluster (US, EU, Asia regions)
  • FastAPI Backend with real-time ingestion and querying
  • Browser Extension for phishing detection and reporting
  • Kibana Dashboards for real-time threat visualization
  • Automated Testing for fault tolerance and scalability
  • Synthetic Data Generator for testing with 100K-1M records
  • Comprehensive Documentation and deployment guides

Distributed Database Concepts Demonstrated

  • Sharding: 2 primary shards per index, distributed across nodes
  • Replication: RF=2, automatic failover on node failure
  • Distributed Queries: cross-region aggregations and searches
  • Fault Tolerance: survives a single node failure with <60s recovery
  • Horizontal Scalability: nodes can be added dynamically, with automatic rebalancing
  • Consistency Models: quorum-based writes, eventual-consistency reads
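As a hedged sketch of how these concepts translate into concrete index settings (the helper name is illustrative; the shard and replica counts come from the list above, and `index.write.wait_for_active_shards` is one Elasticsearch setting for requesting quorum-style write safety):

```python
def make_index_settings(primary_shards: int = 2, replicas: int = 2) -> dict:
    """Build the settings body for a regional phishing index.

    Two primary shards spread data across nodes (sharding); two replicas
    (RF=2) keep copies of each shard on other nodes (replication/failover).
    """
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
            # Quorum-style write safety: require multiple active shard copies.
            "index.write.wait_for_active_shards": "2",
        }
    }

settings = make_index_settings()
```

A body like this would be passed when creating each regional index, e.g. via the Elasticsearch Python client's `indices.create` call.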

📚 Documentation

🏗️ Architecture

Browser Extensions (US, EU, ASIA)
        ↓ HTTP POST
    FastAPI Service
        ↓ Index
Elasticsearch Cluster (3 nodes, cross-region replication)
        ↓ Query
Kibana Dashboard

See ARCHITECTURE.md for detailed architecture diagrams and explanations.

🛠️ Technology Stack

  • Backend: Python 3.11, FastAPI
  • Database: Elasticsearch 8.11.0 (distributed cluster)
  • Visualization: Kibana 8.11.0
  • Frontend: Chrome Extension (JavaScript)
  • Containerization: Docker & Docker Compose
  • Testing: Python scripts with concurrent load testing

📦 Project Structure

Proj_app/
├── backend/                  # FastAPI application
│   ├── main.py              # API endpoints
│   ├── config.py            # Configuration
│   ├── Dockerfile           # API container
│   └── requirements.txt     # Python dependencies
├── browser-extension/        # Chrome extension
│   ├── manifest.json        # Extension config
│   ├── background.js        # Detection logic
│   ├── popup.html/js        # User interface
│   └── content.js           # Page analysis
├── scripts/                  # Utility scripts
│   ├── generate_data.py     # Synthetic data generator
│   ├── test_fault_tolerance.py  # Fault tolerance tests
│   ├── test_scalability.py  # Scalability tests
│   └── setup_elasticsearch.py   # ES initialization
├── config/                   # Configuration files
│   ├── elasticsearch_template.json  # Index template
│   └── kibana_dashboards.ndjson     # Dashboard config
├── docker-compose.yml        # Container orchestration
├── Makefile                  # Command shortcuts
└── Documentation files       # Setup, architecture, deployment

🧪 Testing & Evaluation

Fault Tolerance Testing

Test cluster resilience under node failures:

make test-fault
# Tests each node failure, measures recovery time and data availability

Scalability Testing

Measure performance improvements with horizontal scaling:

make test-scale
# Tests throughput and latency with different cluster configurations

Data Generation

Generate synthetic phishing data:

# Generate 10,000 balanced reports
python3 scripts/generate_data.py --count 10000 --send --balanced

# Generate region-specific data
python3 scripts/generate_data.py --count 1000 --region US --send

📊 Performance Metrics

Based on testing with a 3-node cluster (8GB RAM, 4 CPUs per node):

  • Write throughput: ~800 reports/sec (concurrent)
  • Write latency (P95): 25ms
  • Query latency (P95): 40ms
  • Recovery time: <60s after node failure
  • Data capacity: tested with 1M records
  • Scalability: +30% throughput with a 4th node

🌟 Key Achievements

  • ✅ End-to-end distributed phishing intelligence platform
  • ✅ Demonstrates 6+ distributed database concepts
  • ✅ Handles 100K-1M records with low latency
  • ✅ Automatic failover with zero data loss
  • ✅ Real-time threat visualization
  • ✅ Production-ready architecture with monitoring

🔧 Development Commands

# Start all services
make start

# Setup Elasticsearch indices
make setup-es

# Generate test data
make generate-data

# Run fault tolerance tests
make test-fault

# View logs
make logs

# Stop services
make stop

# Clean up everything
make clean

🌐 Browser Extension

Install the PhishNChips extension to detect and report phishing sites:

  1. Generate icons: make icons
  2. Open Chrome → chrome://extensions/
  3. Enable "Developer mode"
  4. Click "Load unpacked" → Select browser-extension/ folder
  5. Visit any website to see the risk analysis

📝 API Endpoints

  • GET /: Health check
  • POST /report: Submit a phishing report
  • GET /threats: Query recent threats
  • GET /hotspots: Regional threat statistics
  • GET /cluster/health: Cluster health metrics
  • GET /stats: System statistics

Full API documentation: http://localhost:8000/docs
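As a minimal sketch of how a client might address these endpoints (the `PhishNChipsClient` class is hypothetical; only URL construction is shown, and real calls would use an HTTP library such as `requests` against a running cluster):

```python
from urllib.parse import urljoin

class PhishNChipsClient:
    """Tiny client sketch for the endpoints listed above."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        # Normalize so urljoin treats the base as a directory.
        self.base_url = base_url.rstrip("/") + "/"

    def endpoint(self, path: str) -> str:
        """Resolve an endpoint path against the API base URL."""
        return urljoin(self.base_url, path.lstrip("/"))

client = PhishNChipsClient()
report_url = client.endpoint("/report")    # POST target for new reports
threats_url = client.endpoint("/threats")  # GET recent threats
```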

🔐 Security Note

The current implementation is for development/demonstration only. For production:

  • Enable X-Pack Security with TLS
  • Implement API authentication
  • Use HTTPS everywhere
  • Enable audit logging
  • See DEPLOYMENT.md for security hardening

📖 Academic Context

This project demonstrates distributed database systems concepts including:

  • Partitioning & Sharding - Data distribution across nodes
  • Replication - Fault tolerance and availability
  • Consistency Models - CAP theorem tradeoffs
  • Distributed Queries - Cross-node aggregations
  • Fault Tolerance - Automatic recovery
  • Horizontal Scalability - Linear performance improvements

🤝 Contributing

This is an academic project for demonstration purposes. Suggestions and improvements welcome!

📄 License

MIT License - See LICENSE file for details

👥 Authors

Created as part of Distributed Database Systems course project.

🙏 Acknowledgments

  • Elasticsearch documentation and community
  • FastAPI framework
  • Chrome extension development guides
  • Distributed systems research papers

Original Project Proposal

1. Introduction

1.1. Background

In the contemporary digital age, phishing remains one of the most damaging cyber threats, with hundreds or thousands of new malicious websites and domains created every day. Most traditional detection systems depend on a centralized database that periodically updates threat signatures. This centralized setup leads to update delays, creates single points of failure, and leaves users in different geographies exposed to new phishing attacks unique to their locations. Distributed database systems offer an alternative: data is stored, replicated, and queried across multiple nodes and regions. Using technologies like Elasticsearch and Kibana, real-time browser security data can be ingested, analyzed, and visualized in a fault-tolerant and scalable manner [1].

PhishNChips envisions a distributed phishing intelligence network in which browsers in different regions act as sensors, sending notices of suspicious URLs to Elasticsearch clusters in their location. The clusters share and replicate threat intelligence globally, protecting users who would not otherwise be aware of a newly created threat. Through the PhishNChips project, we demonstrate how replication, partitioning, fault tolerance, and distributed querying can be leveraged in a real-world cybersecurity use case.

1.2. Problem Statement

Phishing campaigns evolve faster than traditional threat detection mechanisms. Browser extensions exist that can identify a malicious website on the local machine, but that information often stays in a silo or propagates globally only after a delay [2] [3] [4]. PhishNChips addresses this challenge by creating a distributed, real-time phishing intelligence platform built on Elasticsearch [1] [5]. Each node holds a repository of phishing reports from its local region and replicates those data sets globally, creating geo-redundant awareness. The system allows for scalable ingestion, fault-tolerant data replication, and live visualization through dashboards created in Kibana.

1.3. Objectives

  • To architect and deploy a geo-distributed phishing intelligence platform built on Elasticsearch’s multi-cluster capabilities.
  • To enable an end-to-end data pipeline, with a real-time threat reporting browser client and backend API for querying and data ingestion [1] [3] [4].
  • To ensure data consistency, high availability, and scalability in a distributed system, with special emphasis on cross-cluster replication and failover automation using millions of records.
  • To measure the system's performance, dependability, and latency in disseminating data under different conditions, such as simulated regional node failure and network partition.
  • To create a real-time visualization dashboard with Kibana presenting a global, composite picture of the phishing threats in an illustration of the power of distributed querying [1].

Deliverables:

  • Browser Extension & Data Collection
  • Distributed Elasticsearch Cluster Setup
  • API and Backend Services
  • Kibana Dashboards and Evaluation

2. System Architecture and Design

Fig. 1: High-level architecture of the system using Elasticsearch and Kibana

2.1 System Design

Data Collection: To identify suspicious URLs, browser extensions leverage well-defined heuristic or machine-learning (ML) scoring approaches (e.g., unusual domain patterns, or login forms present on a non-HTTPS page). For now, we confine the scope of the project to URL collection and threat detection.

Each browser extension sends a JSON report to a remote endpoint that looks something like the following:

{
  "url": "http://example-login.net",
  "risk_score": 0.92,
  "region": "US",
  "timestamp": "2025-10-20T14:00:00Z"
}
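A small validator for this report shape can be sketched as follows (the helper and the 0-to-1 risk-score range are illustrative assumptions; the field names and region codes come from this document):

```python
REQUIRED_FIELDS = {"url", "risk_score", "region", "timestamp"}
VALID_REGIONS = {"US", "EU", "ASIA"}  # regions named elsewhere in this document

def validate_report(report: dict) -> bool:
    """Check that a report matches the JSON shape shown above."""
    if not REQUIRED_FIELDS.issubset(report):
        return False
    # Assumed convention: risk_score is a probability-like value in [0, 1].
    if not 0.0 <= report["risk_score"] <= 1.0:
        return False
    return report["region"] in VALID_REGIONS

example = {
    "url": "http://example-login.net",
    "risk_score": 0.92,
    "region": "US",
    "timestamp": "2025-10-20T14:00:00Z",
}
```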

Data Ingestion Layer: A Python FastAPI service receives incoming reports and indexes them into its local Elasticsearch node. Each regional node (i.e., us-east, eu-central, asia-south) will store reports in region-tagged indices (i.e., phish-us, phish-eu, phish-asia).

Elasticsearch Cluster

  • Multi-node cluster with sharding and replication [7].
  • An index template specifies mappings for url, risk_score, region, and timestamp.
  • Automatic replication makes sure that the threat data from one region is available across nodes in other regions.
  • Elasticsearch's distributed query engine makes real-time global searches possible on all nodes.
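The index template mentioned above could be sketched as a composable-template body in the ES 8 style (the field names come from the proposal; the `phish-*` pattern matches the regional index names, while the specific type choices are assumptions):

```python
def make_phish_template() -> dict:
    """Sketch of an index template for the phishing report indices.

    keyword: exact-match fields (URLs, region codes); float and date
    cover the risk score and report timestamp respectively.
    """
    return {
        "index_patterns": ["phish-*"],  # matches phish-us, phish-eu, phish-asia
        "template": {
            "mappings": {
                "properties": {
                    "url": {"type": "keyword"},
                    "risk_score": {"type": "float"},
                    "region": {"type": "keyword"},
                    "timestamp": {"type": "date"},
                }
            }
        },
    }

template = make_phish_template()
```

A body like this would be registered once via `PUT _index_template/phish` so every regional index picks up the same mappings.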

Kibana Visualization: Kibana dashboards display:

  • A real-time indicator of phishing alerts
  • Geographic heatmaps of collected URLs
  • Time-series graphs of phishing occurrences over time
  • Brand impersonation detection and analysis

Replication and Fault Tolerance

  • Document shards are replicated across nodes and there is an automatic reallocation of shards to ensure continuity of operations if a node fails to respond.
  • The cluster provides near-real-time read/write consistency using quorum-based strategies [1].
  • The cluster will recover automatically after the node restarts.

API Service: The backend provides:

  • POST /report (phishing report submission)
  • GET /threats (recent threats retrieval)
  • GET /hotspots (aggregate of stats based on region)

2.2 Implementation Plan

Our implementation strategy is based on an up-to-date, containerized technology stack that allows us to realistically simulate a geo-distributed setup on local hardware [5]. We plan to do our development in logical phases, starting with the backend infrastructure and incrementally building out the data collection, visualization, and evaluation layers.

Programming Languages, Databases, and Tools

  • Backend and API: Our core API service will be implemented in Python 3.11, utilizing the high-performance FastAPI framework to implement the ingestion and query endpoints. This enables rapid development and asynchronous request handling, ideal for a real-time data pipeline [1].
  • Distributed Database: Elasticsearch will be our primary distributed database system. We will leverage its native support for sharding, replication, and distributed querying to store and manage the phishing report data [5].
  • Visualization and Analytics: Kibana will be our primary tool for creating interactive dashboards and visualizing the real-time threat intelligence data stored in Elasticsearch.
  • Containerization and Simulation: Docker Compose is also a central component of our solution. It will be used to define and run a multi-node, multi-region Elasticsearch cluster on a single host, providing a handy and manageable simulation of a real-world distributed environment.
  • Frontend Data Source: A simple Browser Extension with standard web technologies (JavaScript, HTML, CSS) will be developed to act as the client-side data sensor.
  • Phishing Detection Model: We plan to include a threat detection model within the extension, which might be implemented using a lightweight machine-learning model such as DistilBERT or a simpler heuristic-based approach [2].

Development Approach

The project will be implemented in a series of integrated steps:

  1. Distributed Cluster Setup: Much of the project’s effort goes into setting up the distributed environment. Using Docker Compose, we will roll out and configure a three-node Elasticsearch cluster. The setup will simulate regional nodes, with indices specified to segregate data from different geographic regions (e.g., phish-us, phish-eu, phish-asia).
  2. API Layer Development: Once the cluster is online, we will build the API layer in FastAPI. The service will be the single entry point for data ingestion, with one endpoint to ingest reports from the browser [3] [4]. The API will validate incoming data and take advantage of Elasticsearch's bulk indexing capability for efficient, high-throughput insertion.
  3. ML Model Deployment and Ingest Pipeline Configuration: We will load a pre-trained phishing detection model (e.g., a heuristic or an NLP model such as DistilBERT) directly into Elasticsearch [2]. An ingest pipeline will be configured to run inference with this model: when the API inserts a new URL into the cluster, the pipeline runs automatically, generates a risk_score, and appends it to the document before indexing.
  4. Client-Side Data Collection: In parallel, a simple Chrome extension will be created [2] [3] [4]. It will have a single function: to flag suspicious URLs (initially through simulation) and forward them as well-formed JSON to the API ingestion endpoint.
  5. Demonstration of Distributed Features: As data flows into the cluster, we will focus on testing and exhibiting its inherent distributed characteristics. Regional node failure will be emulated by shutting down a Docker container; in addition to Elasticsearch's internal metrics, our API will be used to verify the cluster's auto-failover and self-healing.
  6. Dashboard Configuration and Evaluation: Finally, we will connect Kibana to our cluster and install live visualization dashboards. In the final evaluation stage, we will capture major performance measurements such as replication lag, query latency, and overall fault tolerance across the test failures.

2.3 Data Strategy

Data Generation: Synthetic phishing data will be generated to simulate phishing detections across multiple geographic regions.

Nature of Data

  • Volume: 100k-1M simulated phishing reports.
  • Primary fields: url, risk_score, region, timestamp.
  • Indexing strategy: reports will be indexed by region for improved regional and cross-regional searching.
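The region-to-index routing implied by this strategy can be sketched as a small helper (the function name is illustrative; the index names appear throughout this document):

```python
REGION_INDEX = {"US": "phish-us", "EU": "phish-eu", "ASIA": "phish-asia"}

def index_for(region: str) -> str:
    """Route a phishing report to its regional index.

    Region codes are normalized to upper case; unknown regions are
    rejected rather than silently written to a default index.
    """
    try:
        return REGION_INDEX[region.upper()]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}")
```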

Data Privacy: All data will be synthetic and anonymized. No personally identifiable information (PII) will be collected or stored.

3. Methodology

Technique 1 - Sharding and Replication in Elasticsearch

  • Scope: To partition phishing reports by region and replicate them across nodes in the cluster.
  • Approach: We will have independent, region-based indices such as phish-us, phish-eu, and phish-asia. Each index will be configured with one primary shard and at least one replica shard. We will then induce a regional node failure to observe how the data remains available.
  • Metrics: We will periodically measure replication lag between nodes, query latency for typical operations, and shard recovery time after a failure. [6]
  • Desired Outcome: To demonstrate distributed data redundancy and fault tolerance.

Technique 2 - Distributed Querying and Aggregation

  • Scope: To run complex queries across multiple nodes and geographies to create global insights.
  • Approach:
    • With Kibana, we will run cross-index queries (e.g., all phishing reports with risk_score > 0.8 in the last hour).
    • We also intend to visualize time-series aggregations and geospatial heatmaps.
  • Metrics: The main metrics will be query latency (p50 and p95 response times) and overall aggregation efficiency. We will also measure dashboard update latency and visualization response time on the UI side to ensure a seamless user experience.
  • Desired Outcome: To show that our distributed architecture can provide actionable real-time insights with minimal latency.
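The example cross-index query above can be sketched in Elasticsearch's query DSL; as a hedged illustration (the function name is an assumption, and the `now-1h` date-math and terms aggregation are standard DSL constructs), a body like this could be sent to `phish-*`:

```python
def high_risk_last_hour(threshold: float = 0.8) -> dict:
    """Query DSL body: reports above a risk threshold in the last hour,
    with a per-region breakdown (a cross-node aggregation)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"risk_score": {"gt": threshold}}},
                    {"range": {"timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        # Terms aggregation: count matching reports per region.
        "aggs": {"by_region": {"terms": {"field": "region"}}},
    }

query = high_risk_last_hour()
```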

Technique 3 - Fault Tolerance Testing

  • Scope: Systematically analyze the recovery behavior and resilience of the system when a sudden node failure occurs.
  • Approach:
    • We will intentionally take one of the nodes in our Elasticsearch cluster offline to create a failure event.
    • During the downtime, we will keep submitting fresh reports to verify that no data is lost, and we will observe the cluster's automatic self-recovery as it continues serving from replica shards.
  • Metrics: We will measure the total cluster recovery time, the overall system uptime percentage, and the error rate for requests made during the failure event. We will also estimate the shard reallocation efficiency and check for zero data loss. [6]
  • Desired Outcome: To validate the system's resilience and its automated self-healing capabilities in a distributed setup.
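A helper for computing the metrics named above from recorded request outcomes might look like this (the function and its input shape are illustrative; it takes (timestamp, success) pairs logged while the node is down and approximates the recovery window as the span of failed requests):

```python
def failure_metrics(results: list[tuple[float, bool]]) -> dict:
    """Compute error rate and an approximate recovery window.

    results: (timestamp_seconds, request_succeeded) pairs collected
    during a fault-tolerance test run.
    """
    total = len(results)
    failure_times = [t for t, ok in results if not ok]
    error_rate = len(failure_times) / total if total else 0.0
    # Approximate recovery time: span from first to last failed request.
    recovery = (max(failure_times) - min(failure_times)) if failure_times else 0.0
    return {"error_rate": error_rate, "recovery_time_s": recovery}
```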

Technique 4 - Horizontal Scalability Evaluation

  • Scope: To show that adding more nodes to the cluster increases overall system performance.
  • Approach: To showcase this, we will add a fourth node to our existing 3-node cluster using Docker Compose. The cluster will automatically rebalance its shards across the four nodes. Once rebalancing completes, we will re-run our ingestion pipeline and tests to measure the performance improvement.
  • Metrics: We will measure the effect of horizontal scaling as an increase in indexing throughput (reports/sec) and a reduction in query latency [6].
  • Desired Outcome: Through this technique, we want to show that our system is capable of horizontal scaling in the face of unexpected increases in data inflow and user load, showcasing the advantages of distributed database systems.

4.2 Expected Outcomes

  1. Real-Time Threat Detection and Global Dissemination
    • Achieve an average ingestion latency of less than 200ms for phishing reports submitted via the browser extensions.
    • Distribute new threats worldwide within seconds through cross-region Elasticsearch replication.
    • Enable users and dashboards to query new phishing data immediately, demonstrating awareness of emerging attacks.
    • Verify that when a phishing URL is identified in one region, it becomes immediately visible to all other regions without manual updates.
  2. Scalable, Distributed Intelligence Network
    • Support up to 5000 phishing reports per second through the FastAPI ingestion layer.
    • Scale easily by adding nodes or regions to the Elasticsearch infrastructure without downtime or major reconfiguration.
    • Illustrate balanced data distribution and query efficiency by partitioning ES indices region-wise (e.g., phish-us, phish-eu, phish-asia).
    • Demonstrate sharding and automatic rebalancing during node addition, failure, and recovery [7].
  3. Fault Tolerance and Continuous Operations
    • To provide uninterrupted operations during node or cluster outages, using an existing Elasticsearch installation with built-in replication and quorum-based consistency.
    • Show zero data loss during simulated failures with a replication factor of 2 (RF=2).
    • Demonstrate automatic recovery in under 60s when failed nodes rejoin the cluster.
    • Demonstrate failover and recovery behavior using metrics and logs from Kibana.
  4. Consistency of Data and Reliable Queries
    • Ensure accurate and up-to-date phishing data across all replicas and regions.
    • Show eventual consistency in distributed replication without sacrificing read availability.
    • Execute cross-region search and aggregation queries that demonstrate correctness and responsiveness.
    • Track query latency (p50, p95) and keep dashboards current even while replication is in progress.
  5. Observability and Threat Analysis
    • Provide rich, real-time Kibana dashboards including:
      • Timelines of incoming phishing alerts and detection times.
      • Geospatial heatmaps of phishing-affected locations by region.
      • Time-series charts showing phishing trends and spikes.
      • Brand impersonation analytics based on domain names and content patterns.
    • Expose cluster health metrics covering indexing rate, replication lag, and query latency.
    • Visualize performance and stability graphs showing metrics before, during, and after each node failure.
  6. Illustration of Distributed Databases Principles
    • We will deliver a working prototype that demonstrates distributed database functionality end-to-end, highlighting:
      • Partitioning, replication, fault tolerance, and efficient distributed querying.
      • A discussion of latency, consistency, and availability under a variety of configurations or settings.
      • Application of this concept in a real-world cybersecurity scenario.
