A geo-distributed phishing intelligence network demonstrating distributed database concepts using Elasticsearch and Kibana. The system enables real-time threat detection and global dissemination of phishing intelligence through browser extensions and a distributed backend.
```bash
# Start the cluster
make quick-start

# Or manually:
docker-compose up -d
python3 scripts/setup_elasticsearch.py
python3 scripts/generate_data.py --count 1000 --send --balanced
```

Access Points:
- 🌐 Kibana Dashboard: http://localhost:5601
- 📊 API Docs: http://localhost:8000/docs
- 🔍 Elasticsearch: http://localhost:9200
- ✅ Distributed 3-Node Elasticsearch Cluster (US, EU, Asia regions)
- ✅ FastAPI Backend with real-time ingestion and querying
- ✅ Browser Extension for phishing detection and reporting
- ✅ Kibana Dashboards for real-time threat visualization
- ✅ Automated Testing for fault tolerance and scalability
- ✅ Synthetic Data Generator for testing with 100K-1M records
- ✅ Comprehensive Documentation and deployment guides
| Concept | Implementation |
|---|---|
| Sharding | 2 primary shards per index, distributed across nodes |
| Replication | RF=2, automatic failover on node failure |
| Distributed Queries | Cross-region aggregations and searches |
| Fault Tolerance | Survives single node failure with <60s recovery |
| Horizontal Scalability | Add nodes dynamically, automatic rebalancing |
| Consistency Models | Quorum-based writes, eventual consistency reads |
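The sharding and replication scheme in the table can be written down as an Elasticsearch index template. A minimal sketch in Python (this mirrors, but is not copied from, `config/elasticsearch_template.json`; RF=2 is read here as two copies of each shard, i.e. one replica per primary):

```python
# Sketch of the index template implied by the table above (an illustrative
# stand-in for config/elasticsearch_template.json, not a copy of it).
phish_template = {
    "index_patterns": ["phish-*"],
    "template": {
        "settings": {
            "number_of_shards": 2,    # 2 primary shards per index
            "number_of_replicas": 1,  # each primary gets 1 replica (2 copies total)
        },
        "mappings": {
            "properties": {
                "url": {"type": "keyword"},
                "risk_score": {"type": "float"},
                "region": {"type": "keyword"},
                "timestamp": {"type": "date"},
            }
        },
    },
}
# Registered once via: PUT _index_template/phish-template  (body = phish_template)
```

Registered once, a template like this applies automatically to every `phish-*` index the ingestion layer creates.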
- SETUP.md - Detailed setup instructions and troubleshooting
- ARCHITECTURE.md - System design and distributed concepts
- DEPLOYMENT.md - Production deployment and scaling guide
```
Browser Extensions (US, EU, ASIA)
        ↓ HTTP POST
FastAPI Service
        ↓ Index
Elasticsearch Cluster (3 nodes)
        ↓ Replication
Kibana Dashboard
```
See ARCHITECTURE.md for detailed architecture diagrams and explanations.
- Backend: Python 3.11, FastAPI
- Database: Elasticsearch 8.11.0 (distributed cluster)
- Visualization: Kibana 8.11.0
- Frontend: Chrome Extension (JavaScript)
- Containerization: Docker & Docker Compose
- Testing: Python scripts with concurrent load testing
```
Proj_app/
├── backend/                        # FastAPI application
│   ├── main.py                     # API endpoints
│   ├── config.py                   # Configuration
│   ├── Dockerfile                  # API container
│   └── requirements.txt            # Python dependencies
├── browser-extension/              # Chrome extension
│   ├── manifest.json               # Extension config
│   ├── background.js               # Detection logic
│   ├── popup.html/js               # User interface
│   └── content.js                  # Page analysis
├── scripts/                        # Utility scripts
│   ├── generate_data.py            # Synthetic data generator
│   ├── test_fault_tolerance.py     # Fault tolerance tests
│   ├── test_scalability.py         # Scalability tests
│   └── setup_elasticsearch.py      # ES initialization
├── config/                         # Configuration files
│   ├── elasticsearch_template.json # Index template
│   └── kibana_dashboards.ndjson    # Dashboard config
├── docker-compose.yml              # Container orchestration
├── Makefile                        # Command shortcuts
└── Documentation files             # Setup, architecture, deployment
```
Test cluster resilience under node failures:

```bash
make test-fault
# Tests each node failure, measures recovery time and data availability
```

Measure performance improvements with horizontal scaling:

```bash
make test-scale
# Tests throughput and latency with different cluster configurations
```

Generate synthetic phishing data:

```bash
# Generate 10,000 balanced reports
python3 scripts/generate_data.py --count 10000 --send --balanced

# Generate region-specific data
python3 scripts/generate_data.py --count 1000 --region US --send
```

Based on testing with a 3-node cluster (8GB RAM, 4 CPU per node):
| Metric | Value |
|---|---|
| Write Throughput | ~800 reports/sec (concurrent) |
| Write Latency (P95) | 25ms |
| Query Latency (P95) | 40ms |
| Recovery Time | <60s after node failure |
| Data Capacity | Tested with 1M records |
| Scalability | +30% throughput with 4th node |
- ✅ End-to-end distributed phishing intelligence platform
- ✅ Demonstrates 6+ distributed database concepts
- ✅ Handles 100K-1M records with low latency
- ✅ Automatic failover with zero data loss
- ✅ Real-time threat visualization
- ✅ Production-ready architecture with monitoring
```bash
# Start all services
make start

# Setup Elasticsearch indices
make setup-es

# Generate test data
make generate-data

# Run fault tolerance tests
make test-fault

# View logs
make logs

# Stop services
make stop

# Clean up everything
make clean
```

Install the PhishNChips extension to detect and report phishing sites:
- Generate icons: `make icons`
- Open Chrome → `chrome://extensions/`
- Enable "Developer mode"
- Click "Load unpacked" → Select the `browser-extension/` folder
- Visit any website to see the risk analysis
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/report` | POST | Submit phishing report |
| `/threats` | GET | Query recent threats |
| `/hotspots` | GET | Regional threat statistics |
| `/cluster/health` | GET | Cluster health metrics |
| `/stats` | GET | System statistics |
Full API documentation: http://localhost:8000/docs
Current implementation is for development/demonstration only. For production:
- Enable X-Pack Security with TLS
- Implement API authentication
- Use HTTPS everywhere
- Enable audit logging
- See DEPLOYMENT.md for security hardening
This project demonstrates distributed database systems concepts including:
- Partitioning & Sharding - Data distribution across nodes
- Replication - Fault tolerance and availability
- Consistency Models - CAP theorem tradeoffs
- Distributed Queries - Cross-node aggregations
- Fault Tolerance - Automatic recovery
- Horizontal Scalability - Linear performance improvements
This is an academic project for demonstration purposes. Suggestions and improvements welcome!
MIT License - See LICENSE file for details
Created as part of Distributed Database Systems course project.
- Elasticsearch documentation and community
- FastAPI framework
- Chrome extension development guides
- Distributed systems research papers
In the contemporary digital age, phishing remains one of the most damaging cyber threats, with hundreds or thousands of new malicious websites and domains created every single day. Most traditional detection systems depend on a centralized database that periodically updates threat signatures. This centralized setup often introduces delays, creates single points of failure, and leaves users in different geographies exposed to new phishing attacks unique to their locations. Distributed database systems offer an alternative: data storage, replication, and querying can be spread across multiple nodes and regions. Using technologies like Elasticsearch and Kibana, real-time browser security data can be ingested, analyzed, and visualized in a fault-tolerant and scalable manner. [1]

PhishNChips envisions a distributed phishing intelligence network in which browsers located in different regions act as sensors, sending notices of suspicious URLs to Elasticsearch clusters in their location. The clusters share and replicate threat intelligence globally, providing protection to users who would not otherwise be aware of a newly created threat. Through the PhishNChips project, we demonstrate how replication, partitioning, fault tolerance, and distributed querying can be leveraged in a real-world cybersecurity use case.

Phishing campaigns evolve faster than traditional threat detection mechanisms. Browser extensions exist that can identify a malicious website on the local machine, but that information often stays siloed or propagates globally only after a delay [2] [3] [4]. PhishNChips addresses this challenge by creating a distributed, real-time phishing intelligence platform built on Elasticsearch [1] [5]. Each node holds the phishing reports from its local region and replicates the data sets globally, creating geo-redundant awareness. The system provides scalable ingestion, fault-tolerant data replication, and live visualization through dashboards created in Kibana.
- To architect and deploy a geo-distributed phishing intelligence platform built on Elasticsearch's multi-cluster capabilities.
- To enable an end-to-end data pipeline, with a real-time threat reporting browser client and backend API for querying and data ingestion [1] [3] [4].
- To ensure data consistency, high availability, and scalability in a distributed system, with special emphasis on cross-cluster replication and failover automation using millions of records.
- To measure the system's performance, dependability, and latency in disseminating data under different conditions, such as simulated regional node failure and network partition.
- To create a real-time visualization dashboard with Kibana presenting a global, composite picture of phishing threats, illustrating the power of distributed querying [1].
Deliverables:
- Browser Extension & Data Collection
- Distributed Elasticsearch Cluster Setup
- API and Backend Services
- Kibana Dashboards and Evaluation
Fig. 1: High-level architecture of the proposed system using Elasticsearch and Kibana
**Data Collection.** To identify suspicious URLs, browser extensions leverage well-defined heuristic or machine-learning (ML) scoring approaches (e.g., unusual domain patterns, login forms on non-HTTPS pages). For now, we confine the project's scope to URL collection and threat detection.
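Such a heuristic scorer can be sketched as follows. The real extension scores in JavaScript; this Python version uses illustrative signals and weights, not the project's actual model:

```python
from urllib.parse import urlparse

# Illustrative heuristic risk scorer (assumed signals and weights, not the
# extension's actual model). Returns a score in [0, 1].
def heuristic_risk_score(url: str) -> float:
    parsed = urlparse(url)
    host = parsed.hostname or ""
    score = 0.0
    if parsed.scheme != "https":          # plain-HTTP pages are riskier
        score += 0.3
    if host.count("-") >= 2:              # unusual hyphenated domain patterns
        score += 0.2
    if any(kw in url.lower() for kw in ("login", "verify", "secure", "account")):
        score += 0.3                      # credential-bait keywords in the URL
    if len(host.split(".")) >= 4:         # deeply nested subdomains
        score += 0.2
    return min(round(score, 2), 1.0)

print(heuristic_risk_score("http://example-login.net"))  # → 0.6
```

A weighted sum like this is cheap enough to run on every page load; a heavier ML model would only be consulted for borderline scores.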
Each browser extension sends a JSON report to a remote endpoint that looks something like the following:
```json
{
  "url": "http://example-login.net",
  "risk_score": 0.92,
  "region": "US",
  "timestamp": "2025-10-20T14:00:00Z"
}
```

**Data Ingestion Layer.** A Python FastAPI service receives incoming reports and indexes them into its local Elasticsearch node. Each regional node (i.e., us-east, eu-central, asia-south) stores reports in region-tagged indices (i.e., phish-us, phish-eu, phish-asia).
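The region-to-index routing described here is compact enough to sketch. A stdlib-only version with assumed helper names (the real service wraps this logic in FastAPI endpoints and then indexes the document into the local Elasticsearch node):

```python
# Minimal sketch of the ingestion layer's validation/routing logic
# (function and constant names are assumptions for illustration).
REGION_INDEX = {"US": "phish-us", "EU": "phish-eu", "ASIA": "phish-asia"}

def route_report(report: dict) -> tuple[str, dict]:
    """Validate a phishing report and pick the region-tagged index."""
    required = {"url", "risk_score", "region", "timestamp"}
    missing = required - report.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not 0.0 <= report["risk_score"] <= 1.0:
        raise ValueError("risk_score must be in [0, 1]")
    index = REGION_INDEX.get(report["region"].upper())
    if index is None:
        raise ValueError(f"unknown region: {report['region']}")
    return index, report

index, doc = route_report({
    "url": "http://example-login.net",
    "risk_score": 0.92,
    "region": "US",
    "timestamp": "2025-10-20T14:00:00Z",
})
print(index)  # phish-us
```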
**Elasticsearch Cluster**
- Multi-node cluster with sharding and replication [7].
- An index template specifies mappings for url, risk_score, region, and timestamp.
- Automatic replication makes sure that the threat data from one region is available across nodes in other regions.
- Elasticsearch's distributed query engine makes real-time global searches possible on all nodes.
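A distributed query of this kind can be sketched as a plain search body sent to the `phish-*` index pattern; the coordinating node fans the request out to the shards on every node and merges the partial results. Field names follow the index template; the `by_region` aggregation name is an illustrative assumption:

```python
# Sketch of a global search across all regional indices (phish-*):
# recent high-risk reports, counted per region.
global_threat_query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"risk_score": {"gte": 0.8}}},      # high-risk only
                {"range": {"timestamp": {"gte": "now-1h"}}},  # last hour
            ]
        }
    },
    "aggs": {
        "by_region": {"terms": {"field": "region"}}  # per-region counts
    },
    "size": 100,
}
# Would be sent as: GET phish-*/_search  (body = global_threat_query)
```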
**Kibana Visualization.** Kibana dashboards display:
- A real-time indicator of phishing alerts
- Geographic heatmaps of collected URLs
- Time-series graphs of phishing occurrences over time
- Brand impersonation detection and analysis
**Replication and Fault Tolerance**
- Document shards are replicated across nodes and there is an automatic reallocation of shards to ensure continuity of operations if a node fails to respond.
- The cluster provides near-real-time read/write consistency using quorum-based strategies [1].
- The cluster will recover automatically after the node restarts.
**API Service.** The backend provides:

- `POST /report` (phishing report submission)
- `GET /threats` (recent threats retrieval)
- `GET /hotspots` (regional statistics aggregation)
Our implementation strategy is based on an up-to-date, containerized technology stack that allows us to realistically simulate a geo-distributed setup on local hardware [5]. We plan to do our development in logical phases, starting with the backend infrastructure and incrementally building out the data collection, visualization, and evaluation layers.
- Backend and API: Our core API service will be implemented in Python 3.11, utilizing the high-performance FastAPI framework to implement the ingestion and query endpoints. This enables rapid development and asynchronous request handling, ideal for a real-time data pipeline [1].
- Distributed Database: Elasticsearch will be our primary distributed database system. We will leverage its native support for sharding, replication, and distributed querying to store and manage the phishing report data [5].
- Visualization and Analytics: Kibana will be our primary tool for creating interactive dashboards and visualizing the real-time threat intelligence data stored in Elasticsearch.
- Containerization and Simulation: Docker Compose is also a central component of our solution. It will be used to define and run a multi-node, multi-region Elasticsearch cluster on a single host, providing a handy and manageable simulation of a real-world distributed environment.
- Frontend Data Source: A simple Browser Extension with standard web technologies (JavaScript, HTML, CSS) will be developed to act as the client-side data sensor.
- Phishing Detection Model: We plan to include a threat detection model within the extension, which might be implemented using a lightweight machine-learning model like DistilBERT or a simpler heuristic-based approach [2].
The project will be implemented in a series of integrated steps:
- Distributed Cluster Setup: A large share of the project's effort goes into setting up the distributed environment. Using Docker Compose, we will deploy and configure a three-node Elasticsearch cluster. The setup will simulate regional nodes, with dedicated indices segregating data by geographic region (e.g., phish-us, phish-eu, phish-asia).
- API Layer Development: Once the cluster is online, we will build the API layer in FastAPI. The service will be the single entry point for data ingestion, with one endpoint to receive reports from the browser [3] [4]. The API will validate the incoming data and take advantage of Elasticsearch's bulk indexing capability for efficient, high-throughput insertion.
- ML Model Deployment and Ingest Pipeline Configuration: We will load a pre-trained phishing detection model (e.g., a heuristic or an NLP-based model like DistilBERT) directly into Elasticsearch [2]. An ingest pipeline will be configured to run inference with this model: when the API inserts a new URL into the cluster, the pipeline runs automatically, generates a risk_score, and appends it to the document before indexing.
- Client-Side Data Collection: In parallel, a simple Chrome extension will be created [2] [3] [4]. It will have a single function: to flag (initially, through simulation) dubious URLs and forward them as well-formed JSON to the API ingestion endpoint.
- Demonstration of Distributed Features: As data flows into the cluster, we will focus on testing and exhibiting its inherent distributed characteristics. A regional node failure will be emulated by shutting down its Docker container; in addition to Elasticsearch's internal metrics, our API will be used to verify the cluster's auto-failover and self-healing.
- Dashboard Configuration and Evaluation: Finally, we will connect Kibana to the cluster and install live visualization dashboards. In the final evaluation stage, we will capture key performance measurements, such as replication lag, query latency, and overall fault tolerance across the test failures.
**Data Generation.** Synthetic phishing data will be created to exercise detection across multiple geographic regions.
**Nature of Data**
- Volume: 100k-1M simulated phishing reports.
- Primary fields: url, risk_score, region, timestamp.
- Indexing strategy: reports will be indexed by region to improve both regional and cross-regional searching.
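A balanced generator matching these fields can be sketched as follows (an illustrative stand-in for `scripts/generate_data.py`; the domain words and score range are assumptions):

```python
import random
from datetime import datetime, timedelta, timezone

# Illustrative sketch of balanced synthetic report generation (the real
# logic lives in scripts/generate_data.py; domains and ranges are assumptions).
REGIONS = ["US", "EU", "ASIA"]
SUSPICIOUS_WORDS = ["login", "verify", "secure", "update"]

def generate_reports(count: int, balanced: bool = True) -> list[dict]:
    now = datetime.now(timezone.utc)
    reports = []
    for i in range(count):
        # --balanced cycles regions evenly; otherwise pick at random
        region = REGIONS[i % len(REGIONS)] if balanced else random.choice(REGIONS)
        reports.append({
            "url": f"http://{random.choice(SUSPICIOUS_WORDS)}-site{i}.example",
            "risk_score": round(random.uniform(0.5, 1.0), 2),
            "region": region,
            "timestamp": (now - timedelta(minutes=random.randint(0, 60))).isoformat(),
        })
    return reports

batch = generate_reports(9)
print({r["region"] for r in batch})  # balanced mode covers all three regions
```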
**Data Privacy.** All data will be synthetic and anonymized. No personally identifiable information (PII) will be obtained or stored.
- Scope: To partition phishing reports by region and replicate them across nodes in the cluster.
- Approach: We will create independent, region-based indices (`phish-us`, `phish-eu`, `phish-asia`). Each index will be configured with one primary shard and at least one replica shard. We will then induce a regional node failure to observe how the data remains available.
- Metrics: We will periodically measure replication lag between nodes, query latency under typical operation, and shard recovery time after a failure. [6]
- Desired Outcome: To demonstrate distributed data redundancy and fault tolerance.
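Recovery time in this experiment can be derived from a log of polled cluster-health samples. A sketch (in the real test the status would come from `GET /_cluster/health` on a surviving node; the sample log here is illustrative):

```python
# Derive recovery time from a log of polled cluster-health samples:
# each entry is (seconds since failure injection, reported cluster status).
# In the real test, status comes from GET /_cluster/health on a surviving node.
def recovery_seconds(samples: list[tuple[float, str]]) -> float:
    """Seconds until the cluster first leaves 'red' after the failure."""
    for t, status in samples:
        if status in ("yellow", "green"):  # replicas promoted; reads/writes succeed
            return t
    raise RuntimeError("cluster did not recover within the observation window")

health_log = [(5, "red"), (15, "red"), (30, "yellow"), (55, "green")]
print(recovery_seconds(health_log))  # → 30
```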
- Scope: To run complex queries across multiple nodes and geographies to create global insights.
- Approach:
  - With Kibana, we will run cross-index queries (e.g., retrieve all phishing reports with `risk_score` > 0.8 from the last hour).
  - We also intend to visualize time-series aggregations and geospatial heatmaps.
- Metrics: The main metrics will be query latency (p50 and p95 response times) and overall aggregation efficiency. We will also measure dashboard update latency and visualization response time on the UI side to ensure a seamless user experience.
- Desired Outcome: To show that our distributed architecture can provide actionable real-time insights with minimal latency.
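The p50/p95 figures can be computed from raw per-query timings with the standard library; a minimal sketch:

```python
import statistics

# Compute p50/p95 latency from raw per-query timing samples (milliseconds).
# statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
def latency_percentiles(samples_ms: list[float]) -> dict:
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94]}

samples = [10, 12, 15, 18, 20, 22, 25, 30, 38, 40]  # illustrative timings
print(latency_percentiles(samples))
```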
- Scope: Systematically analyze the recovery behavior and resilience of the system when a sudden node failure occurs.
- Approach:
- We will intentionally take one of the nodes in our Elasticsearch cluster offline to create a failure event.
- During the downtime, we will keep submitting fresh reports to verify there is no data loss, and observe the cluster's automatic self-recovery as it continues serving from replica shards.
- Metrics: We will measure the total cluster recovery time, the overall system uptime percentage, and the error rate for requests made during the failure event. Furthermore, we will also estimate the Shard Reallocation Efficiency and check for zero data loss. [6]
- Desired Outcome: To validate the system's resilience and its automated self-healing capabilities in a distributed setup.
- Scope: To show that adding more nodes to the cluster increases overall system performance.
- Approach: To showcase this, we will add a fourth node to the existing 3-node cluster using Docker Compose. The cluster will automatically rebalance its shards across the four nodes; once this is done, we will re-run our ingestion pipeline and tests to measure the performance improvement.
- Metrics: We will measure the effect of horizontal scalability with respect to the increase in Indexing Throughput (reports/sec) and a reduction in Query Latency [6].
- Desired Outcome: Through this technique, we want to show that our system can scale horizontally under unexpected increases in data inflow and user load, showcasing the advantages of distributed database systems.
- Real-Time Threat Detection and Global Dissemination
- Achieve an average ingestion latency of less than 200ms for phishing reports submitted via the browser extensions.
- Achieve distribution of new threats worldwide within seconds through Elasticsearch replication between regions.
- Enable the user and dashboard to query new phishing data immediately, demonstrating that they are aware of emerging attacks.
- Verify that when a phishing URL has been identified in one region, it is immediately visible to every other region without manual updates.
- Scalable, Distributed Intelligence Network
- To support up to 5000 phishing reports per second using the FastAPI ingestion layer.
- Easily scale by adding nodes or regions to the Elasticsearch infrastructure without downtime or major reconfiguration.
- To illustrate balanced data distribution and query efficiency by partitioning ES indices region-wise (e.g., `phish-us`, `phish-eu`, `phish-asia`).
- To demonstrate sharding and automatic rebalancing during node addition, failure, and recovery [7].
- Fault Tolerance and Continuous Operations
- To provide uninterrupted operations during node or cluster outages, using an existing Elasticsearch installation with built-in replication and quorum-based consistency.
- Show zero data loss during simulated failures using replication factor (RF=2).
- Demonstrate automatic recovery time (<60s) when failed nodes are added back to the cluster.
- Demonstrate failover and recovery action using metrics and logs from Kibana.
- Consistency of Data and Reliable Queries
- Ensure accurate and up-to-date phishing data across all replicas and regions.
- Show eventual consistency in distributed replication without sacrificing read availability.
- Execute cross-region search and aggregation queries that demonstrate correctness and responsiveness.
- Track query latency (p50, p95) and keep the dashboards current even while replication is in progress.
- Observability and Threat Analysis
- Provide rich, real-time dashboards in Kibana including:
  - Incoming phishing alerts and detection times on timelines.
  - Geospatial heatmaps of phishing-threatened locations by region.
  - Time-series charts showing phishing trends and spikes.
  - Brand impersonation analytics covering domain names and content-based patterns.
- Share cluster health metrics covering indexing rate, replication lag, and query latency.
- Visualize performance and stability graphs showing metrics before, during, and after each node failure.
- Illustration of Distributed Database Principles
  - We will deliver a working prototype demonstrating distributed database functionality end-to-end, highlighting:
    - Partitioning, replication, fault tolerance, and efficient distributed querying.
    - A discussion of latency, consistency, and availability under a variety of configurations.
    - Application of these concepts to a real-world cybersecurity scenario.