Advanced Hybrid Phishing Detection System (Rule-Based + Deep Learning)
PhishNet is a full-stack cybersecurity tool that detects phishing URLs in real-time. It utilizes a Hybrid Architecture combining a high-speed whitelist engine with a custom Deep Learning Y-Network (Bi-LSTM + Dense layers) to analyze URL patterns and protect users from malicious attacks.
Scanning a phishing URL and getting real-time analysis with confidence score
- π§ Hybrid Detection Engine - Instantly validates trusted domains (Whitelist) before engaging AI.
- π€ Deep Learning Model - Custom "Y-Network" architecture (Bi-LSTM for text + Dense for metadata).
- β‘ Real-time Analysis - Millisecond-latency scanning.
- π¨ Cyberpunk UI - Modern interface built with React, TypeScript, and Tailwind CSS v4.
- π Explainable AI (XAI) - Generates a "Detailed Analysis Report" explaining why a URL was flagged (e.g., "Too many dots", "IP address usage").
- π Secure Sandbox - Analyzes URL strings lexically without visiting or executing the malicious site.
| Feature | PhishNet | Traditional Tools |
|---|---|---|
| Real-time Detection | β Instant | β Instant |
| Explainable Results | β Yes (3-5 reasons) | β Black box |
| Hybrid Approach | β Whitelist + AI | β AI only |
| No API Dependencies | β Self-contained | β Requires internet |
| Fast Processing | β <10ms per URL | |
| Privacy-Focused | β Local processing | β Sends URLs to servers |
- Python 3.10+
- FastAPI - High-performance Async API
- TensorFlow/Keras - Deep Learning Framework
- Scikit-learn - Feature Scaling & Preprocessing
- Joblib/Pickle - Artifact serialization
- React 18 (TypeScript)
- Vite - Next-gen frontend tooling
- Tailwind CSS v4 - Utility-first styling engine
- Lucide React - Modern iconography
- Axios - Promise-based HTTP client
PhishNet/
βββ backend/
β βββ app/
β β βββ main.py # Server entry point
β β βββ api/endpoints.py # /scan route
β β βββ services/
β β βββ feature_extractor.py # Calculates 12 lexical features
β β βββ predictor.py # Hybrid (Whitelist + AI) logic
β βββ models/ # Saved .keras model
β βββ data/
β β βββ processed/ # Saved Tokenizer & Scaler
β βββ training/ # Jupyter Notebooks for training
β
βββ frontend/
βββ src/
β βββ components/ # Scanner & ResultCard components
β βββ types.ts # TypeScript interfaces
βββ vite.config.ts
- Python 3.8+
- Node.js 18+
- (Optional) Docker & Docker Compose for containerized deployment
Docker containerizes both the backend and frontend for easy, reproducible deployment.
Prerequisites:
- Docker Desktop (includes Docker Compose)
Steps:
- From the project root directory:
docker compose up --build-
Access the application:
- Frontend:
http://localhost(port 80) - Backend API:
http://localhost:8000/api/scan(port 8000)
- Frontend:
-
To stop:
docker compose downWhat's Included:
compose.yaml- Orchestrates both servicesbackend/Dockerfile- Python 3.10-slim with FastAPIfrontend/Dockerfile- Node.js build β Nginx serving.dockerignorefiles - Optimize build context & image size
Cleanup (reclaim disk space):
docker compose down
docker system prune -a --volumesNavigate to the backend directory and set up the Python environment:
cd backend
# Create virtual environment
python -m venv .venv
# Activate it
# Windows:
.venv\Scripts\activate
# Mac/Linux:
source .venv/bin/activate
# Install dependencies
pip install tensorflow pandas numpy scikit-learn matplotlib seaborn fastapi "uvicorn[standard]" python-multipart joblib
# (Optional) Freeze dependencies for future use
pip freeze > requirements.txt
Start the Server:
uvicorn app.main:app --reload
*Server is running at: http://127.0.0.1:8000*
Open a new terminal and navigate to the frontend directory:
cd frontend
# Install dependencies
npm install
# Start the Development Server
npm run dev
*Client is running at: http://localhost:5173*
- Open the application in your browser (
http://localhost:5173). - Enter a URL to analyze (e.g.,
http://secure-login-paypal-update.com). - Click SCAN.
- View the results:
- Risk Level: (Safe, Moderate, Critical)
- Confidence Score: AI certainty percentage.
- Analysis Report: Specific reasons for the flag.
The AI model utilizes a Multi-Input "Y-Network" Architecture:
- Branch 1 (Text Sequence): Character-level Embedding β Bi-Directional LSTM (Learns semantic patterns in the URL string).
- Branch 2 (Metadata): Dense Layers processing 12 calculated features (Length, IP presence, Dot count, Hyphen count, etc.).
- Fusion: Both branches are concatenated and passed through a final Dense layer with Sigmoid activation.
graph TD
A["π User Input<br/>Raw URL String"] --> B["π Feature Extraction<br/>12 Lexical Features"]
A --> C["π€ Tokenization<br/>Char-level Encoding<br/>Pad to 150"]
B --> D["βοΈ Scaling<br/>StandardScaler<br/>Mean=0, Std=1"]
C --> E["π Embedding<br/>5000β32 dims"]
E --> F["π Bi-Directional LSTM<br/>Read LβR & RβL<br/>Output: 64 dims"]
D --> G["π§ Dense Layer<br/>12β64β32 dims<br/>ReLU Activation"]
F --> H["π Concatenate<br/>64 + 32 = 96 dims"]
G --> H
H --> I["π§ Dense Layer<br/>96β64 dims<br/>+ Dropout 0.5"]
I --> J["π Sigmoid Output<br/>0 = Safe | 1 = Phishing<br/>Confidence: 0-100%"]
K["β
Whitelist<br/>Fast Pre-check"] -.->|"If matched"| L["Result: SAFE"]
J --> L
style A fill:#2563eb,stroke:#1e40af,color:#fff
style F fill:#dc2626,stroke:#991b1b,color:#fff
style G fill:#059669,stroke:#065f46,color:#fff
style H fill:#7c3aed,stroke:#6d28d9,color:#fff
style J fill:#d97706,stroke:#b45309,color:#fff
How data flows from the user's input to the final security verdict:
graph LR
subgraph Frontend["βοΈ FRONTEND<br/>React + TypeScript"]
A["Scanner Component<br/>URL Input Field"]
end
subgraph API["π BACKEND API<br/>FastAPI + Async"]
B["POST /api/scan<br/>Endpoint"]
C["Feature Extractor<br/>Calculate 12 Features"]
end
subgraph ML["π€ ML PIPELINE<br/>TensorFlow/Keras"]
D["Load Tokenizer<br/>tokenizer.pickle"]
E["Load Scaler<br/>scaler.joblib"]
F["Load Model<br/>phishnet_v1.keras"]
G["Predict<br/>Bi-LSTM + Dense"]
end
subgraph Whitelist["β
HYBRID ENGINE"]
H["Check Whitelist<br/>Known Safe Domains"]
end
subgraph Response["π RESPONSE"]
I["JSON Response<br/>is_phishing, confidence,<br/>risk_level, details"]
end
subgraph Display["π¨ DISPLAY"]
J["ResultCard<br/>Risk Badge + Report"]
end
A -->|"User enters URL"| B
B --> H
H -->|"Not in Whitelist"| C
C --> D
C --> E
D --> G
E --> G
F --> G
G --> I
H -->|"In Whitelist"| I
I --> J
style A fill:#3b82f6,stroke:#1e40af,color:#fff
style B fill:#ef4444,stroke:#991b1b,color:#fff
style G fill:#8b5cf6,stroke:#6d28d9,color:#fff
style I fill:#10b981,stroke:#047857,color:#fff
style J fill:#f59e0b,stroke:#b45309,color:#fff
- Training Split: 80% train, 20% test (with stratification)
- Model Optimization: Adam optimizer with binary crossentropy loss
- Early Stopping: Patience of 3 epochs to prevent overfitting
- Batch Size: 32 samples per batch
- Max Epochs: 20 (stopped early if no improvement)
The model analyzes these URL characteristics:
| # | Feature | Description | Why It Matters |
|---|---|---|---|
| 1 | length_url |
Total URL length | Phishing URLs are often longer than legitimate ones |
| 2 | length_hostname |
Hostname (domain) length | Suspicious domains tend to be longer |
| 3 | ip |
Uses IP address (0/1) | Indicates direct IP instead of domain name |
| 4 | nb_dots |
Number of dots | Many dots = suspicious subdomains |
| 5 | nb_hyphens |
Number of hyphens | Hyphens common in typosquatting attacks |
| 6 | nb_at |
"@" symbol count | Used to obscure destination domain |
| 7 | nb_qm |
"?" question marks | Indicates query parameters |
| 8 | nb_and |
"&" ampersand count | Multiple parameters (obfuscation) |
| 9 | nb_eq |
"=" equals count | Parameter assignment patterns |
| 10 | nb_slash |
"/" forward slashes | Path depth indicators |
| 11 | nb_colon |
":" colons | Protocol/port indicators |
| 12 | ratio_digits_url |
Digit to length ratio | Random number padding |
The model outputs a confidence score (0.0 to 1.0) that maps to risk levels:
| Risk Level | Confidence Range | Action |
|---|---|---|
| SAFE | 0% - 50% | β Low risk, safe to visit |
| MODERATE | 50% - 80% | |
| CRITICAL | 80% - 100% | π¨ High phishing probability, avoid |
Analyzes a URL for phishing indicators.
Request Body:
{
"url": "http://example-phish.com"
}Response:
{
"url": "http://example-phish.com",
"is_phishing": true,
"confidence_score": 0.98,
"risk_level": "CRITICAL",
"details": [
"β οΈ URL is suspiciously long",
"β οΈ Excessive number of dots detected",
"β οΈ Deep Learning pattern match"
]
}How data flows from the user's input to the final security verdict:
graph LR
subgraph Frontend["βοΈ FRONTEND<br/>React + TypeScript"]
A["Scanner Component<br/>URL Input Field"]
end
subgraph API["π BACKEND API<br/>FastAPI + Async"]
B["POST /api/scan<br/>Endpoint"]
C["Feature Extractor<br/>Calculate 12 Features"]
end
subgraph ML["π€ ML PIPELINE<br/>TensorFlow/Keras"]
D["Load Tokenizer<br/>tokenizer.pickle"]
E["Load Scaler<br/>scaler.joblib"]
F["Load Model<br/>phishnet_v1.keras"]
G["Predict<br/>Bi-LSTM + Dense"]
end
subgraph Whitelist["β
HYBRID ENGINE"]
H["Check Whitelist<br/>Known Safe Domains"]
end
subgraph Response["π RESPONSE"]
I["JSON Response<br/>is_phishing, confidence,<br/>risk_level, details"]
end
subgraph Display["π¨ DISPLAY"]
J["ResultCard<br/>Risk Badge + Report"]
end
A -->|"User enters URL"| B
B --> H
H -->|"Not in Whitelist"| C
C --> D
C --> E
D --> G
E --> G
F --> G
G --> I
H -->|"In Whitelist"| I
I --> J
style A fill:#3b82f6,stroke:#1e40af,color:#fff
style B fill:#ef4444,stroke:#991b1b,color:#fff
style G fill:#8b5cf6,stroke:#6d28d9,color:#fff
style I fill:#10b981,stroke:#047857,color:#fff
style J fill:#f59e0b,stroke:#b45309,color:#fff
PhishNet is also available as a Chrome/Firefox browser extension for real-time URL scanning while browsing.
Status: In Development
The extension will allow you to:
- Right-click on any link and check if it's phishing
- Display real-time security badges on search results
- Show confidence scores without leaving your browser
Want to retrain the model with your own dataset? Here's how:
cd backend
pip install jupyter-
Prepare Your Dataset
- Format: CSV with columns
urlandstatus(where status = "phishing" or "legitimate") - Place it in
backend/data/raw/phishing.csv - Minimum recommended: 5,000+ URLs (balanced classes)
- Format: CSV with columns
-
Run the Training Notebook
cd backend/training/notebooks jupyter notebook 01_eda_and_training.ipynb -
Execute All Cells (Shift + Enter)
- Cell 1: Load and explore data
- Cell 2: Clean and split
- Cell 3: Preprocessing (generates tokenizer + scaler)
- Cell 4: Build Y-Network architecture
- Cell 5: Train the model
-
New Artifacts Generated
backend/models/phishnet_v1.keras- Updated model weightsbackend/data/processed/tokenizer.pickle- Fitted tokenizerbackend/data/processed/scaler.joblib- Fitted scaler
After training, the notebook outputs validation metrics. To test on new URLs:
from app.services.predictor import PhishNetPredictor
predictor = PhishNetPredictor()
result = predictor.predict("https://suspicious-url.com")
print(result)Output:
{
"url": "https://suspicious-url.com",
"is_phishing": True,
"confidence_score": 0.87,
"risk_level": "CRITICAL",
"details": ["β οΈ Excessive number of dots...", "β οΈ Hyphens in domain..."]
}Contributions are welcome!
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature) - Commit your Changes (
git commit -m 'Add some AmazingFeature') - Push to the Branch (
git push origin feature/AmazingFeature) - Open a Pull Request
Distributed under the MIT License.



