Multi-sieve deep learning pipeline for malicious URL detection
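The Neural Sieve Cascade (NSC) is a three-stage malicious URL detection framework designed to balance speed and accuracy for real-time use. Instead of sending every URL to heavy models, NSC filters URLs progressively: a lightweight first sieve settles the clear-cut cases, and only the URLs it cannot classify confidently are passed on to the heavier models in the later sieves.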
| Sieve | Model Type | Purpose | URLs Handled (≈) |
|---|---|---|---|
| Sieve-1 | Logistic Regression (TF-IDF) | Fast filtering of clear benign/malicious URLs | 75% |
| Sieve-2 | CNN + LSTM + BiLSTM (Soft-Voting) | Handles structurally ambiguous / obfuscated URLs | 14% |
| Sieve-3 | TinyBERT | Resolves the hardest adversarial/phishing cases | 11% |
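The progressive filtering in the table can be sketched as a simple confidence-gated loop. This is a minimal illustration, not the notebook's code: `cascade_predict`, the `Stage` type, and the three toy sieves are hypothetical names, the 0.90 threshold is the acceptance confidence this README quotes for the first two sieves, and the sketch assumes the last sieve always gives the final answer.

```python
from typing import Callable, Sequence, Tuple

# Assumption: each stage maps a URL to (predicted_label, confidence in [0, 1]).
Stage = Callable[[str], Tuple[str, float]]


def cascade_predict(
    url: str, stages: Sequence[Stage], threshold: float = 0.90
) -> Tuple[str, int]:
    """Return (label, 1-based index of the sieve that made the decision)."""
    if not stages:
        raise ValueError("at least one stage is required")
    for index, stage in enumerate(stages, start=1):
        label, confidence = stage(url)
        # Accept a confident prediction; the final sieve always decides.
        if confidence >= threshold or index == len(stages):
            return label, index
    raise AssertionError("unreachable: the final sieve always returns")


# Toy stand-ins for the real models (hypothetical, for illustration only).
def toy_sieve1(url: str) -> Tuple[str, float]:
    return ("benign", 0.99) if url == "https://google.com" else ("benign", 0.55)


def toy_sieve2(url: str) -> Tuple[str, float]:
    return ("phishing", 0.95) if "paypa1" in url else ("benign", 0.60)


def toy_sieve3(url: str) -> Tuple[str, float]:
    return ("phishing", 0.70)


STAGES = [toy_sieve1, toy_sieve2, toy_sieve3]

if __name__ == "__main__":
    for u in ["https://google.com", "http://paypa1.com",
              "http://secure-login.bank.verify-pay.com"]:
        print(u, cascade_predict(u, STAGES))
```

Because each URL stops at the first sieve that is confident enough, the expensive models only run on the share of traffic the cheaper ones could not settle.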
Total URLs: 651,191
Classes: Benign, Defacement, Phishing, Malware
Example:
| url | label |
|---|---|
| http://secure-login.bank.verify-pay.com | phishing |
| https://google.com | benign |
Original source: Kaggle – malicious-urls-dataset.
- TF-IDF on character n-grams
- Ultra-fast inference; accepts predictions with ≥ 0.90 confidence
- CNN: catches lexical tricks (e.g., paypa1.com)
- LSTM: long-range token dependencies
- BiLSTM: forward + backward context
- Soft-voting; accepts with ≥ 0.90 confidence
- Handles adversarial/semantic manipulations
- Tuned to prioritize recall for phishing & malware
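The soft-voting step with a 0.90 acceptance threshold can be illustrated as follows. This is a sketch under assumptions: the three models are weighted equally (the README does not state the weights), the class order in `CLASSES` is arbitrary, and `soft_vote` and `accept_or_defer` are hypothetical helper names rather than functions from the notebook.

```python
from typing import List, Optional, Sequence

# Assumption: a fixed class order shared by all three models.
CLASSES = ["benign", "defacement", "phishing", "malware"]


def soft_vote(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (unweighted soft voting)."""
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]


def accept_or_defer(avg_probs: Sequence[float], threshold: float = 0.90) -> Optional[str]:
    """Return the winning class if it clears the threshold, else None (defer)."""
    best = max(range(len(avg_probs)), key=lambda c: avg_probs[c])
    return CLASSES[best] if avg_probs[best] >= threshold else None


# Made-up probabilities for one URL on which the three models agree.
agree = [
    [0.02, 0.01, 0.96, 0.01],  # CNN
    [0.05, 0.02, 0.91, 0.02],  # LSTM
    [0.03, 0.02, 0.93, 0.02],  # BiLSTM
]
# Made-up probabilities for a URL on which they disagree.
disagree = [
    [0.60, 0.10, 0.20, 0.10],
    [0.30, 0.10, 0.50, 0.10],
    [0.40, 0.10, 0.40, 0.10],
]

if __name__ == "__main__":
    print(accept_or_defer(soft_vote(agree)))     # accepted by the ensemble
    print(accept_or_defer(soft_vote(disagree)))  # None: deferred to the next sieve
```

Averaging probabilities rather than counting votes means one hesitant model can pull the ensemble below the threshold, which is what routes ambiguous URLs on to the final sieve.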
| Model | Accuracy (%) |
|---|---|
| Logistic Regression | 99.0 |
| CNN | 93.45 |
| LSTM | 90.48 |
| BiLSTM | 89.93 |
| TinyBERT | 94.86 |
| NSC (Final) | 97.28 |
Final Confusion Matrix
Per-class Metrics
| Class | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|
| Benign | 97 | 99 | 98 |
| Defacement | 99 | 99 | 99 |
| Malware | 99 | 93 | 96 |
| Phishing | 95 | 87 | 91 |
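As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the other two columns. The helper name `f1_score` is chosen here for illustration; the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) taken from the per-class table.
REPORTED = {
    "Benign": (97, 99, 98),
    "Defacement": (99, 99, 99),
    "Malware": (99, 93, 96),
    "Phishing": (95, 87, 91),
}

if __name__ == "__main__":
    for name, (p, r, f1) in REPORTED.items():
        print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {f1}")
```

All four recomputed values round to the reported F1 scores. The gap between precision and recall for Phishing (95 vs 87) is what pulls its F1 down to 91.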
# 1) Create env (optional)
python -m venv .venv
.\.venv\Scripts\activate # Windows PowerShell
# 2) Install
pip install -r requirements.txt
# 3) Run the notebook
jupyter notebook notebooks/NSC_Final.ipynb




