The phishnet.py script is a command-line utility for phishing email detection using natural language processing (NLP) and machine learning (ML). It combines preprocessing, simulation, classification, and secure data transfer in a single executable tool. Designed for a dual-VM virtual testbed, it automates much of the forensic and analytical workflow needed for large-scale phishing detection and experimentation. When executed, the script presents the user with three operational modes.

The first mode, Prepare Dataset, runs the complete preprocessing pipeline. It merges seven real-world phishing and ham datasets (including CEAS_08, Enron, Ling, Nigerian Fraud, and others) into a single corpus. The script removes HTML tags with BeautifulSoup, filters stopwords, and lemmatizes tokens with NLTK. The cleaned email bodies are then vectorized using TF-IDF, with the vocabulary capped at 5000 features to reduce the risk of overfitting. The cleaned dataset is saved as cleaned_emails.csv and then copied to the Defender VM over SCP, enabling a distributed analysis workflow across two isolated virtual machines.

The second mode, Simulate Phishing Email, demonstrates how phishing payloads might be crafted and transmitted in a controlled environment. It uses the swaks tool to emulate a phishing email sent from the Attacker VM (spoofed as yuvindeakin@gmail.com) to the Defender VM (deakindefender@gmail.com). Users are prompted for an email subject and body, so they can simulate realistic phishing content. The email is sent to localhost, so the behavior is demonstrated without an external mail server or internet access. Although no real email leaves the testbed, the simulation illustrates how phishing campaigns are initiated and how such traffic might appear during forensic analysis.
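The Simulate Phishing Email mode described above could be wired up roughly as follows. This is a minimal sketch, not the script's actual code: the function names `build_swaks_command` and `send_simulated_email` are hypothetical, and it assumes swaks is invoked with its `--to`, `--from`, `--server`, `--header`, and `--body` options against a mail listener on localhost.

```python
import shlex
import shutil
import subprocess

# Addresses taken from the description above; the function names are hypothetical.
SPOOFED_SENDER = "yuvindeakin@gmail.com"
DEFENDER_RECIPIENT = "deakindefender@gmail.com"


def build_swaks_command(subject, body, server="localhost"):
    """Build the swaks argument list for one simulated phishing email."""
    # Collapse line breaks so user input cannot inject extra headers.
    safe_subject = subject.replace("\r", " ").replace("\n", " ")
    return [
        "swaks",
        "--to", DEFENDER_RECIPIENT,
        "--from", SPOOFED_SENDER,
        "--server", server,
        "--header", f"Subject: {safe_subject}",
        "--body", body,
    ]


def send_simulated_email(subject, body):
    """Run swaks inside the testbed; assumes swaks is installed locally."""
    if shutil.which("swaks") is None:
        raise RuntimeError("swaks is not installed or not on PATH")
    # Passing a list (no shell) keeps the subject and body from being
    # interpreted as shell syntax.
    return subprocess.run(
        build_swaks_command(subject, body),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Dry run: print the command instead of sending anything.
    command = build_swaks_command("Account alert", "Please verify your login.")
    print(shlex.join(command))
```

Building the command as a list and printing it first makes it easy to inspect exactly what would be sent before running `send_simulated_email` inside the isolated testbed.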
The third and final mode, Train and Evaluate ML Models, applies supervised learning to the cleaned dataset. The script loads cleaned_emails.csv and uses scikit-learn to perform TF-IDF vectorization, followed by an 80/20 train-test split. Two classifiers, a Support Vector Machine (SVM) and a Random Forest (RF), are trained on the training portion. The script prints performance metrics for both models, including precision, recall, F1-score, and confusion matrices. In addition, a graphical confusion matrix for the Random Forest classifier is saved as rf_confusion_matrix.png in the project directory.

The environment requires the Python libraries pandas, scikit-learn, nltk, matplotlib, seaborn, and beautifulsoup4, along with the system tools swaks and openssh-server. The script also uses NLTK corpora, so stopwords and wordnet must be downloaded beforehand. Once configured, the script runs entirely from the terminal and consolidates practical ML workflows into an interactive, educational utility.

PhishNet+ was developed within a dual-VM architecture running under a pfSense-managed network. The Attacker VM (Kali 2024) prepares and sends data, while the Defender VM (Kali 2025) serves as the recipient and optional analysis node. This setup offers students, researchers, and practitioners a safe sandbox for running realistic phishing detection experiments.

In summary, the phishnet.py script packages a self-contained research and demonstration pipeline that turns theoretical phishing detection into an executable form. Whether used for forensic training, prototype development, or ML experimentation, PhishNet+ provides an extensible platform for exploring the real-world challenges of email-based threats.
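As a rough illustration of the Train and Evaluate ML Models mode described above, the sketch below runs TF-IDF vectorization, an 80/20 split, and both classifiers with scikit-learn. It is an assumption-laden example rather than the script's code: the function name `train_and_evaluate` is hypothetical, a tiny synthetic corpus stands in for cleaned_emails.csv (whose column names are not given here), the SVM kernel and forest size are illustrative choices, and the plot saved as rf_confusion_matrix.png is omitted. It also splits before fitting the vectorizer, which differs from the order described above but keeps test-set vocabulary out of training.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for cleaned_emails.csv (1 = phishing, 0 = ham).
texts = (
    [f"urgent verify your account password now ticket {i}" for i in range(10)]
    + [f"meeting agenda attached for project review week {i}" for i in range(10)]
)
labels = [1] * 10 + [0] * 10


def train_and_evaluate(texts, labels, seed=42):
    """Train an SVM and a Random Forest on TF-IDF features and score both."""
    # 80/20 split, stratified so both classes appear in the test set.
    train_text, test_text, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )

    # Vocabulary capped at 5000 features, as in the description above.
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train = vectorizer.fit_transform(train_text)
    X_test = vectorizer.transform(test_text)

    models = {
        "SVM": SVC(kernel="linear"),  # kernel choice is an assumption
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)
        results[name] = {
            # Text table with precision, recall, and F1-score per class.
            "report": classification_report(y_test, predicted, zero_division=0),
            # Rows are true labels, columns are predicted labels.
            "confusion_matrix": confusion_matrix(y_test, predicted, labels=[0, 1]),
        }
    return results


if __name__ == "__main__":
    for name, result in train_and_evaluate(texts, labels).items():
        print(name)
        print(result["report"])
        print(result["confusion_matrix"])
```

On real data, the two lists would be replaced by the text and label columns loaded from cleaned_emails.csv with pandas.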
YuvinD/SIT327-FinalProject