This is the implementation of the malware detection system proposed in the MSc thesis Semantics-aware Malware Detection using Natural Language Processing on behavioural reports analysis at the University of Southampton.
The aim of the project is to develop a generic malware detection tool that recognises malware based on behavioural reports generated during dynamic and static analysis of Android applications. The system implements conventional bag-of-words TF-IDF text vectorization along with Paragraph Vector models and language models based on the Transformer architecture, and it offers a comparison of the models' performance.
The following figure shows the system overview. Behavioural reports are
generated separately in the CuckooDroid sandbox and then passed to the
system for classification.

The system is implemented in Python 3.7.4 and uses the libraries specified in
requirements.txt. To install the dependencies, run:
pip3 install -r requirements.txt
To use the pre-trained models, go to the
link
and download the folders for the models you wish to use. Then unzip each one
and copy the whole folder into the saved_models/ directory.
Five pre-trained models are available:
| Folder name | Embedding | Classifiers |
|---|---|---|
| tfidf_... | TF-IDF | XGBoost, SVC |
| d2v_...* | Doc2Vec DM and DBOW models | XGBoost, SVC |
| bert_... | BERT | ** |
| roberta_... | RoBERTa | ** |
| distilbert_... | DistilBERT | ** |
* Download all d2v_... folders
** Classification head is part of the model
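As a quick check that the downloaded folders ended up in the right place, the following minimal sketch lists the folders in saved_models/ that match a given embedding. It is not part of the repository: the helper name `find_model_dirs` and the prefix map are assumptions built only from the folder-name prefixes in the table above, and the elided part of each name is deliberately left unmatched.

```python
from pathlib import Path

# Folder-name prefixes taken from the table above. The part after the
# underscore is elided there, so only the prefix is matched (assumption).
MODEL_PREFIXES = {
    "tfidf": "tfidf_",
    "doc2vec": "d2v_",
    "bert": "bert_",
    "roberta": "roberta_",
    "distilbert": "distilbert_",
}


def find_model_dirs(embedding, models_root="saved_models"):
    """Return the sorted sub-folders of models_root that match the embedding."""
    prefix = MODEL_PREFIXES[embedding]
    root = Path(models_root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.startswith(prefix)
    )
```

For Doc2Vec, more than one folder is expected in the result, since all d2v_... folders have to be downloaded.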
You can download the pre-generated dataset reports.json from
link.
Place the downloaded dataset in the datasets/ directory.
The dataset was assembled from the publicly available University of New Brunswick CICInvesAndMal2019 dataset and supplemented with benign apps from the Benign 2015 and Benign 2017 datasets (link).
You can also create your own dataset from behavioural reports using the
CuckooDroid sandbox and the downloader.py script. More details can be found in section 2.2.1.
To create a JSON dataset from behavioural reports for model training,
use the downloader.py script. The generated reports in reports_dir
must be split into subdirectories so that benign reports
are in different subdirectories than malware reports. The names of the
benign subdirectories need to be specified in downloader.py. The script
then pre-processes the reports and saves them to a JSON file that
can be used by the system. The following tree shows a possible
structure of reports_dir:
reports_dir
├── benign_1
├── benign_2
├── malware_1
└── malware_2
    ├── report_1
    └── report_2
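The labelling rule implied by this layout can be sketched as follows. This is an illustration under stated assumptions, not the code in downloader.py: the function name `label_reports`, the `BENIGN_DIRS` set and the 0/1 label encoding are hypothetical, and every entry inside a subdirectory is treated as one report.

```python
from pathlib import Path

# Hypothetical names: the real list of benign subdirectories is
# configured inside downloader.py.
BENIGN_DIRS = {"benign_1", "benign_2"}


def label_reports(reports_dir, benign_dirs=BENIGN_DIRS):
    """Yield (report_path, label) pairs; 0 = benign, 1 = malware (assumed encoding)."""
    for subdir in sorted(Path(reports_dir).iterdir()):
        if not subdir.is_dir():
            continue  # ignore stray files next to the subdirectories
        # Any subdirectory not listed as benign is treated as malware.
        label = 0 if subdir.name in benign_dirs else 1
        for report in sorted(subdir.iterdir()):
            yield report, label
```

Deriving the label from the parent directory name means no per-report annotation is needed, which is why the benign and malware reports have to be kept in separate subdirectories.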
The downloader.py script removes the virustotal and signatures attributes
from the behavioural reports because they contain classification results from
the VirusTotal API, which would leak label information and artificially improve the results of our detector.
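A minimal sketch of this filtering step is shown below. It assumes that virustotal and signatures are top-level keys of the report's JSON object, which the text does not state; the function name `strip_label_leaks` and the sample report are made up for illustration.

```python
import json

# Attributes named in the text above; assumed to be top-level keys.
LEAKY_KEYS = ("virustotal", "signatures")


def strip_label_leaks(report):
    """Return a copy of the report without the verdict-carrying attributes."""
    return {key: value for key, value in report.items() if key not in LEAKY_KEYS}


# Made-up report used only to demonstrate the call.
raw = '{"virustotal": {"positives": 40}, "signatures": [], "apkinfo": {}}'
cleaned = strip_label_leaks(json.loads(raw))
print(sorted(cleaned))  # prints ['apkinfo']
```

The function returns a new dictionary, so the original report stays untouched.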
The system works in two operation modes. The user can either train and evaluate the models on a dataset of behavioural reports or run a classification of a single report with one of the pre-trained models.
The train_evaluate_model.py script trains and evaluates models
on a dataset of behavioural reports. The absolute path to the dataset,
which must be in JSON format, is given by the
--dataset_path parameter. The embedding and classification
methods are selected with --embedding and --classifier. The --train_emb
and --train_cls parameters control whether new text embedding (TF-IDF,
Paragraph Vectors) and classification (SVM, XGBoost) models are trained or
whether already trained models are used for prediction on the test
dataset. To select one of the Transformer models, only the
--embedding and --train_emb parameters are needed. The following
table lists all parameters.
| Argument | Description | Value |
|---|---|---|
| --dataset_path | Absolute path to dataset of behavioural reports | <PATH> |
| --embedding | Requested embedding model | <tfidf,doc2vec,bert, roberta,distilbert> |
| --classifier | Requested classifier for TF-IDF and Doc2Vec embedding | <svc,xgb>* |
| --train_emb | Flag indicating whether to train a new model or use an already trained one from the saved_models/ directory | |
| --train_cls | Flag indicating whether to train a new classifier or use an already trained one from the saved_models/ directory | * |
* Applicable only for TF-IDF and Doc2Vec embedding
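The argument table can be mirrored with `argparse` roughly as follows. This is a sketch of an equivalent parser, not the one in train_evaluate_model.py: treating --train_emb and --train_cls as boolean switches and requiring --classifier for TF-IDF and Doc2Vec are assumptions, and the accepted classifier values are deliberately not enforced here.

```python
import argparse

EMBEDDINGS = ["tfidf", "doc2vec", "bert", "roberta", "distilbert"]
TRANSFORMERS = {"bert", "roberta", "distilbert"}


def build_parser():
    parser = argparse.ArgumentParser(description="Train and evaluate a model")
    parser.add_argument("--dataset_path", required=True,
                        help="absolute path to the JSON dataset of reports")
    parser.add_argument("--embedding", required=True, choices=EMBEDDINGS,
                        help="requested embedding model")
    parser.add_argument("--classifier",
                        help="classifier for the tfidf and doc2vec embeddings")
    # Assumed to be on/off switches, as in the example command.
    parser.add_argument("--train_emb", action="store_true",
                        help="train a new embedding model")
    parser.add_argument("--train_cls", action="store_true",
                        help="train a new classifier")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Transformer models carry their own classification head, so a separate
    # classifier is only demanded for the other embeddings (assumption).
    if args.embedding not in TRANSFORMERS and args.classifier is None:
        parser.error("--classifier is required for tfidf and doc2vec")
    return args
```

With this layout, omitting both switches means that already trained models from saved_models/ are used for prediction on the test dataset.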
EXAMPLE: To train a new XGBoost classifier with TF-IDF embedding, simply run the following command:
python3 train_evaluate_model.py --dataset_path datasets/reports.json --embedding tfidf --classifier xgboost --train_emb --train_cls
The predict_report.py script classifies
a single behavioural report. To run it, the saved_models/
directory must contain trained models.
The report's absolute path is supplied with the --report_path argument.
The embedding and classification methods are selected with the
--embedding and --classifier parameters. All arguments
are listed in the following table:
| Argument | Description | Value |
|---|---|---|
| --report_path | Absolute path to behavioural report | <PATH> |
| --embedding | Requested embedding model | <tfidf,doc2vec,bert, roberta,distilbert> |
| --classifier | Requested classifier for TF-IDF and Doc2Vec embedding | <svc,xgb>* |
* Applicable only for TF-IDF and Doc2Vec embedding
EXAMPLE: To classify a report with XGBoost classifier and TF-IDF embedding, simply run:
python3 predict_report.py --report_path report_samples/report_malware.json --embedding tfidf --classifier xgboost
The report_samples/ directory contains two pre-generated reports.
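Because predict_report.py handles one report per run, classifying several reports means invoking it repeatedly. The sketch below only builds the command line from the arguments documented above; the helper name `build_predict_command` is hypothetical, and the command is assumed to be run from the repository root.

```python
import sys
from pathlib import Path


def build_predict_command(report, embedding, classifier=None):
    """Build the argument list for one predict_report.py run."""
    cmd = [
        sys.executable, "predict_report.py",
        # The table asks for an absolute path, so resolve relative ones.
        "--report_path", str(Path(report).resolve()),
        "--embedding", embedding,
    ]
    # Transformer models need no separate classifier.
    if classifier is not None:
        cmd += ["--classifier", classifier]
    return cmd


# Possible use, one run per report:
#   import subprocess
#   for report in sorted(Path("report_samples").iterdir()):
#       subprocess.run(build_predict_command(report, "bert"), check=True)
```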