
Android Malware Detector based on the application of NLP models on behavioural reports


iblfilip/malware_detector


Semantics-aware Android Malware detector

This repository implements the malware detection system proposed in the MSc thesis "Semantics-aware Malware Detection using Natural Language Processing on behavioural reports analysis" at the University of Southampton.

1. Overview

The aim of the project is to develop a generic malware detection tool that recognises malware from behavioural reports generated during dynamic and static analysis of Android applications. The system implements conventional bag-of-words TF-IDF text vectorization alongside Paragraph Vector models and language models based on the Transformer architecture, and it compares the performance of these models.
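To make the bag-of-words TF-IDF idea concrete, here is a minimal pure-Python sketch. It is not the project's implementation: the unsmoothed weighting tf * log(N / df) is one common textbook variant (library implementations often add smoothing), and the tokens in the toy reports are hypothetical rather than taken from real CuckooDroid output.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency scaled by inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy "reports" with hypothetical tokens.
reports = [
    ["sendTextMessage", "getDeviceId", "openConnection"],
    ["openConnection", "getDeviceId"],
    ["openConnection", "readContacts"],
]
vectors = tfidf(reports)
```

A token that occurs in every report (here openConnection) gets weight zero, while rarer tokens are weighted higher, which is why TF-IDF highlights behaviour that distinguishes one report from the rest.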

The following figure shows the system overview. Behavioural reports are generated separately in the CuckooDroid sandbox and then passed to the system for classification.

[Figure: system overview]

2. Installation

The system is implemented in Python 3.7.4 and uses the libraries listed in requirements.txt. To install the dependencies, run:

pip3 install -r requirements.txt

2.1. Download pre-trained models

To use the pre-trained models, follow the link and download the folders of the models you wish to use. Then unzip them and place each whole folder in the saved_models/ directory.

Five pre-trained models are available:

| Folder name | Embedding | Classifiers |
|---|---|---|
| tfidf_... | TF-IDF | XGBoost, SVC |
| d2v_...* | Doc2Vec DM and DBOW models | XGBoost, SVC |
| bert_... | BERT | ** |
| roberta_... | RoBERTa | ** |
| distilbert_... | DistilBERT | ** |

* Download all d2v_... folders
** Classification head is part of the model
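Before running a script, it can help to confirm that the downloaded folders are where the system expects them. The following is a small sketch, not part of the repository: the helper name find_model_dirs is hypothetical, and it assumes only what the table above states, namely that each model folder name starts with the listed prefix.

```python
from pathlib import Path

# Folder-name prefixes from the table above, keyed by embedding name.
MODEL_PREFIXES = {
    "tfidf": "tfidf_",
    "doc2vec": "d2v_",
    "bert": "bert_",
    "roberta": "roberta_",
    "distilbert": "distilbert_",
}


def find_model_dirs(embedding, models_root="saved_models"):
    """Return the sub-folders of models_root whose name matches the prefix."""
    prefix = MODEL_PREFIXES[embedding]
    root = Path(models_root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
```

An empty result for the chosen embedding suggests the folder was not unzipped into saved_models/. For doc2vec, several folders are expected because all d2v_... folders are needed.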

2.2. Download dataset

You can download the pre-generated dataset reports.json from the link. Place the downloaded file in the datasets/ directory.

The dataset was assembled from the publicly available University of New Brunswick CICInvesAndMal2019 dataset and supplemented with benign apps from the Benign 2015 and Benign 2017 datasets (link).

You can also create your own dataset from behavioural reports using the CuckooDroid sandbox and the downloader.py script. See section 2.2.1 for details.

2.2.1. Create reports dataset

To create a JSON dataset from behavioural reports on which models can be trained, use the downloader.py script. The generated reports in report_dir must be split into subdirectories so that benign reports and malware reports are kept in separate subdirectories. The names of the benign subdirectories need to be specified in downloader.py. The script then pre-processes the reports and saves them to a JSON file that the system can use. The following tree shows a possible structure of report_dir:

reports_dir
├── benign_1
├── benign_2
├── malware_1
└── malware_2
    ├── report_1
    └── report_2
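The directory convention above can be sketched as follows. This is an illustration rather than the code in downloader.py: the function name label_reports, the subdirectory names, and the 0/1 label encoding are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical names; the real list is configured inside downloader.py.
BENIGN_SUBDIRS = frozenset({"benign_1", "benign_2"})


def label_reports(report_dir, benign_subdirs=BENIGN_SUBDIRS):
    """Yield (report_path, label) pairs: 0 for benign, 1 for malware."""
    for subdir in sorted(Path(report_dir).iterdir()):
        if not subdir.is_dir():
            continue
        # Anything not listed as benign is treated as malware.
        label = 0 if subdir.name in benign_subdirs else 1
        for report in sorted(subdir.iterdir()):
            yield report, label
```

Because the label comes only from the subdirectory name, a benign subdirectory that is missing from the configured list would be labelled as malware, so the list needs to be complete.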

The downloader.py script removes the virustotal and signatures attributes from the behavioural reports because they contain classification results from the VirusTotal API. Keeping them would leak label information into the input and artificially inflate the detector's results.
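A minimal sketch of this filtering step is shown below. It assumes that virustotal and signatures are top-level keys of the report JSON, and the sample report content is made up for illustration. The actual pre-processing in downloader.py may differ.

```python
import json

# Attributes named above as carrying existing classification results.
LEAKY_KEYS = ("virustotal", "signatures")


def strip_leaky_attributes(report):
    """Return a copy of the report without the label-leaking attributes."""
    return {key: value for key, value in report.items() if key not in LEAKY_KEYS}


# Hypothetical report fragment, not real CuckooDroid output.
raw = json.loads(
    '{"virustotal": {"positives": 30}, "signatures": [],'
    ' "apkinfo": {"package": "com.example"}}'
)
clean = strip_leaky_attributes(raw)
```

The function returns a new dictionary and leaves the original report untouched, so the raw report can still be inspected afterwards.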

3. Operation modes

The system works in two operation modes. The user can either train and evaluate the models on a dataset of behavioural reports or run a classification of a single report with one of the pre-trained models.

3.1. Training and evaluation mode

The script train_evaluate_model.py trains and evaluates models on a dataset of behavioural reports. The absolute path to the dataset, which must be in JSON format, is given by the --dataset_path parameter. The embedding and classification methods are selected with --embedding and --classifier. The --train_emb and --train_cls parameters control whether new text embedding models (TF-IDF, Paragraph Vectors) and classification models (SVM, XGBoost) are trained, or whether already trained models are used for prediction on the test dataset. To select one of the Transformer models, only the --embedding and --train_emb parameters are needed. The following table lists all parameters.

| Argument | Description | Value |
|---|---|---|
| --dataset_path | Absolute path to the dataset of behavioural reports | <PATH> |
| --embedding | Requested embedding model | <tfidf,doc2vec,bert,roberta,distilbert> |
| --classifier | Requested classifier for TF-IDF and Doc2Vec embedding | <svc,xgb>* |
| --train_emb | Whether to train a new model or use an already trained one from the saved_models/ directory | |
| --train_cls | Whether to train a new classifier or use an already trained one from the saved_models/ directory * | |

* Applicable only for TF-IDF and Doc2Vec embedding

EXAMPLE: To train a new XGBoost classifier with TF-IDF embedding, simply run the following command:

python3 train_evaluate_model.py --dataset_path datasets/reports.json --embedding tfidf --classifier xgboost --train_emb --train_cls
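The parameters above could be parsed along the following lines. This is a hedged sketch using argparse, not the parser in train_evaluate_model.py: it assumes --train_emb and --train_cls are on/off switches (the example command passes them without a value) and deliberately leaves the accepted --classifier values unrestricted.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train and evaluate a model (sketch).")
    parser.add_argument("--dataset_path", required=True,
                        help="Path to the JSON dataset of behavioural reports")
    parser.add_argument("--embedding", required=True,
                        choices=["tfidf", "doc2vec", "bert", "roberta", "distilbert"])
    # Only meaningful for tfidf and doc2vec; values are not restricted here.
    parser.add_argument("--classifier", default=None)
    # Assumed to be boolean switches that default to "use the saved model".
    parser.add_argument("--train_emb", action="store_true")
    parser.add_argument("--train_cls", action="store_true")
    return parser


args = build_parser().parse_args([
    "--dataset_path", "datasets/reports.json",
    "--embedding", "tfidf", "--classifier", "xgboost",
    "--train_emb", "--train_cls",
])
```

With this layout, omitting --train_emb or --train_cls leaves the corresponding value False, which matches the described behaviour of reusing models from saved_models/.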

3.2. Report Classification

The script predict_report.py classifies a single behavioural report. To run it, the saved_models/ directory must contain trained models. The report's absolute path is supplied with the --report_path argument. The embedding and classification methods are selected with the --embedding and --classifier parameters. The following table lists all arguments:

| Argument | Description | Value |
|---|---|---|
| --report_path | Absolute path to the behavioural report | <PATH> |
| --embedding | Requested embedding model | <tfidf,doc2vec,bert,roberta,distilbert> |
| --classifier | Requested classifier for TF-IDF and Doc2Vec embedding | <svc,xgb>* |

* Applicable only for TF-IDF and Doc2Vec embedding

EXAMPLE: To classify a report with XGBoost classifier and TF-IDF embedding, simply run:

python3 predict_report.py --report_path report_samples/report_malware.json --embedding tfidf --classifier xgboost

The report_samples/ directory contains two already generated reports.
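Since predict_report.py handles one report per run, classifying several reports means invoking it repeatedly. The sketch below only builds the argument list for one run. The helper predict_command is hypothetical, and the check reflects the note that --classifier applies only to the TF-IDF and Doc2Vec embeddings.

```python
import sys

# Embeddings whose classification head is part of the model.
TRANSFORMERS = {"bert", "roberta", "distilbert"}


def predict_command(report_path, embedding, classifier=None):
    """Build the argument list for one predict_report.py run."""
    cmd = [sys.executable, "predict_report.py",
           "--report_path", report_path, "--embedding", embedding]
    if classifier is not None:
        if embedding in TRANSFORMERS:
            raise ValueError("--classifier applies only to tfidf and doc2vec")
        cmd += ["--classifier", classifier]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, once per report.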
