Semantics-aware Android Malware detector

This repository contains the implementation of the malware detection system proposed in the MSc thesis "Semantics-aware Malware Detection using Natural Language Processing on behavioural reports analysis" at the University of Southampton.

1. Overview

The aim of the project is to develop a generic malware detection tool that recognises malware based on behavioural reports generated during dynamic and static analysis of Android applications. The system implements conventional bag-of-words TF-IDF text vectorization alongside Paragraph Vector models and language models based on the Transformer architecture, and offers a comparison of the models' performance.

The following figure shows the system overview. Behavioural reports are generated separately in the CuckooDroid sandbox and then passed to the system for classification.
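
As a rough illustration of the bag-of-words TF-IDF step mentioned in the overview, the sketch below computes TF-IDF weights in plain Python. It is not the project's implementation, which may well rely on a library vectorizer. The function name `tfidf_vectors`, the whitespace tokenizer, the smoothed IDF formula, and the toy reports are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Assumptions (illustrative only): whitespace tokenization, term
    frequency normalised by document length, and a smoothed IDF of
    log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty docs
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Toy "reports" reduced to a few behaviour tokens (made-up data).
reports = [
    "send_sms read_contacts open_socket",
    "open_socket read_file",
    "send_sms send_sms read_contacts",
]
weights = tfidf_vectors(reports)
```

Terms that occur in fewer reports receive a higher weight than terms shared by many reports, which is the property a downstream classifier such as SVC or XGBoost can exploit.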

2. Installation

The system is implemented in Python 3.7.4, using the libraries specified in requirements.txt. To install the dependencies, run:

pip3 install -r requirements.txt

2.1. Download pre-trained models

To use the pre-trained models, go to the link and download the folders of the models you wish to use. Then unzip each archive and place the whole folder in the saved_models/ directory.

Five pre-trained models are available:

Folder name	Embedding	Classifiers
tfidf_...	TF-IDF	XGBoost, SVC
d2v_...^*	Doc2Vec DM and DBOW models	XGBoost, SVC
bert_...	BERT	^**
roberta_...	RoBERTa	^**
distilbert_...	DistilBERT	^**

^* Download all d2v_... folders
^** Classification head is part of the model

2.2. Download dataset

You can download the pre-generated dataset reports.json from the link. Place the downloaded dataset in the datasets/ directory.

The dataset was assembled from the publicly available University of New Brunswick CICInvesAndMal2019 dataset and supplemented with benign apps from the Benign 2015 and Benign 2017 datasets (link).

You can also create your own dataset from behavioural reports using the CuckooDroid sandbox and the downloader.py script. See section 2.2.1 for details.

2.2.1. Create reports dataset

The downloader.py script creates a JSON dataset from behavioural reports, which can then be used for model training. The generated reports in report_dir must be split into subdirectories so that benign reports are kept in different subdirectories from malware reports. The names of the benign subdirectories need to be specified in downloader.py. The script then pre-processes the reports and saves them to a JSON file that the system can use. The following tree shows a possible structure of report_dir:

reports_dir
├── benign_1
├── benign_2
├── malware_1
└── malware_2
    ├── report_1
    └── report_2

The downloader.py script removes the virustotal and signatures attributes from the behavioural reports because they contain classification results from the VirusTotal API, which would leak label information and artificially inflate the detector's results.
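
The dataset creation steps described in this section can be sketched roughly as follows. This is a minimal illustration, not the actual downloader.py: the names `BENIGN_DIRS`, `LEAKY_KEYS`, and `build_dataset`, the assumption that each report is a `.json` file holding a JSON object, and the output record layout (`label` plus `report`) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical names; the real downloader.py may be organised differently.
BENIGN_DIRS = {"benign_1", "benign_2"}
LEAKY_KEYS = ("virustotal", "signatures")


def build_dataset(report_dir, output_path, benign_dirs=BENIGN_DIRS):
    """Collect reports from labelled subdirectories into one JSON file."""
    records = []
    for subdir in sorted(Path(report_dir).iterdir()):
        if not subdir.is_dir():
            continue
        # Subdirectories not listed as benign are treated as malware.
        label = "benign" if subdir.name in benign_dirs else "malware"
        for report_file in sorted(subdir.glob("*.json")):
            with open(report_file, encoding="utf-8") as fh:
                report = json.load(fh)
            # Drop attributes that carry VirusTotal classification results.
            for key in LEAKY_KEYS:
                report.pop(key, None)
            records.append({"label": label, "report": report})
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return records


# Usage on a throwaway directory with made-up report content.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "reports_dir"
    for folder, name in [("benign_1", "report_1.json"),
                         ("malware_1", "report_2.json")]:
        (root / folder).mkdir(parents=True)
        sample = {"virustotal": {"positives": 0}, "signatures": [],
                  "apkinfo": {"package": "com.example"}}
        (root / folder / name).write_text(json.dumps(sample), encoding="utf-8")
    dataset = build_dataset(root, Path(tmp) / "reports.json")
```

Removing the two attributes before the label is attached keeps any third-party verdict out of the features the models are trained on.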

3. Operation modes

The system works in two operation modes. The user can either train and evaluate the models on a dataset of behavioural reports or run a classification of a single report with one of the pre-trained models.

3.1. Training and evaluation mode

The train_evaluate_model.py script trains and evaluates models on a dataset of behavioural reports. The absolute path to the dataset, a JSON file, is supplied with the --dataset_path parameter. The embedding and classification methods are selected with --embedding and --classifier. The --train_emb and --train_cls flags control whether new text embedding models (TF-IDF, Paragraph Vectors) and classification models (SVM, XGBoost) are trained, or whether already trained models are used for prediction on the test dataset. For the Transformer models, only the --embedding and --train_emb parameters are needed. The following table lists all parameters.

Argument	Description	Value
--dataset_path	Absolute path to dataset of behavioural reports	<PATH>
--embedding	Requested embedding model	<tfidf,doc2vec,bert, roberta,distilbert>
--classifier	Requested classifier for TF-IDF and Doc2Vec embedding	<svc,xgb>^*
--train_emb	Flag: train a new embedding model instead of using an already trained one from `saved_models/`
--train_cls	Flag: train a new classifier instead of using an already trained one from `saved_models/`	^*

^* Applicable only for TF-IDF and Doc2Vec embedding

EXAMPLE: To train a new XGBoost classifier with TF-IDF embedding, simply run the following command:

python3 train_evaluate_model.py --dataset_path datasets/reports.json --embedding tfidf --classifier xgboost --train_emb --train_cls
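
For readers who want to see how the documented arguments fit together, here is a hypothetical `argparse` sketch that mirrors the parameter table. It is an assumption about how such a command line could be parsed, not a copy of train_evaluate_model.py; the function name `build_parser` and the help texts are invented for illustration, and the accepted values of --classifier are deliberately left unrestricted.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(description="Train and evaluate a model")
    parser.add_argument("--dataset_path", required=True,
                        help="Path to the JSON dataset of behavioural reports")
    parser.add_argument("--embedding", required=True,
                        choices=["tfidf", "doc2vec", "bert",
                                 "roberta", "distilbert"])
    # Only meaningful for TF-IDF and Doc2Vec; the Transformer models
    # include their own classification head.
    parser.add_argument("--classifier", default=None)
    # Boolean flags: present means "train a new model", absent means
    # "load an already trained one from saved_models/".
    parser.add_argument("--train_emb", action="store_true")
    parser.add_argument("--train_cls", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--dataset_path", "datasets/reports.json",
    "--embedding", "tfidf",
    "--classifier", "svc",
    "--train_emb", "--train_cls",
])
defaults = build_parser().parse_args(
    ["--dataset_path", "datasets/reports.json", "--embedding", "bert"]
)
```

Using `store_true` flags matches the documented usage, where --train_emb and --train_cls are passed without a value.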

3.2. Report Classification

The predict_report.py script classifies a single behavioural report. To run it, the saved_models/ directory must contain trained models. The report's absolute path is supplied with the --report_path argument, and the embedding and classification methods are selected with the --embedding and --classifier parameters. The following table lists all arguments:

Argument	Description	Value
--report_path	Absolute path to behavioural report	<PATH>
--embedding	Requested embedding model	<tfidf,doc2vec,bert, roberta,distilbert>
--classifier	Requested classifier for TF-IDF and Doc2Vec embedding	<svc,xgb>^*

^* Applicable only for TF-IDF and Doc2Vec embedding

EXAMPLE: To classify a report with XGBoost classifier and TF-IDF embedding, simply run:

python3 predict_report.py --report_path report_samples/report_malware.json --embedding tfidf --classifier xgboost

The report_samples/ directory contains two pre-generated reports.
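
Before a text embedding can be applied, a JSON report has to be turned into text. The sketch below shows one possible way to do that by flattening the nested structure into `path=value` tokens. This is purely an assumption for illustration: the helper `flatten_report`, the token format, and the report fragment are invented, and the project's actual pre-processing may differ.

```python
import json


def flatten_report(node, prefix=""):
    """Yield 'path=value' tokens for every leaf of a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_report(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for item in node:
            yield from flatten_report(item, prefix)
    else:
        # Leaf value: strip the trailing dot from the accumulated path.
        yield f"{prefix.rstrip('.')}={node}"


# Made-up report fragment; real behavioural reports are far larger.
report = json.loads(
    '{"network": {"hosts": ["10.0.0.1"]},'
    ' "permissions": ["SEND_SMS", "INTERNET"]}'
)
text = " ".join(flatten_report(report))
```

The resulting string can then be passed to whichever vectorizer or tokenizer the selected embedding uses.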
