Project Nebula


Description

This repository accompanies the publication Nebula: Self-Attention for Dynamic Malware Analysis.
A pre-print is available on arXiv.

Repository Structure

  • All Nebula and alternative dynamic malware analysis models are under the nebula/models directory.
  • Usage examples are under the scripts/ directory.
  • Code for extracting emulation traces from raw PE samples is under emulation/.

Datasets

The dataset used for experiments and pre-training is downloadable from huggingface.co/datasets/dtrizna/quovadis-speakeasy.

Additionally, while not used directly in this project, EMBER feature vectors for the same malware samples are available on HuggingFace as well: huggingface.co/datasets/dtrizna/quovadis-ember.

This enables cross-analysis research on static and dynamic detection methodologies, for example by pairing both feature sets per sample as sketched below.
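A minimal sketch of such a pairing, assuming both datasets expose a common sample hash field (the field name sha256 and the split name train are assumptions; check the dataset cards):

from datasets import load_dataset

# Hypothetical per-sample pairing of dynamic (Speakeasy) and static (EMBER) features.
# The "sha256" field and the "train" split are assumptions, not confirmed schema.
dynamic = load_dataset("dtrizna/quovadis-speakeasy", split="train")
static = load_dataset("dtrizna/quovadis-ember", split="train")

static_by_hash = {row["sha256"]: row for row in static}
paired = [
    (dyn_row, static_by_hash[dyn_row["sha256"]])
    for dyn_row in dynamic
    if dyn_row["sha256"] in static_by_hash
]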

Important: Dataset Structure and Entry Points

The Speakeasy emulator captures execution across multiple entry points per sample:

  • module_entry: The main executable entry point
  • thread: Any threads spawned during emulation

The HuggingFace dataset stores each entry point as a separate row. This means:

  • A sample spawning 100 threads = 101 rows in the dataset
  • Each row has ep_type field indicating its type
Approach comparison:

  • Nebula class (recommended): automatically aggregates all entry points into a single prediction.
  • Direct dataset loading: each entry point is treated separately; you must aggregate manually.

For correct per-sample predictions when loading the dataset directly:

from datasets import load_dataset

# Load the dataset; pick the split you need (the split name "train" is an assumption, check the dataset card)
dataset = load_dataset("dtrizna/quovadis-speakeasy", split="train")

# Option 1: Use only module_entry rows (ignores thread behavior)
module_entries = [row for row in dataset if row['ep_type'] == 'module_entry']

# Option 2: Group by sample using the filename from API args, then aggregate (see the sketch below)
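
A minimal sketch of Option 2, where predict_fn and key_fn are hypothetical placeholders: predict_fn maps a row to a maliciousness probability, and key_fn extracts a per-sample identifier (how the filename appears in the report's API args depends on the dataset schema):

import numpy as np

def aggregate_per_sample(rows, predict_fn, key_fn):
    # Group entry-point rows by sample and take the maximum probability,
    # i.e. a sample is as suspicious as its most suspicious entry point.
    per_sample = {}
    for row in rows:
        per_sample.setdefault(key_fn(row), []).append(predict_fn(row))
    return {sample: float(np.max(probs)) for sample, probs in per_sample.items()}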

Alternatively, use raw JSON reports with PEDynamicFeatureExtractor.filter_and_normalize_report(), which concatenates all entry points into a single representation.

Citation

If you find this code or data valuable, please cite us:

@ARTICLE{10551436,
  author={Trizna, Dmitrijs and Demetrio, Luca and Biggio, Battista and Roli, Fabio},
  journal={IEEE Transactions on Information Forensics and Security}, 
  title={Nebula: Self-Attention for Dynamic Malware Analysis}, 
  year={2024},
  volume={19},
  number={},
  pages={6155-6167},
  keywords={Malware;Feature extraction;Data models;Analytical models;Long short term memory;Task analysis;Encoding;Malware;transformers;dynamic analysis;convolutional neural networks},
  doi={10.1109/TIFS.2024.3409083}}


@inproceedings{10.1145/3560830.3563726,
  author = {Trizna, Dmitrijs},
  title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations},
  year = {2022},
  isbn = {9781450398800},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3560830.3563726},
  doi = {10.1145/3560830.3563726},
  booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security},
  pages = {127–136},
  numpages = {10},
  keywords = {reverse engineering, neural networks, malware, emulation, convolutions},
  location = {Los Angeles, CA, USA},
  series = {AISec'22}
}

Installation

Model code and pre-trained Nebula objects are available as a pip package:

pip install git+https://github.com/dtrizna/nebula

Usage Example

from nebula import Nebula

# 0. MODEL SETUP
nebula = Nebula(
    vocab_size=50000,  # pre-trained only for 50k
    seq_len=512,       # pre-trained only for 512
    tokenizer="bpe",   # supports: ["bpe", "whitespace"]
)

# 1. EMULATE IT: SKIP IF YOU HAVE JSON REPORTS ALREADY
pe = r"C:\Windows\System32\calc.exe"
report = nebula.dynamic_analysis_pe_file(pe)

# 2. PREPROCESS EMULATED JSON REPORT AS ARRAY
x_arr = nebula.preprocess(report)

# 3. PASS THROUGH PYTORCH MODEL
prob = nebula.predict_proba(x_arr)

print(f"\n[!] Probability of being malicious: {prob:.3f}")

Running this:

> python3 scripts\nebula_pe_to_preds.py

INFO:root: [!] Successfully loaded pre-trained tokenizer model!
INFO:root: [!] Loaded vocab from <REDACTED>\nebula\objects\bpe_50000_vocab.json
INFO:root: [!] Tokenizer ready!
INFO:root: [!] Model ready!

[!] Probability of being malicious: 0.001

Training Custom Tokenizers

The repository includes a pre-trained BPE tokenizer for vocab_size=50000:

  • Model: nebula/objects/bpe_50000_sentencepiece.model
  • Vocabulary: nebula/objects/bpe_50000_vocab.json

For ablation experiments with different vocabulary sizes, you can train custom tokenizers using the dataset from HuggingFace.

Note: Files under scripts/concept-testing/ are development artifacts and should not be used for reproducing paper results. Always use files from nebula/objects/ for the pre-trained 50k configuration.

Training a BPE Tokenizer

import json
from pathlib import Path

from nebula.preprocessing import JSONTokenizerBPE

# Load training corpus - concatenate all JSON reports into a single string
reports = []
for f in Path("path/to/reports").glob("*.json"):
    with open(f) as fp:
        reports.append(json.dumps(json.load(fp)))
training_corpus = " ".join(reports)

# Initialize and train tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,  # target vocab size
    seq_len=512
)

tokenizer.train(
    jsonData=training_corpus,
    vocab_size=10000,
    model_prefix="bpe_10000_sentencepiece",
    model_type="bpe",
    split_by_number=False,
    spLength=4192
)

# Outputs: bpe_10000_sentencepiece.model, bpe_10000_sentencepiece_vocab.json

Encoding Data with Custom Tokenizer

from nebula.preprocessing import JSONTokenizerBPE

# Load the trained tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,
    seq_len=512,
    model_path="bpe_10000_sentencepiece.model",
    vocab="bpe_10000_sentencepiece_vocab.json"
)

# Encode reports (json_reports: the emulation reports to encode, as in the training step)
x_encoded = tokenizer.encode(json_reports, pad=True)

Pre-training with Self-Supervised Learning (SSL)

Nebula can learn from unlabeled data using self-supervised learning (SSL) techniques. An extensive evaluation of SSL efficiency and an API-level interface for it are subjects of future work.

Masked Language Model

The implementation is in the nebula.lit_pretraining.MaskedLanguageModelTrainer class.
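
For illustration only, a minimal sketch of the masked-language-modeling objective on token-id tensors; this is not the trainer's API, and the mask/pad token ids are assumptions that depend on the tokenizer vocabulary:

import torch

def mask_tokens(x, mask_token_id, mask_prob=0.15, pad_token_id=0):
    # Randomly mask a fraction of non-padding tokens; return (masked_input, labels).
    # Labels are -100 at unmasked positions so cross-entropy ignores them.
    labels = x.clone()
    maskable = x != pad_token_id
    mask = (torch.rand(x.shape) < mask_prob) & maskable
    labels[~mask] = -100
    masked_x = x.clone()
    masked_x[mask] = mask_token_id
    return masked_x, labels

# The model then predicts the original tokens at masked positions:
# logits = model(masked_x)                      # (batch, seq_len, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)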

Auto-Regressive Language Model

The implementation is in the nebula.lit_pretraining.AutoRegressiveModelTrainer class.
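
Similarly for illustration, next-token prediction shifts the sequence by one position; again, this is not the trainer's API, and the padding token id is an assumption:

import torch

def next_token_pairs(x, pad_token_id=0):
    # The model reads x[:, :-1] and is trained to predict x[:, 1:].
    inputs = x[:, :-1]
    targets = x[:, 1:].clone()
    targets[targets == pad_token_id] = -100  # ignore padding in the loss
    return inputs, targets

# logits = model(inputs)                        # (batch, seq_len - 1, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100)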
