This repository is part of the "Nebula: Self-Attention for Dynamic Malware Analysis" publication. A pre-print is available on arXiv.
- All Nebula and alternative dynamic malware analysis models are under the `nebula/models/` directory.
- Usage examples are under the `scripts/` directory.
- Code for extracting emulation traces from raw PE samples is under `emulation/`.
The dataset used for experiments and pretraining can be downloaded from huggingface.co/datasets/dtrizna/quovadis-speakeasy. Additionally, while not used directly in this project, EMBER feature vectors for the same malware samples are available on Hugging Face as well: huggingface.co/datasets/dtrizna/quovadis-ember. This enables cross-analysis research on static and dynamic detection methodologies.
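Both datasets can be pulled with the `datasets` library. This is a minimal sketch; inspect each dataset card for the exact split names and column schema:

```python
from datasets import load_dataset

# Dynamic analysis traces (Speakeasy emulation reports)
dynamic = load_dataset("dtrizna/quovadis-speakeasy")

# Static EMBER feature vectors for the same samples
static = load_dataset("dtrizna/quovadis-ember")

print(dynamic)  # shows available splits and columns
print(static)
```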
The Speakeasy emulator captures execution across multiple entry points per sample:

- `module_entry`: the main executable entry point
- `thread`: any threads spawned during emulation

The HuggingFace dataset stores each entry point as a separate row. This means:

- a sample spawning 100 threads = 101 rows in the dataset;
- each row has an `ep_type` field indicating its type.
| Approach | Behavior |
|---|---|
| `Nebula` class (recommended) | Automatically aggregates all entry points into a single prediction |
| Direct dataset loading | Each entry point is treated separately; you must aggregate manually |
For correct per-sample predictions when loading the dataset directly:

```python
from datasets import load_dataset

# NOTE: the split name "train" is an assumption; check the dataset card.
dataset = load_dataset("dtrizna/quovadis-speakeasy", split="train")

# Option 1: use only module_entry rows (ignores thread behavior)
module_entries = [row for row in dataset if row["ep_type"] == "module_entry"]

# Option 2: group rows by sample using the filename from API args, then aggregate
```

Alternatively, use raw JSON reports with `PEDynamicFeatureExtractor.filter_and_normalize_report()`, which concatenates all entry points into a single representation.
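A minimal sketch of that alternative route, assuming `PEDynamicFeatureExtractor` is importable from `nebula.preprocessing` and that `filter_and_normalize_report()` accepts a parsed report; verify both against the package source:

```python
import json

from nebula.preprocessing import PEDynamicFeatureExtractor  # import path assumed

extractor = PEDynamicFeatureExtractor()

# A raw Speakeasy report may contain several entry points; per the text above,
# filter_and_normalize_report() merges them into a single representation.
with open("speakeasy_report.json") as fp:  # hypothetical report path
    raw_report = json.load(fp)

normalized_report = extractor.filter_and_normalize_report(raw_report)
```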
If you find this code or data valuable, please cite us:
@ARTICLE{10551436,
author={Trizna, Dmitrijs and Demetrio, Luca and Biggio, Battista and Roli, Fabio},
journal={IEEE Transactions on Information Forensics and Security},
title={Nebula: Self-Attention for Dynamic Malware Analysis},
year={2024},
volume={19},
number={},
pages={6155-6167},
keywords={Malware;Feature extraction;Data models;Analytical models;Long short term memory;Task analysis;Encoding;Malware;transformers;dynamic analysis;convolutional neural networks},
doi={10.1109/TIFS.2024.3409083}}
@inproceedings{10.1145/3560830.3563726,
author = {Trizna, Dmitrijs},
title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations},
year = {2022},
isbn = {9781450398800},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3560830.3563726},
doi = {10.1145/3560830.3563726},
booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security},
pages = {127–136},
numpages = {10},
keywords = {reverse engineering, neural networks, malware, emulation, convolutions},
location = {Los Angeles, CA, USA},
series = {AISec'22}
}
Model code and Nebula pretrained objects are available as a pip package:

```bash
pip install git+https://github.com/dtrizna/nebula
```

```python
from nebula import Nebula

# 0. MODEL SETUP
nebula = Nebula(
    vocab_size = 50000,  # pre-trained only for 50k
    seq_len = 512,       # pre-trained only for 512
    tokenizer = "bpe",   # supports: ["bpe", "whitespace"]
)

# 1. EMULATE IT: SKIP IF YOU HAVE JSON REPORTS ALREADY
pe = r"C:\Windows\System32\calc.exe"
report = nebula.dynamic_analysis_pe_file(pe)

# 2. PREPROCESS EMULATED JSON REPORT AS ARRAY
x_arr = nebula.preprocess(report)

# 3. PASS THROUGH PYTORCH MODEL
prob = nebula.predict_proba(x_arr)
print(f"\n[!] Probability of being malicious: {prob:.3f}")
```

Running this:
```text
> python3 scripts\nebula_pe_to_preds.py
INFO:root: [!] Successfully loaded pre-trained tokenizer model!
INFO:root: [!] Loaded vocab from <REDACTED>\nebula\objects\bpe_50000_vocab.json
INFO:root: [!] Tokenizer ready!
INFO:root: [!] Model ready!

[!] Probability of being malicious: 0.001
```

The repository includes a pre-trained BPE tokenizer for `vocab_size=50000` (see the loading sketch below):

- Model: `nebula/objects/bpe_50000_sentencepiece.model`
- Vocabulary: `nebula/objects/bpe_50000_vocab.json`
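For example, the pre-trained objects can be loaded directly with `JSONTokenizerBPE`. This is a minimal sketch mirroring the custom-tokenizer loading example further below; the paths assume the repository root as the working directory:

```python
from nebula.preprocessing import JSONTokenizerBPE

# Load the pre-trained 50k BPE tokenizer shipped in nebula/objects/
tokenizer = JSONTokenizerBPE(
    vocab_size=50000,
    seq_len=512,
    model_path="nebula/objects/bpe_50000_sentencepiece.model",
    vocab="nebula/objects/bpe_50000_vocab.json",
)
```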
For ablation experiments with different vocabulary sizes, you can train custom tokenizers using the dataset from HuggingFace.
Note: files under `scripts/concept-testing/` are development artifacts and should not be used for reproducing paper results. Always use files from `nebula/objects/` for the pre-trained 50k configuration.
```python
import json
from pathlib import Path

from nebula.preprocessing import JSONTokenizerBPE

# Load training corpus - concatenate all JSON reports
reports = []
for f in Path("path/to/reports").glob("*.json"):
    with open(f) as fp:
        reports.append(json.dumps(json.load(fp)))
training_corpus = " ".join(reports)

# Initialize and train tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,  # target vocab size
    seq_len=512
)
tokenizer.train(
    jsonData=training_corpus,
    vocab_size=10000,
    model_prefix="bpe_10000_sentencepiece",
    model_type="bpe",
    split_by_number=False,
    spLength=4192
)
# Outputs: bpe_10000_sentencepiece.model, bpe_10000_sentencepiece_vocab.json
```

```python
# Load trained tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,
    seq_len=512,
    model_path="bpe_10000_sentencepiece.model",
    vocab="bpe_10000_sentencepiece_vocab.json"
)

# Encode reports (json_reports: a list of raw JSON report strings)
x_encoded = tokenizer.encode(json_reports, pad=True)
```

Nebula is capable of learning from unlabeled data using self-supervised learning (SSL) techniques. Extensive evaluation of SSL efficiency and an API-level interface are subjects of future work.
- Masked language modeling is implemented in the `nebula.lit_pretraining.MaskedLanguageModelTrainer` class.
- Autoregressive modeling is implemented in the `nebula.lit_pretraining.AutoRegressiveModelTrainer` class.
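A minimal sketch of the corresponding imports; constructor arguments and training-loop wiring are not documented in this README, so consult the class definitions before use:

```python
# Self-supervised pretraining entry points named above; the required
# constructor arguments are defined in the nebula.lit_pretraining module.
from nebula.lit_pretraining import MaskedLanguageModelTrainer, AutoRegressiveModelTrainer
```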
