This repository is part of the "Nebula: Self-Attention for Dynamic Malware Analysis" publication. A pre-print is available on arXiv.
- All Nebula and alternative dynamic malware analysis models are under the `nebula/models/` directory.
- Usage examples are under the `scripts/` directory.
- Code for extracting emulation traces from raw PE samples is under `emulation/`.
The dataset used for experiments and pretraining can be downloaded from huggingface.co/datasets/dtrizna/quovadis-speakeasy. Additionally, while not used directly in this project, EMBER feature vectors for the same malware samples are available on Hugging Face as well: huggingface.co/datasets/dtrizna/quovadis-ember. This enables cross-analysis research on static and dynamic detection methodologies.
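Both datasets can be pulled with the `datasets` library. This is a minimal sketch; inspect each dataset card for the exact split names and column schema:

```python
from datasets import load_dataset

# Dynamic analysis traces (Speakeasy emulation reports)
dynamic = load_dataset("dtrizna/quovadis-speakeasy")

# Static EMBER feature vectors for the same samples
static = load_dataset("dtrizna/quovadis-ember")

print(dynamic)  # shows available splits and columns
print(static)
```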
The Speakeasy emulator captures execution across multiple entry points per sample:

- `module_entry`: the main executable entry point
- `thread`: any threads spawned during emulation

The HuggingFace dataset stores each entry point as a separate row. This means:

- a sample spawning 100 threads = 101 rows in the dataset;
- each row has an `ep_type` field indicating its type.
| Approach | Behavior |
|---|---|
| `Nebula` class (recommended) | Automatically aggregates all entry points into a single prediction |
| Direct dataset loading | Each entry point is treated separately; you must aggregate manually |
For correct per-sample predictions when loading the dataset directly:

```python
from datasets import load_dataset

# NOTE: the split name "train" is an assumption; check the dataset card.
dataset = load_dataset("dtrizna/quovadis-speakeasy", split="train")

# Option 1: use only module_entry rows (ignores thread behavior)
module_entries = [row for row in dataset if row["ep_type"] == "module_entry"]

# Option 2: group rows by sample using the filename from API args, then aggregate
```

Alternatively, use raw JSON reports with `PEDynamicFeatureExtractor.filter_and_normalize_report()`, which concatenates all entry points into a single representation.
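A minimal sketch of that alternative route, assuming `PEDynamicFeatureExtractor` is importable from `nebula.preprocessing` and that `filter_and_normalize_report()` accepts a parsed report; verify both against the package source:

```python
import json

from nebula.preprocessing import PEDynamicFeatureExtractor  # import path assumed

extractor = PEDynamicFeatureExtractor()

# A raw Speakeasy report may contain several entry points; per the text above,
# filter_and_normalize_report() merges them into a single representation.
with open("speakeasy_report.json") as fp:  # hypothetical report path
    raw_report = json.load(fp)

normalized_report = extractor.filter_and_normalize_report(raw_report)
```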
If you find this code or data valuable, please cite us:
@ARTICLE{10551436,
author={Trizna, Dmitrijs and Demetrio, Luca and Biggio, Battista and Roli, Fabio},
journal={IEEE Transactions on Information Forensics and Security},
title={Nebula: Self-Attention for Dynamic Malware Analysis},
year={2024},
volume={19},
number={},
pages={6155-6167},
keywords={Malware;Feature extraction;Data models;Analytical models;Long short term memory;Task analysis;Encoding;Malware;transformers;dynamic analysis;convolutional neural networks},
doi={10.1109/TIFS.2024.3409083}}
@inproceedings{10.1145/3560830.3563726,
author = {Trizna, Dmitrijs},
title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations},
year = {2022},
isbn = {9781450398800},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3560830.3563726},
doi = {10.1145/3560830.3563726},
booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security},
pages = {127–136},
numpages = {10},
keywords = {reverse engineering, neural networks, malware, emulation, convolutions},
location = {Los Angeles, CA, USA},
series = {AISec'22}
}
Model code and Nebula pretrained objects are available as a pip package:

```bash
pip install git+https://github.com/dtrizna/nebula
```

```python
from nebula import Nebula

# 0. MODEL SETUP
nebula = Nebula(
    vocab_size = 50000,  # pre-trained only for 50k
    seq_len = 512,       # pre-trained only for 512
    tokenizer = "bpe",   # supports: ["bpe", "whitespace"]
)

# 1. EMULATE IT: SKIP IF YOU HAVE JSON REPORTS ALREADY
pe = r"C:\Windows\System32\calc.exe"
report = nebula.dynamic_analysis_pe_file(pe)

# 2. PREPROCESS EMULATED JSON REPORT AS ARRAY
x_arr = nebula.preprocess(report)

# 3. PASS THROUGH PYTORCH MODEL
prob = nebula.predict_proba(x_arr)
print(f"\n[!] Probability of being malicious: {prob:.3f}")
```

Running this:
```text
> python3 scripts\nebula_pe_to_preds.py
INFO:root: [!] Successfully loaded pre-trained tokenizer model!
INFO:root: [!] Loaded vocab from <REDACTED>\nebula\objects\bpe_50000_vocab.json
INFO:root: [!] Tokenizer ready!
INFO:root: [!] Model ready!

[!] Probability of being malicious: 0.001
```

The repository includes a pre-trained BPE tokenizer for `vocab_size=50000` (see the loading sketch below):

- Model: `nebula/objects/bpe_50000_sentencepiece.model`
- Vocabulary: `nebula/objects/bpe_50000_vocab.json`
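For example, the pre-trained objects can be loaded directly with `JSONTokenizerBPE`. This is a minimal sketch mirroring the custom-tokenizer loading example further below; the paths assume the repository root as the working directory:

```python
from nebula.preprocessing import JSONTokenizerBPE

# Load the pre-trained 50k BPE tokenizer shipped in nebula/objects/
tokenizer = JSONTokenizerBPE(
    vocab_size=50000,
    seq_len=512,
    model_path="nebula/objects/bpe_50000_sentencepiece.model",
    vocab="nebula/objects/bpe_50000_vocab.json",
)
```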
For ablation experiments with different vocabulary sizes, you can train custom tokenizers using the dataset from HuggingFace.
Note: files under `scripts/concept-testing/` are development artifacts and should not be used for reproducing paper results. Always use files from `nebula/objects/` for the pre-trained 50k configuration.
```python
import json
from pathlib import Path

from nebula.preprocessing import JSONTokenizerBPE

# Load training corpus - concatenate all JSON reports
reports = []
for f in Path("path/to/reports").glob("*.json"):
    with open(f) as fp:
        reports.append(json.dumps(json.load(fp)))
training_corpus = " ".join(reports)

# Initialize and train tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,  # target vocab size
    seq_len=512
)
tokenizer.train(
    jsonData=training_corpus,
    vocab_size=10000,
    model_prefix="bpe_10000_sentencepiece",
    model_type="bpe",
    split_by_number=False,
    spLength=4192
)
# Outputs: bpe_10000_sentencepiece.model, bpe_10000_sentencepiece_vocab.json
```

```python
# Load trained tokenizer
tokenizer = JSONTokenizerBPE(
    vocab_size=10000,
    seq_len=512,
    model_path="bpe_10000_sentencepiece.model",
    vocab="bpe_10000_sentencepiece_vocab.json"
)

# Encode reports (json_reports: a list of raw JSON report strings)
x_encoded = tokenizer.encode(json_reports, pad=True)
```

Nebula is capable of learning from unlabeled data using self-supervised learning (SSL) techniques. Extensive evaluation of SSL efficiency and an API-level interface are subjects of future work.
- Masked language modeling is implemented in the `nebula.lit_pretraining.MaskedLanguageModelTrainer` class.
- Autoregressive modeling is implemented in the `nebula.lit_pretraining.AutoRegressiveModelTrainer` class.
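A minimal sketch of the corresponding imports; constructor arguments and training-loop wiring are not documented in this README, so consult the class definitions before use:

```python
# Self-supervised pretraining entry points named above; the required
# constructor arguments are defined in the nebula.lit_pretraining module.
from nebula.lit_pretraining import MaskedLanguageModelTrainer, AutoRegressiveModelTrainer
```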
