Disviel/TAMSelector

TAMSelector

This is the codebase for the paper "Less is More: Boosting Vulnerability Detection via Target-Aware and Multi-Space Data Selection".

Introduction

TAMSelector is a novel Target-Aware and Multi-space data Selector that identifies the most valuable training samples across three complementary spaces (code space, representation space, and activation space) to improve both the efficiency and the effectiveness of fine-tuning for source code vulnerability detection.

Installation

First, clone the TAMSelector repository.

git clone https://github.com/TAMSelector/TAMSelector.git
cd TAMSelector

Then, activate your Python virtual environment and install the dependencies.

pip install -r requirements.txt

You can then import TAMSelector in your Python files.

from tam_selector import TAMSelector, TAMDataset, KSAE

Usage

We provide demo.py to demonstrate the basic usage of TAMSelector.
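The demo below reads its source and target datasets from JSON files. The exact schema is defined by TAMDataset; as an illustration only, here is a minimal sketch that writes such a file with Python's standard library (the `code`/`label` field names are assumptions, not the library's documented schema):

```python
import json

# Hypothetical schema: a list of samples, each pairing a code snippet
# with a vulnerability label (1 = vulnerable, 0 = benign). The actual
# schema expected by TAMDataset may differ.
samples = [
    {"code": "int main(void) { return 0; }", "label": 0},
    {"code": "strcpy(buf, user_input);", "label": 1},
]

with open("src_dataset.json", "w") as f:
    json.dump(samples, f, indent=2)
```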

from tam_selector import TAMSelector, TAMDataset, KSAE
from transformers import AutoModel, AutoTokenizer

# Step.1 Specify the model you want to use
MODEL = "microsoft/codebert-base"

# Step.2 Use TAMDataset to read the src and tgt datasets. Note that they must be JSON files.
src_dataset = TAMDataset(data_path="./testcase/src_dataset.json")
tgt_dataset = TAMDataset(data_path="./testcase/tgt_dataset.json")

# Step.3 Train the K-Sphere Auto Encoder (KSAE)
# The model and training data can be specified as needed. Here, the src dataset is used for convenience.
# The ksae is specified by its name. If a ksae with the same name already exists, it will not be retrained.
# By default, the ksae is saved under ./outputs/models/{name}.
KSAE.train(
    model=AutoModel.from_pretrained(MODEL),
    tokenizer=AutoTokenizer.from_pretrained(MODEL),
    dataset=src_dataset,
    k=512,
    name="codebert-demo-ksae",
)

# Step.4 Load the trained ksae and specify k and the layer.
ksae = KSAE("codebert-demo-ksae", k=512, layer=11)

# Step.5 Create a TAMSelector object and specify the output path.
selector = TAMSelector(
    model=AutoModel.from_pretrained(MODEL),
    tokenizer=AutoTokenizer.from_pretrained(MODEL),
    src_dataset=src_dataset,
    tgt_dataset=tgt_dataset,
    outputs_prefix="./outputs/demo",
)

# Step.6 Sequentially execute tasks for the activation space, representation space, and code space, followed by the fusion task.
selector.run_activation_space(ksae=ksae, topk=16)
selector.run_representation_space(topk=16)
selector.run_code_space(lang="c", topk=16)
selector.fusion(act_topk=16, rep_topk=16, cod_topk=16, strategy="intersection")

# Step.7 If you prefer not to execute step-by-step, you can directly use easy_run, which is equivalent to the step-by-step method.
selector.easy_run(ksae=ksae, topk=2, lang="c", strategy="union")

# If you want to get the json data for a specific space, you can save it as follows.
TAMDataset(
    data_path="./testcase/src_dataset.json",
    topk_dict="./outputs/demo/activation_space/top2_dict.json",
).save("./demo_act_space_top2.json")
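The fusion step combines the per-space selections according to the chosen strategy. As an illustration only (not TAMSelector's actual implementation), "intersection" keeps samples ranked top-k in every space, while "union" keeps samples ranked top-k in at least one space; the index sets below are made up:

```python
# Hypothetical per-space top-k sample indices; TAMSelector derives these
# from its activation-, representation-, and code-space scores.
act_topk = {0, 3, 5, 8}
rep_topk = {0, 2, 5, 9}
cod_topk = {0, 5, 6, 8}

# strategy="intersection": keep samples valued highly in all three spaces.
intersection = act_topk & rep_topk & cod_topk  # {0, 5}

# strategy="union": keep samples valued highly in at least one space.
union = act_topk | rep_topk | cod_topk  # {0, 2, 3, 5, 6, 8, 9}
```

Intersection yields a smaller, higher-confidence subset, while union yields broader coverage, which matches the topk=16 versus topk=2 settings used in the demo above.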

Related Repositories

https://github.com/EleutherAI/sparsify

https://github.com/k4black/codebleu
