Data Leakage #57

@Placebosaurus

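Data leaks between the training and testing splits, particularly in BGL/data_process.py and TBird/data_process.py.
Both scripts sample overlapping sliding windows and shuffle them before the split is taken.

As a result, consecutive windows that share log lines can land on opposite sides of the split, so subsequences of normal logs appear in both the training and testing splits.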

A clean temporal split would avoid this issue: sort by log date, split, then shuffle only within each split, and discard abnormal sequences from the train/val sets. This would also keep the model from looking into the future during training. If log key distributions shift over time, the choice of split could affect the reported metrics.
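
The split described above could look roughly like the following. This is only a sketch, not the repository's code: the function name `temporal_split`, the `start_time` column, and the toy data are assumptions made for illustration, and a validation split is left out for brevity. Because the windows overlap, the sketch also drops training windows that still extend past the start of the first test window.

```python
import pandas as pd

def temporal_split(windows, window_size, train_ratio=0.8,
                   time_col="start_time", label_col="Label", seed=12):
    """Chronological split; shuffling happens only inside each split.

    `time_col` and `window_size` must use the same unit (seconds here).
    All names are hypothetical, not taken from logbert.
    """
    ordered = windows.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_ratio)
    train, test = ordered.iloc[:cut], ordered.iloc[cut:]
    if len(test) > 0:
        # Drop training windows that still reach past the first test window,
        # otherwise overlapping windows would straddle the boundary.
        boundary = test[time_col].iloc[0]
        train = train[train[time_col] + window_size <= boundary]
    # Train on normal windows only; the test split keeps both classes.
    train = train[train[label_col] == 0]
    train = train.sample(frac=1, random_state=seed).reset_index(drop=True)
    test = test.sample(frac=1, random_state=seed).reset_index(drop=True)
    return train, test

# Toy data: ten 300-second windows starting every 60 seconds, two abnormal.
toy = pd.DataFrame({
    "start_time": [i * 60 for i in range(10)],
    "Label": [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
})
train, test = temporal_split(toy, window_size=300)
```

Dropping the boundary windows costs some training data, but with an 80% overlap it is the only way to guarantee that no log line is shared across the split.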


80% overlap in consecutive BGL windows:

logbert/BGL/data_process.py

Lines 124 to 125 in a845e61

window_size = 5
step_size = 1

logbert/BGL/data_process.py

Lines 145 to 154 in a845e61

# sampling with sliding window
deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
para={"window_size": int(window_size)*60, "step_size": int(step_size) * 60}
)
#########
# Train #
#########
df_normal =deeplog_df[deeplog_df["Label"] == 0]
df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle

50% overlap in consecutive TBird windows:
window_size = 1
step_size = 0.5

# sampling with sliding window
deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
para={"window_size": float(window_size)*60, "step_size": float(step_size) * 60}
)
output_dir += window_name
#########
# Train #
#########
df_normal = deeplog_df[deeplog_df["Label"] == 0]
df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle
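
The overlap figures follow from `1 - step_size / window_size`. The toy simulation below also estimates how many test windows intersect at least one training window after a shuffle-then-split. It is a sketch under assumptions: `overlap_fraction` and `leaked_share` are hypothetical helpers, windows are reduced to their start times, and the 80/20 ratio is illustrative rather than taken from the scripts.

```python
import random

def overlap_fraction(window_size, step_size):
    """Fraction of a window's time span shared with the next window."""
    return max(0.0, 1.0 - step_size / window_size)

def leaked_share(n_windows, window_size, step_size, train_ratio=0.8, seed=12):
    """Share of test windows whose time span intersects a training window
    when the windows are shuffled before splitting."""
    starts = [i * step_size for i in range(n_windows)]
    random.Random(seed).shuffle(starts)
    cut = int(n_windows * train_ratio)
    train, test = starts[:cut], starts[cut:]
    # Two equal-length windows intersect iff their starts differ by less
    # than the window length.
    leaked = sum(any(abs(t - s) < window_size for s in train) for t in test)
    return leaked / len(test)

bgl_overlap = overlap_fraction(5, 1)      # BGL: 5-minute window, 1-minute step
tbird_overlap = overlap_fraction(1, 0.5)  # TBird: 1-minute window, 30-second step
bgl_leak = leaked_share(1000, window_size=5, step_size=1)
no_overlap_leak = leaked_share(1000, window_size=5, step_size=5)
```

With a step equal to the window size, no test window intersects a training window, which is the situation the HDFS sessions are in.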

In HDFS, there is no overlap between sessions, but shuffling before the split could still be an issue, since training sessions can then come later in time than test sessions:

normal_seq = seq[seq["Label"] == 0]["EventSequence"]
normal_seq = normal_seq.sample(frac=1, random_state=20) # shuffle normal data
abnormal_seq = seq[seq["Label"] == 1]["EventSequence"]
normal_len, abnormal_len = len(normal_seq), len(abnormal_seq)
train_len = n if n else int(normal_len * ratio)
print("normal size {0}, abnormal size {1}, training size {2}".format(normal_len, abnormal_len, train_len))
train = normal_seq.iloc[:train_len]
test_normal = normal_seq.iloc[train_len:]
test_abnormal = abnormal_seq
