Description
This affects BGL/data_process.py and TBird/data_process.py in particular.
Both scripts sample overlapping moving windows, which are later shuffled.
This causes subsequences of normal logs to appear in both the training and testing splits.
A clean temporal split would avoid this issue: sort by log date, split, shuffle within each split, and discard abnormal sequences from the train/val sets. This would also keep the model from looking into the future during training. If log key distributions shift over time, that look-ahead could affect the reported metrics.
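The split described above could look roughly like the following. This is a minimal sketch, not the repository's code: the function `temporal_split`, the scalar column `window_start`, and the `purge` parameter are assumptions introduced for illustration. `purge` drops the first few windows after the cut, since overlapping windows on either side of the boundary would otherwise still share log lines.

```python
import pandas as pd

def temporal_split(windows, train_frac=0.8, time_col="window_start",
                   label_col="Label", purge=0, seed=12):
    """Split chronologically first, then shuffle inside each split."""
    # Order windows by time so that the cut separates past from future.
    ordered = windows.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    train = ordered.iloc[:cut]
    # Skip `purge` windows after the cut to avoid overlap across the boundary.
    test = ordered.iloc[cut + purge:]
    # Train only on normal windows; abnormal ones are discarded from training.
    train = train[train[label_col] == 0]
    # Shuffling is now harmless because it happens within each split.
    train = train.sample(frac=1, random_state=seed).reset_index(drop=True)
    test = test.sample(frac=1, random_state=seed).reset_index(drop=True)
    return train, test

# Toy data (hypothetical): ten windows in arbitrary order, two of them abnormal.
toy = pd.DataFrame({
    "window_start": [3, 0, 7, 1, 9, 4, 2, 8, 5, 6],
    "Label":        [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})
train, test = temporal_split(toy, train_frac=0.6, purge=1)
```

With a 5-minute window and a 1-minute step, a `purge` of 4 would be needed to remove every test window that overlaps the last training window.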
80% overlap in consecutive BGL windows:
Lines 124 to 125 in a845e61
    window_size = 5
    step_size = 1
Lines 145 to 154 in a845e61
    # sampling with sliding window
    deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
    para={"window_size": int(window_size)*60, "step_size": int(step_size) * 60}
    )
    #########
    # Train #
    #########
    df_normal =deeplog_df[deeplog_df["Label"] == 0]
    df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle
50% overlap in consecutive TBird windows:
Lines 110 to 111 in a845e61
    window_size = 1
    step_size = 0.5
Lines 144 to 154 in a845e61
    # sampling with sliding window
    deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
    para={"window_size": float(window_size)*60, "step_size": float(step_size) * 60}
    )
    output_dir += window_name
    #########
    # Train #
    #########
    df_normal = deeplog_df[deeplog_df["Label"] == 0]
    df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle
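The overlap percentages quoted for BGL and TBird follow from `1 - step_size / window_size`. The sketch below also gives a rough idea of how many test windows end up overlapping a training window after shuffle-then-split. It is an idealized simulation over evenly spaced window start times, not a run on the real data, and both function names are hypothetical.

```python
import random

def overlap_fraction(window_size, step_size):
    """Share of a window that is also covered by the next window."""
    return max(0.0, 1.0 - step_size / window_size)

def leaked_test_fraction(n_windows, window_size, step_size,
                         train_frac=0.8, seed=12):
    """Fraction of test windows overlapping at least one training window
    when windows are shuffled before the train/test split."""
    # Idealized assumption: windows start at evenly spaced times.
    starts = [i * step_size for i in range(n_windows)]
    rng = random.Random(seed)
    rng.shuffle(starts)
    cut = int(n_windows * train_frac)
    train_starts, test_starts = starts[:cut], starts[cut:]
    # Two equal-length windows overlap when their starts differ by less
    # than the window length.
    leaked = sum(
        any(abs(t - s) < window_size for s in train_starts)
        for t in test_starts
    )
    return leaked / len(test_starts)

bgl_overlap = overlap_fraction(5, 1)        # BGL settings
tbird_overlap = overlap_fraction(1, 0.5)    # TBird settings
bgl_leak = leaked_test_fraction(1000, 5, 1)
tbird_leak = leaked_test_fraction(1000, 1, 0.5)
```

Under these assumptions, a test window stays clean only if all of its overlapping neighbours also land in the test split, which becomes unlikely as the overlap grows.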
In HDFS, there is no overlap between sessions, but the shuffling could still be an issue:
Lines 91 to 101 in a845e61
    normal_seq = seq[seq["Label"] == 0]["EventSequence"]
    normal_seq = normal_seq.sample(frac=1, random_state=20) # shuffle normal data
    abnormal_seq = seq[seq["Label"] == 1]["EventSequence"]
    normal_len, abnormal_len = len(normal_seq), len(abnormal_seq)
    train_len = n if n else int(normal_len * ratio)
    print("normal size {0}, abnormal size {1}, training size {2}".format(normal_len, abnormal_len, train_len))
    train = normal_seq.iloc[:train_len]
    test_normal = normal_seq.iloc[train_len:]
    test_abnormal = abnormal_seq
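For HDFS, one possible alternative is to order sessions by the time of their first event instead of shuffling before the split. The sketch below assumes a per-session column `FirstTimestamp`, which is hypothetical and would have to be derived from the raw logs. The function name `split_sessions_by_time` is also an illustration, not part of the repository.

```python
import pandas as pd

def split_sessions_by_time(seq, ratio=0.8, n=None, time_col="FirstTimestamp"):
    """Chronological variant of the HDFS split: earliest normal sessions
    go to train, later normal sessions go to test."""
    # Sort normal sessions by first-event time instead of shuffling them.
    normal = seq[seq["Label"] == 0].sort_values(time_col, kind="stable")
    abnormal_seq = seq[seq["Label"] == 1]["EventSequence"]
    train_len = n if n else int(len(normal) * ratio)
    train = normal["EventSequence"].iloc[:train_len]
    test_normal = normal["EventSequence"].iloc[train_len:]
    return train, test_normal, abnormal_seq

# Toy sessions (hypothetical values).
seq = pd.DataFrame({
    "EventSequence": [["E1", "E2"], ["E1"], ["E3"], ["E9"], ["E2", "E3"]],
    "Label": [0, 0, 0, 1, 0],
    "FirstTimestamp": [30, 10, 50, 20, 40],
})
train, test_normal, test_abnormal = split_sessions_by_time(seq, ratio=0.5)
```

The training sessions can still be shuffled afterwards, since shuffling within a split does not mix past and future.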