Data Leakage

The leak is most pronounced in `BGL/data_process.py` and `TBird/data_process.py`.
Both scripts sample overlapping sliding windows and shuffle them before splitting into train and test sets.

As a result, subsequences of normal logs appear in both the training and testing splits.
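To give a rough sense of the scale of the problem, here is a minimal toy sketch. It is not the repository's code: it uses index-based windows instead of the time-based windows in the scripts, and the function names, the 1000-line log, and the 0.4 train ratio are all assumptions chosen for illustration.

```python
import random

def sliding_windows(n_events, window, step):
    """Return (start, end) index pairs of fixed-size windows over n_events log lines."""
    return [(s, s + window) for s in range(0, n_events - window + 1, step)]

def leaked_fraction(n_events=1000, window=5, step=1, train_ratio=0.4, seed=12):
    """Fraction of test windows sharing at least one log line with a training window."""
    windows = sliding_windows(n_events, window, step)
    random.Random(seed).shuffle(windows)      # shuffle first, then split (the pattern at issue)
    n_train = int(len(windows) * train_ratio)
    train, test = windows[:n_train], windows[n_train:]
    train_lines = {i for s, e in train for i in range(s, e)}
    leaked = sum(any(i in train_lines for i in range(s, e)) for s, e in test)
    return leaked / len(test)

if __name__ == "__main__":
    print(f"overlapping windows: {leaked_fraction():.1%} of test windows touch training data")
    print(f"disjoint windows:    {leaked_fraction(step=5):.1%}")
```

With disjoint windows (`step == window`) no log line is shared, so any leakage in this toy setup comes from combining overlap with a shuffle before the split.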

<img width="500" height= auto alt="Image" src="https://github.com/user-attachments/assets/7f44e651-46e5-4c95-9795-163caf078a4f" />

A clean temporal split would avoid this issue: sort by log timestamp, split, shuffle only within each split, and discard abnormal sequences from the train/val sets. This would also prevent the model from looking into the future during training. If log key distributions shift over time, metrics obtained with a temporal split could differ from those obtained with the current shuffled split.
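The steps above could look roughly like the following sketch. It is only an illustration under stated assumptions: the `Window` record, the `temporal_split` function, and the 0.4 train ratio are hypothetical and do not exist in the repository. Windows that straddle the cutoff are dropped, so that no log line is shared between the two sides.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Window:
    """Hypothetical record for one sliding window (not a type from the repository)."""
    start: float                  # timestamp of the window's left edge
    end: float                    # timestamp of the window's right edge (exclusive)
    label: int                    # 1 if the window contains an anomaly, else 0
    events: list = field(default_factory=list)  # log keys (EventIds) in order

def temporal_split(windows, train_ratio=0.4, seed=12):
    """Split by time, then shuffle only inside the training split.

    Assumes a non-empty list of windows and 0 < train_ratio < 1.
    """
    ordered = sorted(windows, key=lambda w: w.start)
    cutoff = ordered[int(len(ordered) * train_ratio)].start
    # Train: normal windows that end at or before the cutoff.
    train = [w for w in ordered if w.end <= cutoff and w.label == 0]
    # Test: every window that starts at or after the cutoff, kept in time order.
    test = [w for w in ordered if w.start >= cutoff]
    # Windows straddling the cutoff fall into neither split.
    random.Random(seed).shuffle(train)
    return train, test

if __name__ == "__main__":
    # Toy data: window of 5 time units, step of 1, every 10th window abnormal.
    toy = [Window(s, s + 5, int(s % 10 == 0)) for s in range(96)]
    train, test = temporal_split(toy)
    print(len(train), "train windows,", len(test), "test windows")
```

A validation split could be carved out of the training period the same way, with a second cutoff.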

---

80% overlap in consecutive BGL windows: 
https://github.com/HelenGuohx/logbert/blob/a845e612e919a6930065b69a40098eaaac26e1a3/BGL/data_process.py#L124-L125
https://github.com/HelenGuohx/logbert/blob/a845e612e919a6930065b69a40098eaaac26e1a3/BGL/data_process.py#L145-L154
50% overlap in consecutive TBird windows:
https://github.com/HelenGuohx/logbert/blob/a845e612e919a6930065b69a40098eaaac26e1a3/TBird/data_process.py#L110-L111
https://github.com/HelenGuohx/logbert/blob/a845e612e919a6930065b69a40098eaaac26e1a3/TBird/data_process.py#L144-L154
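The overlap percentages follow from the window and step sizes: consecutive windows share `1 - step_size / window_size` of their span. A small sketch of that arithmetic, using the parameter values quoted in this issue (the helper name `overlap_fraction` is made up for illustration):

```python
def overlap_fraction(window_size, step_size):
    """Fraction of a window's span shared with the next window (0.0 if they are disjoint)."""
    return max(0.0, 1.0 - step_size / window_size)

if __name__ == "__main__":
    print(f"BGL   (window_size=5, step_size=1):   {overlap_fraction(5, 1):.0%}")
    print(f"TBird (window_size=1, step_size=0.5): {overlap_fraction(1, 0.5):.0%}")
```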

In HDFS, there is no overlap between sessions, but the shuffling could still be an issue:
https://github.com/HelenGuohx/logbert/blob/a845e612e919a6930065b69a40098eaaac26e1a3/HDFS/data_process.py#L91-L101

From `BGL/data_process.py`:

	# sampling with sliding window
	deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
	para={"window_size": int(window_size) * 60, "step_size": int(step_size) * 60}
	)

	#########
	# Train #
	#########
	df_normal = deeplog_df[deeplog_df["Label"] == 0]
	df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle

From `TBird/data_process.py`:

	# sampling with sliding window
	deeplog_df = sliding_window(df[["timestamp", "Label", "EventId", "deltaT"]],
	para={"window_size": float(window_size) * 60, "step_size": float(step_size) * 60}
	)
	output_dir += window_name

	#########
	# Train #
	#########
	df_normal = deeplog_df[deeplog_df["Label"] == 0]
	df_normal = df_normal.sample(frac=1, random_state=12).reset_index(drop=True) #shuffle

From `HDFS/data_process.py`:

	normal_seq = seq[seq["Label"] == 0]["EventSequence"]
	normal_seq = normal_seq.sample(frac=1, random_state=20) # shuffle normal data

	abnormal_seq = seq[seq["Label"] == 1]["EventSequence"]
	normal_len, abnormal_len = len(normal_seq), len(abnormal_seq)
	train_len = n if n else int(normal_len * ratio)
	print("normal size {0}, abnormal size {1}, training size {2}".format(normal_len, abnormal_len, train_len))

	train = normal_seq.iloc[:train_len]
	test_normal = normal_seq.iloc[train_len:]
	test_abnormal = abnormal_seq


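Window parameters in `BGL/data_process.py` (80% overlap):

	window_size = 5
	step_size = 1

Window parameters in `TBird/data_process.py` (50% overlap):

	window_size = 1
	step_size = 0.5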
