Custom data + negative samples #46
Description
Hello professor Zhang,
I'm experimenting with a custom dataset using the InMemoryDataset class. In my link prediction task, an edge is labeled true if it goes from a chemical to a disease with a negative correlation (meaning the chemical treats the disease). All other edges are labeled false. My sample graph has about 30k nodes and around 1M edges. The dataset therefore contains a list of edges and a list of corresponding labels, plus a node feature list and an edge feature list.
My question is about how negative sampling works in your code. It appears that negative sampling is done to produce instances for the model to recognize as false. But I don't know whether my dataset makes sense for this code, because it already has false labels. The code seems to expect all input edges to be true and then generates negative samples to supplement them, whereas my data preparation already produces both true and false edges. Do I need to restructure my input dataset? For example, should I include only true/positive edges in the edge input list ([x, y], [a, b] ... ) and let the split_edge dict separate pos/neg edges?
Or is it possible to have the model use the negative edges that I already labeled, for example by sorting the neg/pos edges into the split_edge dict manually? I'm also trying to keep a pos/neg ratio of 1:10.
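To make the manual option concrete, here is a rough sketch of what I have in mind. It is only an illustration: `build_split_edge` is a hypothetical helper of mine, not part of your code, and the `'edge'` / `'edge_neg'` keys are my assumption about the layout the split_edge dict expects. Labels are assumed to be 1 for true edges and 0 for false ones.

```python
import random


def build_split_edge(edges, labels, neg_per_pos=10,
                     val_frac=0.1, test_frac=0.1, seed=0):
    """Sort pre-labeled edges into a split_edge-style dict.

    Hypothetical helper; the 'edge' / 'edge_neg' key names are an assumption.
    """
    rng = random.Random(seed)
    pos = [e for e, y in zip(edges, labels) if y == 1]
    neg = [e for e, y in zip(edges, labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Keep at most `neg_per_pos` labeled negatives per positive (1:10 by default).
    neg = neg[: neg_per_pos * len(pos)]

    def cut(items):
        # Carve validation and test slices off the front; the rest is training.
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        return {
            "valid": items[:n_val],
            "test": items[n_val:n_val + n_test],
            "train": items[n_val + n_test:],
        }

    pos_parts, neg_parts = cut(pos), cut(neg)
    return {
        name: {"edge": pos_parts[name], "edge_neg": neg_parts[name]}
        for name in ("train", "valid", "test")
    }
```

Because positives and negatives are cut with the same fractions, each split keeps roughly the same 1:10 ratio as the whole dataset, as long as there are enough labeled negatives to begin with.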
One last question: does this implementation keep node embeddings limited to the training data, so that they are not reused on the test data, to prevent possible data leakage? My project leader wants to confirm this part.
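For context, the only check I can run on my side is at the edge level, along the lines of the sketch below. `leaked_edges` is a hypothetical helper of mine, and it assumes the graph used for training is built from the training positive edges only. It does not tell me how embeddings are handled inside the model, which is the part I'm asking about.

```python
def leaked_edges(train_pos_edges, held_out_edges, undirected=True):
    """Return held-out (valid/test) edges that also appear among the training edges.

    Hypothetical helper. With undirected=True, (u, v) and (v, u) count as the
    same edge, which matters if the training graph is symmetrized.
    """
    def key(edge):
        u, v = edge
        return (min(u, v), max(u, v)) if undirected else (u, v)

    train_keys = {key(e) for e in train_pos_edges}
    return [e for e in held_out_edges if key(e) in train_keys]
```

An empty result means no validation or test edge is visible in the training graph. Since my edges are directed (chemical to disease), I would run it with both settings of `undirected`.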
Any help is appreciated!