Experiment: Search for antipodal-pair representations of adjectives in GPT-2
Based on the LessWrong article "The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable", we want to find projections containing pairs of adjectives, or pairs of names, with antipodal characteristics.
Exploratory analysis:
Direct logit attribution
Layer Attribution
Head Attribution
Patching experiments
Residual stream patching
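The attribution and patching analyses above can be sketched as follows. This is a minimal illustration under stated assumptions, not the experiment's implementation: every vector is a hypothetical toy value (in a real run they would come from GPT-2's unembedding matrix and the cached residual-stream components), the component names are illustrative, and the final LayerNorm is ignored, so the per-component contributions add up exactly to the total logit difference only in this linear toy setting.

```python
# Hypothetical toy values throughout; a real run would read them from GPT-2.


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def logit_diff_direction(unembed, tok_a, tok_b):
    """Residual-space direction whose dot product gives logit(a) - logit(b)."""
    return [x - y for x, y in zip(unembed[tok_a], unembed[tok_b])]


def direct_logit_attribution(components, direction):
    """Project each residual-stream component onto the logit-difference direction.

    Ignores the final LayerNorm (assumption of this toy sketch).
    """
    return {name: dot(vec, direction) for name, vec in components.items()}


def normalized_patching_score(patched, clean, corrupted):
    """0 = patching recovered nothing, 1 = it fully restored the clean logit difference."""
    return (patched - corrupted) / (clean - corrupted)


# Hypothetical 3-d unembedding columns for an antipodal adjective pair.
unembed = {"hot": [1.0, 0.5, 0.0], "cold": [-1.0, 0.5, 0.0]}

# Hypothetical contributions of individual components to the final residual stream.
components = {
    "embed": [0.2, 1.0, 0.3],
    "L0H3": [1.5, -0.2, 0.0],
    "L1_mlp": [-0.4, 0.1, 2.0],
}

direction = logit_diff_direction(unembed, "hot", "cold")
attr = direct_logit_attribution(components, direction)
total = sum(attr.values())  # clean logit difference in this linear toy

# Hypothetical logit differences for a corrupted run and a run with one activation patched.
score = normalized_patching_score(patched=1.5, clean=total, corrupted=-0.4)

print(attr)  # {'embed': 0.4, 'L0H3': 3.0, 'L1_mlp': -0.8}
print(score)
```

Layer and head attribution are the same projection applied to components grouped per layer or per head; residual stream patching then checks whether copying one activation from the clean run into the corrupted run moves the score toward 1.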
More experiments
Examine the embedding space using SVD
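The SVD idea can be sketched as below: take the top singular directions of an embedding matrix and, for each direction, read off the tokens with the most positive and the most negative projection; an antipodal pair should sit at opposite ends. This is a hypothetical, dependency-free sketch: the 4-token, 3-dimensional embedding is invented, and the top singular vectors are found by power iteration with deflation on E^T E rather than a library SVD. For GPT-2's real embedding matrix one would instead use a library routine such as `numpy.linalg.svd` or `torch.linalg.svd`.

```python
import math


def gram(E):
    """Return E^T E (d x d) for an embedding matrix E given as a list of token rows."""
    d = len(E[0])
    return [[sum(row[i] * row[j] for row in E) for j in range(d)] for i in range(d)]


def top_eigvec(G, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix G."""
    v = [1.0] * len(G)
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # start vector has no component left to amplify
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(len(v))) for i in range(len(v)))
    return lam, v


def singular_directions(E, k):
    """Top-k (singular value, right singular vector) pairs of E via deflation of E^T E."""
    G = gram(E)
    out = []
    for _ in range(k):
        lam, v = top_eigvec(G)
        out.append((math.sqrt(max(lam, 0.0)), v))
        # Deflate: subtract the found component so the next pass finds the next direction.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(len(v))] for i in range(len(v))]
    return out


def antipodal_pair(tokens, E, v):
    """Tokens with the most positive and the most negative projection onto direction v."""
    proj = {t: sum(a * b for a, b in zip(row, v)) for t, row in zip(tokens, E)}
    return max(proj, key=proj.get), min(proj, key=proj.get)


# Hypothetical toy embedding: one axis roughly encodes good/bad, another hot/cold.
tokens = ["good", "bad", "hot", "cold"]
E = [
    [3.0, 0.0, 0.1],
    [-3.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
]

for s, v in singular_directions(E, k=2):
    print(round(s, 3), antipodal_pair(tokens, E, v))
```

Singular vectors are only defined up to sign, so which end of a direction counts as "positive" is arbitrary; only the pairing of the two extremes carries meaning.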
Each of the following experiments should be run for both polarities of the conjunction; we expect one polarity to produce induction behavior and the other extrapolation.
Experiment 1
Dataset: Outer semantic relation (positive "and")
Intervention: Change the polarity of the conjunction and check the logit difference for both the positive and the negative conjunction.
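The conjunction-swap intervention can be sketched as follows. This is a hypothetical sketch, not the original setup: the template, the adjective pairs, and the `next_token_logits` callable (a prompt goes in, a dict of next-token logits keyed by token string comes out) are assumptions. It also assumes each adjective is a single GPT-2 token when preceded by a space. Prompts differ only in the conjunction, and each is scored by logit(antipode) minus logit(same adjective).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical pairs and template, chosen only for illustration.
ADJ_PAIRS: List[Tuple[str, str]] = [("hot", "cold"), ("happy", "sad")]
TEMPLATE = "The weather was {adj} {conj}"
CONJUNCTIONS = ("and", "but")  # positive and negative polarity

Logits = Dict[str, float]


def build_prompts(pairs: List[Tuple[str, str]]) -> List[dict]:
    """One record per (adjective pair, conjunction); only the conjunction changes."""
    records = []
    for adj, antipode in pairs:
        for conj in CONJUNCTIONS:
            records.append({
                "prompt": TEMPLATE.format(adj=adj, conj=conj),
                "conj": conj,
                "same": adj,
                "antipode": antipode,
            })
    return records


def logit_diff(next_token_logits: Callable[[str], Logits], record: dict) -> float:
    """logit(antipode) - logit(same); positive means the antipode is preferred.

    Keys carry a leading space because GPT-2 tokenizes mid-sentence words that way.
    """
    logits = next_token_logits(record["prompt"])
    return logits[" " + record["antipode"]] - logits[" " + record["same"]]


def polarity_effect(next_token_logits: Callable[[str], Logits],
                    pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Mean logit difference per conjunction, so the two polarities can be compared."""
    diffs: Dict[str, List[float]] = {c: [] for c in CONJUNCTIONS}
    for record in build_prompts(pairs):
        diffs[record["conj"]].append(logit_diff(next_token_logits, record))
    return {c: sum(v) / len(v) for c, v in diffs.items()}
```

For GPT-2, `next_token_logits` would wrap a forward pass and read the logits at the final position. Comparing the "and" and "but" entries of `polarity_effect` gives the per-polarity logit difference; which sign corresponds to induction and which to extrapolation is what the experiment is meant to establish, not something this sketch assumes.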