Balanced Sentence Experiments

Previous experiments


Experiment: Search for antipodal pair representation of adjectives in gpt2

Based on the LessWrong article "The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable", we want to find projections containing pairs of adjectives, or pairs of names, with antipodal characteristics.

Experiment Exploratory Analysis:

  • Direct logit attribution
  • Layer Attribution
  • Head Attribution
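As a rough illustration of direct logit attribution, the sketch below splits a logit difference across the components that write to the residual stream. It is a toy under stated assumptions: the component names, vectors, and unembedding rows are made up, the final residual stream is taken to be the plain sum of the component outputs, and the final LayerNorm is ignored.

```python
# Toy direct logit attribution; all names and numbers are illustrative,
# not taken from gpt2.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def direct_logit_attribution(component_outputs, unembed, token_a, token_b):
    """Split logit[token_a] - logit[token_b] across components.

    component_outputs: dict name -> vector written to the residual stream
    unembed: dict token -> unembedding vector for that token
    Assumes the final residual stream is the sum of the component outputs
    and ignores the final LayerNorm.
    """
    direction = [a - b for a, b in zip(unembed[token_a], unembed[token_b])]
    return {name: dot(out, direction) for name, out in component_outputs.items()}

# Hypothetical outputs at the final position of a 3-dimensional residual stream.
outputs = {
    "embed": [0.5, 0.0, 0.5],
    "L0.attn": [1.0, -1.0, 0.0],
    "L0.mlp": [0.0, 2.0, 0.0],
}
unembed = {" good": [1.0, 0.0, 0.0], " bad": [0.0, 1.0, 0.0]}

attribution = direct_logit_attribution(outputs, unembed, " good", " bad")
for name, value in attribution.items():
    print(f"{name:8s} {value:+.2f}")
```

Layer attribution and head attribution follow the same pattern: the keys of `component_outputs` would be whole layers or individual attention heads, and the contributions still sum to the total logit difference.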

Patching experiments

  • Residual stream patching
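The idea behind residual stream patching can be sketched on a toy model: run a clean and a corrupted input, then copy the clean residual stream into the corrupted run at one (layer, position) at a time and see how much of the clean output is restored. Everything below is a hypothetical stand-in (a two-position, scalar residual stream with three hand-written layers), not gpt2.

```python
# Toy residual stream patching; the "model" is a hypothetical stand-in.

def run(layers, tokens, patch=None):
    """Run resid <- resid + layer(resid) for each layer.

    Returns the final residual stream and a cache of the residual stream
    entering each layer. patch = (layer, position, value) overwrites one
    position of the residual stream entering that layer.
    """
    resid, cache = list(tokens), []
    for i, layer in enumerate(layers):
        if patch is not None and patch[0] == i:
            _, pos, value = patch
            resid[pos] = value
        cache.append(list(resid))
        resid = [r + d for r, d in zip(resid, layer(resid))]
    return resid, cache

layers = [
    lambda r: [0.5 * x for x in r],  # per-position "MLP"
    lambda r: [0.0, r[0]],           # "attention": position 1 reads position 0
    lambda r: list(r),               # per-position "MLP"
]

def metric(out):
    # Stand-in for the logit difference at the last position.
    return out[-1]

clean, corrupted = [1.0, 0.0], [-1.0, 0.0]
clean_out, clean_cache = run(layers, clean)
corrupted_out, _ = run(layers, corrupted)

# restored[layer][pos]: fraction of the clean metric recovered by the patch.
restored = []
for layer in range(len(layers)):
    row = []
    for pos in range(len(clean)):
        patch = (layer, pos, clean_cache[layer][pos])
        patched_out, _ = run(layers, corrupted, patch=patch)
        row.append((metric(patched_out) - metric(corrupted_out))
                   / (metric(clean_out) - metric(corrupted_out)))
    restored.append(row)

print(restored)
```

In this toy, patching position 0 restores the output until the "attention" layer has moved the information, after which only position 1 matters. On a real model the same grid is read the same way to locate where the relevant information lives.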

More experiments

  • Examine the embedding space using SVD
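One way to look for antipodal pairs in an embedding space is to take the SVD of the embedding matrix and read off the tokens at the two extremes of each singular direction. The sketch below does this with NumPy on a tiny hand-made matrix. The vocabulary and vectors are invented for illustration; with gpt2 the rows would come from the model's token embedding matrix.

```python
import numpy as np

# Hypothetical 5-token vocabulary with 3-dimensional embeddings.
vocab = ["good", "bad", "tall", "short", "the"]
E = np.array([
    [ 3.0,  0.0, 0.0],
    [-3.0,  0.0, 0.0],
    [ 0.0,  1.0, 0.0],
    [ 0.0, -1.0, 0.0],
    [ 0.0,  0.0, 0.1],
])

# Rows of Vh are the right singular vectors (directions in embedding space),
# ordered by decreasing singular value.
U, S, Vh = np.linalg.svd(E, full_matrices=False)

def antipodal_pair(k):
    """Tokens at the two extremes of the k-th singular direction."""
    scores = E @ Vh[k]
    return vocab[int(np.argmax(scores))], vocab[int(np.argmin(scores))]

pairs = [antipodal_pair(k) for k in range(2)]
print(pairs)
```

The sign of a singular vector is arbitrary, so the order within each pair carries no meaning; only the fact that the two tokens sit at opposite ends of the direction does.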

Each of the following experiments should be run for both polarities of the conjunction; we expect induction behavior for one polarity and extrapolation for the other.


Experiment: 1

Dataset: Outer semantic relation (positive and)

Intervention: Change the polarity of the conjunction and compare the logit difference for both positive and negative conjunctions.
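A minimal sketch of the logit difference comparison is shown below. The candidate tokens, the choice of "but" as the negative conjunction, and the logit values are all placeholders for illustration, not measurements from gpt2.

```python
def logit_diff(logits, token_a, token_b):
    """Return logit[token_a] - logit[token_b] at the final position."""
    return logits[token_a] - logits[token_b]

# Placeholder next-token logits, NOT measured from gpt2. "and" stands for the
# positive conjunction; "but" is an assumed negative conjunction.
logits_by_conjunction = {
    "and": {" happy": 2.0, " sad": 0.5},
    "but": {" happy": 0.25, " sad": 1.75},
}

diffs = {
    conj: logit_diff(logits, " happy", " sad")
    for conj, logits in logits_by_conjunction.items()
}

# Effect of the intervention: how far flipping the conjunction moves the
# logit difference.
effect = diffs["and"] - diffs["but"]
print(diffs, effect)
```

A large change in sign or magnitude between the two polarities would indicate that the model's prediction depends on the conjunction.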


Experiment: 2

Dataset: Outer semantic relation (gender sensitive)

Intervention: Change the polarity of the conjunction


Experiment: 3

Dataset: Inner semantic relation of both noun and adjective

Intervention: Change the polarity of the conjunction


Experiment: 4

Dataset: Inner semantic relation of both noun and adjective (gender sensitive)

Intervention: Change the polarity of the conjunction


Experiment: 5

Dataset: Inner semantic relation of adjectives

Intervention: Change the polarity of the conjunction


Experiment: 6

Dataset: Outer semantic relation

Intervention: Interchange the nouns


Experiment: 7

Dataset: Outer semantic relation (gender sensitive)

Intervention: Interchange the nouns


Experiment: 8

Dataset: Inner semantic relation of both nouns and adjectives

Intervention: Replace one of the nouns with a random noun


Experiment: 9

Dataset: Inner semantic relation of both noun and adjective (gender sensitive)

Intervention: Replace one of the nouns with a random noun