Possible further work

  • supervised (LSTM-) community classifier.
  • The projection layer is a matrix L such that Lᵢ is a mapping from LSTM features to a predictor of community i. (So Lᵢ can be treated as a community embedding; see the classifier sketch after this list.)
  • Maybe it’s easier to do dependence over time? Dynamic processes can be represented like so:

    X = exp (tA) = 1 + tA + (tA)²/2 + … + (tA)ⁿ/n! + …

    where t is time (say in [0..1]) and A is a (learned) matrix generating the dynamics.

    Note that X is the identity for t = 0, so this should probably be another layer on top. The advantage of this treatment is that it cleanly separates change over time from the baseline at t = 0. Furthermore, there is a standard way to interpret tA as a transformation (see the exp(tA) sketch after this list).

    See: https://www.youtube.com/watch?v=IZqwi0wJovM&list=PL49CF3715CB9EF31D&index=23
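A minimal numerical sketch of the exp(tA) idea, assuming numpy/scipy; here `A` and the baseline embedding are random stand-ins for learned parameters:

```python
import numpy as np
from scipy.linalg import expm

d = 16
rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((d, d))   # stand-in for a learned matrix
baseline = rng.standard_normal(d)       # community embedding at t = 0

for t in (0.0, 0.5, 1.0):
    X = expm(t * A)                      # matrix exponential exp(tA); identity at t = 0
    evolved = X @ baseline               # embedding at time t
    print(t, np.linalg.norm(evolved - baseline))
```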
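For the first two items above, a minimal sketch (assuming PyTorch; all names are hypothetical) of a supervised LSTM community classifier whose projection layer doubles as a table of community embeddings:

```python
import torch
import torch.nn as nn

class CommunityClassifier(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_communities):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # Projection matrix L: row L_i maps LSTM features to a score for
        # community i, so L_i can be read off as the embedding of community i.
        self.projection = nn.Linear(hidden_size, n_communities, bias=False)

    def forward(self, tokens):                      # tokens: (batch, seq_len) token ids
        features, _ = self.lstm(self.embed(tokens))
        return self.projection(features[:, -1])     # logits over communities

    def community_embeddings(self):
        return self.projection.weight               # (n_communities, hidden_size)
```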

Paper outline

Hypotheses to test

  • Different communities have different levels of linguistic complexity.
  • Different language models are able to recognize such differences.
  • The level at which the community embedding is taken into account in the language model influences this.
  • Community vectors are correlated with extra-linguistic information.
  • Stability is correlated with language complexity.

Definition of Models/Data/Training

Results

  • Computation of loss per community per model (show all)
  • Computation of confusion matrix for community inference per model (e.g. show only the best and worst model) (on test set)
  • Computation of H(c,m), the entropy of community identification (on test set); see the short sketch after this list
  • Correlation between H(c,m) for various models. Identification of specific communities by hand
  • Correlation between H(c,m) and Stability(c).
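A short sketch (assumed names) of H(c,m), computed here as the entropy of column c of the confusion matrix W described under Experiments below:

```python
import numpy as np
from scipy.stats import entropy

def community_identification_entropy(W_m):
    """H(c, m) for each community c: entropy of the distribution over predicted
    communities for messages that truly come from c (column c of W for model m)."""
    return np.array([entropy(W_m[:, c]) for c in range(W_m.shape[1])])
```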

Bonus:

  • Correlation of community embeddings with “type” of community
  • Correlation of community embeddings between different models
  • Note that community embeddings obtained from a co-occurrence matrix don’t say much

At what level of linguistic analysis is variation happening between communities?

Experiments:

  • train a transformer model with an added community embedding at various levels.
  • test whether the model can predict which community a message comes from:
    • we could apply Bayes’ theorem (a sketch appears after this list):
      • P(comm | message) = P(message | comm) * P(comm) / P(message)
      • P(comm) is known (the prior)
      • P(message) = Σ(c) P(message | c) P(c), this is our normalisation factor.
      • P(message | comm) could be read out of the model.

      -> W(i,j) = P(community = i | message) for a message in community j -> this is a confusion matrix

    • leading idea:
      • to get the community embedding, multiply the matrix of community embeddings by a weight matrix W.
        • in training W is a one-hot encoding of the community.
          • Or, if data is the whole training set, W is the identity matrix
        • in testing, let W be optimised (and the rest is fixed)
          • actually it’s convenient to optimise the community embeddings all at once, and let W be a c×c matrix; the ith row is selected when facing a message from community i.
          • Then the entropy of each row of W (seen as a distribution) gives a level of confidence about the community.
          • the data is a single input (and everything that was a parameter before is fixed and thus treated as input data).
      • so W(i) tells you the distribution over communities for the set of messages we assigned to i.
      • the entropy of W(i) is low if community i is very specific from the point of view of our (linguistic) model (so messages can easily be identified as coming from a specific community; i.e. they get assigned a significantly higher loss if considered from the wrong community).
      • actually W can be seen as a confusion matrix, so we can boil it down to a combined F score measuring the ability to identify a community.
    • TODO: scatterplot for 2 models.
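A minimal sketch of the Bayes step and the confusion matrix W, assuming per-message losses (negative log-likelihoods) are available for each candidate community; all names are hypothetical:

```python
import numpy as np

def community_posterior(losses, priors):
    """P(c | message) ∝ P(message | c) * P(c), with P(message | c) = exp(-loss_c)."""
    log_post = -np.asarray(losses) + np.log(priors)
    log_post -= log_post.max()            # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()              # normalising by P(message)

def confusion_matrix_W(losses_per_message, priors, true_communities, n_communities):
    """W(i, j) = average P(community = i | message) over test messages from community j."""
    W = np.zeros((n_communities, n_communities))
    counts = np.zeros(n_communities)
    for losses, j in zip(losses_per_message, true_communities):
        W[:, j] += community_posterior(losses, priors)
        counts[j] += 1
    return W / counts                     # each column j sums to 1
```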

Analysis of model quality

  • P_m(i) = average loss per message, for community i, for model m.
  • Analysis of difference in information gains for simple/complex models:
  • Plot P_m(i) vs P_n(i) for each i (gives a scatterplot, which shows the information gain between m and n for each community; see the plotting sketch below).
    • e.g. bigram model loss vs. transformer model loss.
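A plotting sketch (assuming matplotlib; `P_m` and `P_n` are hypothetical dicts mapping community id to average per-message loss under each model):

```python
import matplotlib.pyplot as plt

def loss_scatter(P_m, P_n, label_m="bigram", label_n="transformer"):
    communities = sorted(P_m)
    xs = [P_m[c] for c in communities]
    ys = [P_n[c] for c in communities]
    plt.scatter(xs, ys)
    lim = (min(xs + ys), max(xs + ys))
    plt.plot(lim, lim, linestyle="--")     # y = x reference line: no information gain
    plt.xlabel(f"average loss per message ({label_m})")
    plt.ylabel(f"average loss per message ({label_n})")
    for c, x, y in zip(communities, xs, ys):
        plt.annotate(str(c), (x, y))       # label each community point
    plt.show()
```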

Analysis of community embeddings

  • Correlation with the extra-linguistic properties of communities?
  • Correlation between embeddings of several models.
    • Basic idea: compute the cosine similarity cos(c^m_i, c^m_j) for every pair of communities (i,j), and check the Pearson correlation of these values between two models m.
    • If r=1, then you have a perfect orthogonal mapping between the two embedding spaces.
    • This is a very strong condition, because all dimensions play some role. Another idea is to project onto the n most relevant dimensions first (sketched below).
      • For n=1 this is checking that the most relevant dimension is the same for both models.
      • To do this, do an LSA/SVD decomposition of the embeddings first and truncate at n dimensions: sklearn.decomposition.TruncatedSVD(n_components=n).
      • Then compute the Pearson correlation.
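A sketch of this comparison under the stated assumptions (`emb_m` and `emb_n` are hypothetical arrays of community embeddings, one row per community, in the same community order):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def embedding_agreement(emb_m, emb_n, n_components=None):
    """Pearson correlation between the pairwise cosine similarities of two
    sets of community embeddings, optionally after truncated SVD."""
    if n_components is not None:
        emb_m = TruncatedSVD(n_components=n_components).fit_transform(emb_m)
        emb_n = TruncatedSVD(n_components=n_components).fit_transform(emb_n)
    sim_m = cosine_similarity(emb_m)
    sim_n = cosine_similarity(emb_n)
    iu = np.triu_indices_from(sim_m, k=1)   # every unordered pair of communities (i, j)
    r, p = pearsonr(sim_m[iu], sim_n[iu])
    return r, p
```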

Open questions:

  • Where in the architecture should the embedding go in general?
  • Can the topically-driven language model smooth the embedding?

Related work:

Topic modelling

Pre-trained language models

Language models for sociolinguistic variation

Possible improvements/expansions

  • Some kind of probing to characterize what kind of variation is happening.
    • Idea: Take a BERT model and cut off a certain layer.
      • Problem: This is almost completely disjoint from what we’ve done before.
    • Fine tune BERT on each community and see differences in where the change happens (using TX-Ray method).
  • A larger set of communities
  • Goal: See if in a community the community-specific variation is just lexical or if it is syntactic or…
    • Idea: take a sentence from a subreddit (one that is quite characteristic) and adversarially make minimal changes to fit it to another subreddit
      • Problem: How do you know whether you have succeeded or not
    • Compare LSTM info gain to simpler model (unigram, bigram, …, convolutional?) info gain
      • Related to a method Shalom used to quantify acceptability (subtracting out lexical frequency effects)
      • Quantify differences in indiscernibility for the different models: e.g., are there communities that are much more discernible with the more complex models?

Unigram model

Actual distribution of unigrams for community.

P(w) is the actual unigram frequency of the word w throughout the data

P_c(w) is the actual unigram frequency of the word w in community c.

P_m(w) is the actual unigram frequency of the word w in message m.

Assuming m belongs to c, the predicted distribution for m is given by P_c.

Let P_c’(w) = α P(w) + (1-α) P_c(w), with, say, α = 0.01 (label smoothing).

consider H(P_m,P_c) = -∑_w P_m(w) log P_c(w) –> this will be +∞ if P_c(w) = 0 and P_m(w) > 0; i.e. if the word occurs in the message but not in the community.

We predict: P(m | c) = exp(-H(P_m,P_c)) –> this will be 0 in the same conditions

P(m) = exp(-H(P_m,P)) –> never 0 (because the message is part of the ‘training’ data, the above conditions never occur)

P(c | m) ∝ P(m | c) P(c)
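A minimal sketch of this unigram model, following the definitions above (function and variable names are assumptions, not fixed anywhere in the project):

```python
import numpy as np
from collections import Counter

ALPHA = 0.01  # smoothing weight towards the global unigram distribution P(w)

def unigram_dist(tokens):
    """Empirical unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy(p_m, p_ref):
    """H(P_m, P_ref) = -sum_w P_m(w) log P_ref(w); +inf if a message word has
    zero probability under the reference distribution."""
    h = 0.0
    for w, p in p_m.items():
        q = p_ref.get(w, 0.0)
        if q == 0.0:
            return float("inf")
        h -= p * np.log(q)
    return h

def unigram_posterior(message_tokens, community_dists, global_dist, priors):
    """P(c | m) ∝ P(m | c) P(c), with P(m | c) = exp(-H(P_m, P'_c))."""
    p_m = unigram_dist(message_tokens)
    scores = {}
    for c, p_c in community_dists.items():
        # smoothed community distribution P'_c(w) = alpha P(w) + (1 - alpha) P_c(w)
        p_c_smooth = {w: ALPHA * global_dist.get(w, 0.0) + (1 - ALPHA) * p_c.get(w, 0.0)
                      for w in p_m}
        scores[c] = np.exp(-cross_entropy(p_m, p_c_smooth)) * priors[c]
    z = sum(scores.values())   # normalisation over communities
    return {c: s / z for c, s in scores.items()}
```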