- supervised (LSTM-based) community classifier.
- Projection layer is a matrix L such that Lᵢ maps LSTM features to a predictor for community i. (So Lᵢ can be treated as a community embedding.)
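A minimal PyTorch sketch of this reading (dimensions and names are placeholders): the weight matrix of the projection layer doubles as the table of community embeddings.

```python
import torch
import torch.nn as nn

n_communities, feat_dim = 50, 256
projection = nn.Linear(feat_dim, n_communities, bias=False)  # the matrix L

lstm_features = torch.randn(32, feat_dim)      # a batch of message encodings
community_logits = projection(lstm_features)   # one predictor per community
community_embeddings = projection.weight       # row i is Lᵢ, the embedding of community i
```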
- Maybe it’s easier to model dependence over time? Dynamic processes can be represented like so:
X = exp(tA) = 1 + tA + (tA)²/2 + … + (tA)ⁿ/n! + …
where t is time (say in [0,1]) and A is a matrix (the generator of the dynamics).
Note that X = 1 (the identity) for t=0, so this should probably be another layer on top. The advantage of this treatment is that it cleanly separates change over time from the baseline at t=0. Furthermore, there is a standard way to interpret tA as a transformation.
See: https://www.youtube.com/watch?v=IZqwi0wJovM&list=PL49CF3715CB9EF31D&index=23
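A minimal sketch of the exp(tA) layer, assuming A is a learned square matrix and the dimensions are placeholders:

```python
import torch

d = 64
A = torch.randn(d, d) * 0.01      # learned generator of the dynamics (placeholder init)
base = torch.randn(d)             # community embedding at t = 0

def embedding_at(t: float) -> torch.Tensor:
    # X = exp(tA); at t = 0 this is the identity, so the t = 0 baseline is untouched.
    X = torch.matrix_exp(t * A)
    return X @ base
```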
- Different communities have different levels of linguistic complexity
- different language models are able to recognize such differences
- the level at which the community embedding is injected into the language model influences this.
- community vectors are correlated with extra-linguistic information.
- stability is correlated with language complexity
- Computation of loss/community/model (show all)
- Computation of confusion matrix for community inference per model (e.g. show only best and worst model) (on test set)
- Computation of H(c,m) entropy of community identification (on test set)
- Correlation between H(c,m) for various models. Identification of specific communities by hand
- Correlation between H(c,m) and Stability(c).
- Correlation of community embeddings with “type” of community
- Correlation of community embeddings between different models
- Note that community embeddings obtained from the co-occurrence matrix don’t say much
Experiments:
- train a transformer model with an added community embedding at various levels.
- test if the model can predict which community a message is coming from
- we could apply Bayes theorem:
- P(comm | message) = P(message | comm) * P(comm) / P(message)
- P(comm) is known
- P(message) = Σ_c P(message | c) P(c); this is our normalisation factor.
- P(message | comm) could be read out of the model.
-> W(i,j) = P(community=i | message) for message in community j –> this is a confusion matrix
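A hedged sketch of this step, assuming loss[m, c] is the model’s loss for message m conditioned on community c (so P(message | comm) = exp(-loss)) and prior is the known P(comm):

```python
import numpy as np

def confusion_matrix(loss: np.ndarray, prior: np.ndarray, true_comm: np.ndarray) -> np.ndarray:
    # P(comm | message) ∝ P(message | comm) * P(comm), normalised over communities.
    log_post = -loss + np.log(prior)                  # (n_messages, n_communities)
    log_post -= log_post.max(axis=1, keepdims=True)   # for numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # W[i, j] = average P(community = i | message) over messages truly from community j.
    n = loss.shape[1]
    W = np.zeros((n, n))
    for j in range(n):
        W[:, j] = post[true_comm == j].mean(axis=0)
    return W
```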
- leading idea:
- to get the community embedding, multiply the matrix of community embeddings by a weight W (see the sketch after this list).
- in training W is a one-hot encoding of the community.
- Or, if data is the whole training set, W is the identity matrix
- in testing, let W be optimised (and the rest is fixed)
- actually it’s convenient to optimise the community embeddings all at once, and let W be a c*c matrix. The ith row is selected when facing a message from community i.
- Then entropy of W (seen as a distribution) gives a level of confidence about the community.
- the data is a single input (and all the things that were parameters before are fixed and thus treated as input data)
- so W(i) tells you the distribution of communities for the set of messages we assigned to i.
- entropy of W(i) is low if community i is very specific from the point of view of our (linguistic) model (messages can easily be identified as coming from that community, i.e. they get a significantly higher loss when considered from the wrong community).
- actually W can be seen as a confusion matrix, so we can boil it down to a combined F score measuring the ability to identify a community.
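A rough sketch of the W idea (names and dimensions are assumptions; the frozen language model and its loss are omitted):

```python
import torch
import torch.nn.functional as F

c, d = 50, 64
E = torch.randn(c, d)                             # trained community embeddings, frozen at test time
W_logits = torch.zeros(c, c, requires_grad=True)  # only these are optimised in testing

def embedding_for(i: int) -> torch.Tensor:
    # Row i of W (a distribution over communities) mixes the community embeddings.
    w = F.softmax(W_logits[i], dim=0)
    return w @ E

def confidence(i: int) -> torch.Tensor:
    # Entropy of W(i): low if messages assigned to i are easy to pin to one community.
    w = F.softmax(W_logits[i], dim=0)
    return -(w * (w + 1e-12).log()).sum()
```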
- TODO: scatterplot for 2 models.
- P_m(i) = average loss per message, for community i, for model m.
- Analysis of difference in information gains for simple/complex models:
- Plot P_m(i) vs P_n(i) for each i (gives a scatterplot, which shows the information gain between m and n for each community).
- e.g. bigram model loss vs. transformer model loss.
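A possible plotting sketch (matplotlib; all names are assumptions):

```python
import matplotlib.pyplot as plt

def loss_scatter(P_m, P_n, labels, m_name="bigram", n_name="transformer"):
    # Each point is a community; distance from the diagonal shows the
    # information gain of model n over model m for that community.
    plt.scatter(P_m, P_n)
    for x, y, name in zip(P_m, P_n, labels):
        plt.annotate(name, (x, y), fontsize=6)
    lim = [min(min(P_m), min(P_n)), max(max(P_m), max(P_n))]
    plt.plot(lim, lim, linestyle="--", color="grey")
    plt.xlabel(f"avg loss per message ({m_name})")
    plt.ylabel(f"avg loss per message ({n_name})")
    plt.show()
```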
- Correlation with the extra-linguistic properties of communities?
- for example, using a single dense layer + softmax
- Frequency of image posts?
- Mean number of posts per user (or the corresponding power-law exponent estimator, sketched below: https://en.wikipedia.org/wiki/Power_law#Maximum_likelihood)
- Proportion of comments which are replies to other comments
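Sketch of the MLE exponent estimator from the linked Wikipedia section (continuous approximation; the choice of x_min is an assumption):

```python
import numpy as np

def power_law_alpha(posts_per_user: np.ndarray, x_min: float = 1.0) -> float:
    # alpha_hat = 1 + n / sum(ln(x_i / x_min)) over all x_i >= x_min
    x = posts_per_user[posts_per_user >= x_min]
    return 1.0 + len(x) / np.log(x / x_min).sum()
```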
- Community/People co-occurrence matrix / LSA
- construct the matrix as follows:
- for each message posted by user u in community c, M(u,c) += 1
- apply LSA (sketched below)
- Done here: https://github.com/trevormartin/shorttails/blob/master/subredditalgebra/subredditalgebra.r
- https://www.shorttails.io/interactive-map-of-reddit-and-subreddit-similarity-calculator/
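A sketch of this construction with scipy/sklearn (the (user, community) input format is an assumption):

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD

def community_embeddings_lsa(messages, n_users, n_comms, dim=100):
    # messages: iterable of (user_id, community_id) pairs, one per posted message.
    users, comms = zip(*messages)
    # Duplicate (u, c) entries are summed, implementing M(u, c) += 1.
    M = coo_matrix((np.ones(len(users)), (users, comms)),
                   shape=(n_users, n_comms)).tocsr()
    svd = TruncatedSVD(n_components=dim)
    svd.fit(M)                       # LSA over the user x community count matrix
    return svd.components_.T         # one dim-dimensional vector per community
```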
- Correlation between embeddings of several models.
- Basic idea: check the Pearson correlation between cosine similarity (c^m_i, c^m_j) for every pair of communities (i,j), between two models m.
- If r=1, then you have a perfect orthogonal mapping
- This is a very strong condition, because all dimensions play some role. Another idea is to do a projection on the n most relevant dimensions first.
- For n=1 this checks that the most relevant dimension is the same for both models
- To do this, first do an LSA/SVD decomposition of the embeddings and truncate at n dimensions (sklearn.decomposition.TruncatedSVD(n_components=n))
- Then compute Pearson correlation
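A sketch of the whole check (sklearn/scipy; E1 and E2 are assumed to be the two models’ community-embedding matrices, one row per community):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def embedding_correlation(E1, E2, n=None):
    # Optionally project each embedding set onto its n most relevant dimensions.
    if n is not None:
        E1 = TruncatedSVD(n_components=n).fit_transform(E1)
        E2 = TruncatedSVD(n_components=n).fit_transform(E2)
    S1, S2 = cosine_similarity(E1), cosine_similarity(E2)
    iu = np.triu_indices_from(S1, k=1)     # each community pair (i, j), i < j, once
    r, _ = pearsonr(S1[iu], S2[iu])
    return r
```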
- Where in the architecture should the embedding go in general?
- Can the topically-driven language model smooth the embedding?
- Discovering Discrete Latent Topics with Neural Variational Inference https://arxiv.org/pdf/1706.00359.pdf
- Jey Han Lau’s work
- Unsupervised Domain Clusters in Pretrained Language Models https://www.aclweb.org/anthology/2020.acl-main.692/
- Del Tredici, M., & Fernández, R. (2017). Semantic Variation in Online Communities of Practice. https://www.aclweb.org/anthology/W17-6804
- Some kind of probing to characterize what kind of variation is happening.
- Idea: Take a BERT model and cut off a certain layer.
- Problem: This is almost completely disjoint with what we’ve done before.
- Fine tune BERT on each community and see differences in where the change happens (using TX-Ray method).
- A larger set of communities
- Goal: See if in a community the community-specific variation is just lexical or if it is syntactic or…
- Idea: take a sentence from a subreddit (one that is quite characteristic) and adversarially make minimal changes to fit another subreddit
- Problem: how do you know whether you have succeeded or not
- Compare LSTM info gain to simpler model (unigram, bigram, …, convolutional?) info gain
- Related to a method Shalom used to quantify acceptability (subtracting out lexical frequency effects)
- Quantify differences in indiscernibility for the different models: e.g., are there communities that are much more discernible with the more complex models?
Actual distribution of unigrams for community.
P(w) is the actual unigram frequency of the word w throughout the data
P_c(w) is the actual unigram frequency of the word w in community c.
P_m(w) is the actual unigram frequency of the word w in message m.
Assuming m belongs to c, the predicted distribution for m is given by P_c.
Let P_c’(w) = α P(w) + (1-α) P_c(w), with say α = 0.01 (label smoothing).
Consider H(P_m,P_c) = -∑_w P_m(w) log P_c(w) –> this will be +∞ if P_c(w) = 0 and P_m(w) > 0; i.e. if a word occurs in the message but not in the community.
We predict: P(m | c) = exp(-H(P_m,P_c)) –> this will be 0 under the same conditions
P(m) = exp(-H(P_m,P)) –> never 0 (because the message is part of the ‘training’ data, the above conditions never occur)
P(c | m) ∝ P(m | c) P(c)
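A sketch of this unigram baseline (frequencies as dicts from word to probability; names are assumptions):

```python
import numpy as np

def cross_entropy(P_m: dict, Q: dict) -> float:
    # H(P_m, Q) = -sum_w P_m(w) log Q(w); +inf when some word of m has Q(w) = 0.
    return -sum(p * np.log(Q.get(w, 0.0)) for w, p in P_m.items())

def posterior(P_m, P, community_dists, prior, alpha=0.01):
    # P(m | c) = exp(-H(P_m, P_c')) with P_c'(w) = alpha P(w) + (1 - alpha) P_c(w).
    log_post = {}
    for c, P_c in community_dists.items():
        P_smooth = {w: alpha * P.get(w, 0.0) + (1 - alpha) * P_c.get(w, 0.0)
                    for w in P_m}
        log_post[c] = -cross_entropy(P_m, P_smooth) + np.log(prior[c])
    z = max(log_post.values())                       # normalise in log space
    post = {c: np.exp(v - z) for c, v in log_post.items()}
    total = sum(post.values())
    return {c: v / total for c, v in post.items()}
```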