- supervised (LSTM-based) community classifier.
- Projection layer is a matrix L such that Lᵢ maps LSTM features to a predictor for community i. (So Lᵢ can be treated as a community embedding.)
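A minimal PyTorch sketch of this reading (dimensions and names are placeholders): the weight matrix of the projection layer doubles as the table of community embeddings.

```python
import torch
import torch.nn as nn

n_communities, feat_dim = 50, 256
projection = nn.Linear(feat_dim, n_communities, bias=False)  # the matrix L

lstm_features = torch.randn(32, feat_dim)      # a batch of message encodings
community_logits = projection(lstm_features)   # one predictor per community
community_embeddings = projection.weight       # row i is Lᵢ, the embedding of community i
```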
- Maybe it’s easier to model dependence over time? Dynamic processes can be represented like so:
X = exp(tA) = 1 + tA + (tA)²/2 + … + (tA)ⁿ/n! + …
where t is time (say in [0,1]) and A is a matrix (the generator of the dynamics).
Note that X = 1 (the identity) for t=0, so this should probably be another layer on top. The advantage of this treatment is that it cleanly separates change over time from the baseline at t=0. Furthermore, there is a standard way to interpret tA as a transformation.
See: https://www.youtube.com/watch?v=IZqwi0wJovM&list=PL49CF3715CB9EF31D&index=23
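A minimal sketch of the exp(tA) layer, assuming A is a learned square matrix and the dimensions are placeholders:

```python
import torch

d = 64
A = torch.randn(d, d) * 0.01      # learned generator of the dynamics (placeholder init)
base = torch.randn(d)             # community embedding at t = 0

def embedding_at(t: float) -> torch.Tensor:
    # X = exp(tA); at t = 0 this is the identity, so the t = 0 baseline is untouched.
    X = torch.matrix_exp(t * A)
    return X @ base
```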
- Different communities have different levels of linguistic complexity
- different language models are able to recognize such differences
- the level at which the community embedding is injected into the language model influences this.
- community vectors are correlated with extra-linguistic information.
- stability is correlated with language complexity
- Computation of loss/community/model (show all)
- Computation of confusion matrix for community inference per model (e.g. show only best and worst model) (on test set)
- Computation of H(c,m) entropy of community identification (on test set)
- Correlation between H(c,m) for various models. Identification of specific communities by hand
- Correlation between H(c,m) and Stability(c).
- Correlation of community embeddings with “type” of community
- Correlation of community embeddings between different models
- Note that community embeddings obtained from the co-occurrence matrix don’t say much
Experiments:
- train a transformer model with an added community embedding at various levels.
- test if the model can predict which community a message is coming from
- we could apply Bayes theorem:
- P(comm | message) = P(message | comm) * P(comm) / P(message)
- P(comm) is known
- P(message) = Σ_c P(message | c) P(c); this is our normalisation factor.
- P(message | comm) could be read out of the model.
-> W(i,j) = P(community=i | message) for message in community j –> this is a confusion matrix
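A hedged sketch of this step, assuming loss[m, c] is the model’s loss for message m conditioned on community c (so P(message | comm) = exp(-loss)) and prior is the known P(comm):

```python
import numpy as np

def confusion_matrix(loss: np.ndarray, prior: np.ndarray, true_comm: np.ndarray) -> np.ndarray:
    # P(comm | message) ∝ P(message | comm) * P(comm), normalised over communities.
    log_post = -loss + np.log(prior)                  # (n_messages, n_communities)
    log_post -= log_post.max(axis=1, keepdims=True)   # for numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # W[i, j] = average P(community = i | message) over messages truly from community j.
    n = loss.shape[1]
    W = np.zeros((n, n))
    for j in range(n):
        W[:, j] = post[true_comm == j].mean(axis=0)
    return W
```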
- leading idea:
- to get the community embedding, multiply the matrix of community embeddings by a weight W (see the sketch after this list).
- in training W is a one-hot encoding of the community.
- Or, if data is the whole training set, W is the identity matrix
- in testing, let W be optimised (and the rest is fixed)
- actually it’s convenient to optimise the community embeddings all at once, and let W be a c*c matrix. The ith row is selected when facing a message from community i.
- Then entropy of W (seen as a distribution) gives a level of confidence about the community.
- the data is a single input (and all the things that were parameters before are fixed and thus treated as input data)
- so W(i) tells you the distribution of communities for the set of messages we assigned to i.
- entropy of W(i) is low if community i is very specific from the point of view of our (linguistic) model (messages can easily be identified as coming from that community, i.e. they get a significantly higher loss when considered from the wrong community).
- actually W can be seen as a confusion matrix, so we can boil it down to a combined F score measuring the ability to identify a community.
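A rough sketch of the W idea (names and dimensions are assumptions; the frozen language model and its loss are omitted):

```python
import torch
import torch.nn.functional as F

c, d = 50, 64
E = torch.randn(c, d)                             # trained community embeddings, frozen at test time
W_logits = torch.zeros(c, c, requires_grad=True)  # only these are optimised in testing

def embedding_for(i: int) -> torch.Tensor:
    # Row i of W (a distribution over communities) mixes the community embeddings.
    w = F.softmax(W_logits[i], dim=0)
    return w @ E

def confidence(i: int) -> torch.Tensor:
    # Entropy of W(i): low if messages assigned to i are easy to pin to one community.
    w = F.softmax(W_logits[i], dim=0)
    return -(w * (w + 1e-12).log()).sum()
```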
- TODO: scatterplot for 2 models.
- P_m(i) = average loss per message, for community i, for model m.
- Analysis of difference in information gains for simple/complex models:
- Plot P_m(i) vs P_n(i) for each i (gives a scatterplot, which shows the information gain between m and n for each community).
- e.g. bigram model loss vs. transformer model loss.
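A possible plotting sketch (matplotlib; all names are assumptions):

```python
import matplotlib.pyplot as plt

def loss_scatter(P_m, P_n, labels, m_name="bigram", n_name="transformer"):
    # Each point is a community; distance from the diagonal shows the
    # information gain of model n over model m for that community.
    plt.scatter(P_m, P_n)
    for x, y, name in zip(P_m, P_n, labels):
        plt.annotate(name, (x, y), fontsize=6)
    lim = [min(min(P_m), min(P_n)), max(max(P_m), max(P_n))]
    plt.plot(lim, lim, linestyle="--", color="grey")
    plt.xlabel(f"avg loss per message ({m_name})")
    plt.ylabel(f"avg loss per message ({n_name})")
    plt.show()
```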
- Correlation with the extra-linguistic properties of communities?
- for example, using a single dense layer + softmax
- Frequency of image posts?
- Mean number of posts per user (or the corresponding power-law exponent estimator, sketched below: https://en.wikipedia.org/wiki/Power_law#Maximum_likelihood)
- Proportion of comments which are replies to other comments
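Sketch of the MLE exponent estimator from the linked Wikipedia section (continuous approximation; the choice of x_min is an assumption):

```python
import numpy as np

def power_law_alpha(posts_per_user: np.ndarray, x_min: float = 1.0) -> float:
    # alpha_hat = 1 + n / sum(ln(x_i / x_min)) over all x_i >= x_min
    x = posts_per_user[posts_per_user >= x_min]
    return 1.0 + len(x) / np.log(x / x_min).sum()
```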
- Community/People co-occurrence matrix / LSA
- construct the matrix as follows:
- for each message posted by user u in community c, M(u,c) += 1
- apply LSA (sketched below)
- Done here: https://github.com/trevormartin/shorttails/blob/master/subredditalgebra/subredditalgebra.r
- https://www.shorttails.io/interactive-map-of-reddit-and-subreddit-similarity-calculator/
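A sketch of this construction with scipy/sklearn (the (user, community) input format is an assumption):

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD

def community_embeddings_lsa(messages, n_users, n_comms, dim=100):
    # messages: iterable of (user_id, community_id) pairs, one per posted message.
    users, comms = zip(*messages)
    # Duplicate (u, c) entries are summed, implementing M(u, c) += 1.
    M = coo_matrix((np.ones(len(users)), (users, comms)),
                   shape=(n_users, n_comms)).tocsr()
    svd = TruncatedSVD(n_components=dim)
    svd.fit(M)                       # LSA over the user x community count matrix
    return svd.components_.T         # one dim-dimensional vector per community
```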
- Correlation between embeddings of several models.
- Basic idea: check the Pearson correlation between cosine similarity (c^m_i, c^m_j) for every pair of communities (i,j), between two models m.
- If r=1, then you have a perfect orthogonal mapping
- This is a very strong condition, because all dimensions play some role. Another idea is to do a projection on the n most relevant dimensions first.
- For n=1 this checks that the most relevant dimension is the same for both models
- To do this, first do an LSA/SVD decomposition of the embeddings and truncate at n dimensions (sklearn.decomposition.TruncatedSVD(n_components=n))
- Then compute Pearson correlation
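A sketch of the whole check (sklearn/scipy; E1 and E2 are assumed to be the two models’ community-embedding matrices, one row per community):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def embedding_correlation(E1, E2, n=None):
    # Optionally project each embedding set onto its n most relevant dimensions.
    if n is not None:
        E1 = TruncatedSVD(n_components=n).fit_transform(E1)
        E2 = TruncatedSVD(n_components=n).fit_transform(E2)
    S1, S2 = cosine_similarity(E1), cosine_similarity(E2)
    iu = np.triu_indices_from(S1, k=1)     # each community pair (i, j), i < j, once
    r, _ = pearsonr(S1[iu], S2[iu])
    return r
```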
- Where in the architecture should the embedding go in general?
- Can the topically-driven language model smooth the embedding?
- Discovering Discrete Latent Topics with Neural Variational Inference https://arxiv.org/pdf/1706.00359.pdf
- Jey Han Lau’s work
- Unsupervised Domain Clusters in Pretrained Language Models https://www.aclweb.org/anthology/2020.acl-main.692/
- Del Tredici, M., & Fernández, R. (2017). Semantic Variation in Online Communities of Practice. https://www.aclweb.org/anthology/W17-6804
- Some kind of probing to characterize what kind of variation is happening.
- Idea: Take a BERT model and cut off a certain layer.
- Problem: This is almost completely disjoint with what we’ve done before.
- Fine tune BERT on each community and see differences in where the change happens (using TX-Ray method).
- A larger set of communities
- Goal: See if in a community the community-specific variation is just lexical or if it is syntactic or…
- Idea: take a sentence from a subreddit (one that is quite characteristic) and adversarially make minimal changes to fit another subreddit
- Problem: how do you know whether you have succeeded or not
- Compare LSTM info gain to simpler model (unigram, bigram, …, convolutional?) info gain
- Related to a method Shalom used to quantify acceptability (subtracting out lexical frequency effects)
- Quantify differences in indiscernibility for the different models: e.g., are there communities that are much more discernible with the more complex models?
Actual distribution of unigrams for community.
P(w) is the actual unigram frequency of the word w throughout the data
P_c(w) is the actual unigram frequency of the word w in community c.
P_m(w) is the actual unigram frequency of the word w in message m.
Assuming m belongs to c, the predicted distribution for m is given by P_c.
Let P_c’(w) = α P(w) + (1-α) P_c(w), with say α = 0.01 (label smoothing).
Consider H(P_m,P_c) = -∑_w P_m(w) log P_c(w) –> this will be +∞ if P_c(w) = 0 and P_m(w) > 0; i.e. if a word occurs in the message but not in the community.
We predict: P(m | c) = exp(-H(P_m,P_c)) –> this will be 0 under the same conditions
P(m) = exp(-H(P_m,P)) –> never 0 (because the message is part of the ‘training’ data, the above conditions never occur)
P(c | m) ∝ P(m | c) P(c)
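A sketch of this unigram baseline (frequencies as dicts from word to probability; names are assumptions):

```python
import numpy as np

def cross_entropy(P_m: dict, Q: dict) -> float:
    # H(P_m, Q) = -sum_w P_m(w) log Q(w); +inf when some word of m has Q(w) = 0.
    return -sum(p * np.log(Q.get(w, 0.0)) for w, p in P_m.items())

def posterior(P_m, P, community_dists, prior, alpha=0.01):
    # P(m | c) = exp(-H(P_m, P_c')) with P_c'(w) = alpha P(w) + (1 - alpha) P_c(w).
    log_post = {}
    for c, P_c in community_dists.items():
        P_smooth = {w: alpha * P.get(w, 0.0) + (1 - alpha) * P_c.get(w, 0.0)
                    for w in P_m}
        log_post[c] = -cross_entropy(P_m, P_smooth) + np.log(prior[c])
    z = max(log_post.values())                       # normalise in log space
    post = {c: np.exp(v - z) for c, v in log_post.items()}
    total = sum(post.values())
    return {c: v / total for c, v in post.items()}
```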