adsieg.github.io/part_1.Rmd at master · adsieg/adsieg.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
---
title: "Network Graph Embeddings is all we need in natural language understanding *[Part 1]*"
author: "Adrien Sieg"
output:
  html_document:
    theme: yeti
    highlight: haddock
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: true
---

<center>
![](neo4j_base.gif)
</center>

This paper brings to light how **network embedding graphs** can help to solve major open problems in natural language understanding. One illustrative element in NLP could be **language variability** - or **ambiguity problem** - which happens when two sentences express the same meaning (or ideas) with very different words. For instance we may say almost interchangeably: *<font color="blue">"where is the nearest sushi restaurant?"</font>* or *<font color="blue">"can you please give me addresses of sushi places nearby?"</font>*. These two sentences exactly share the same meaning with a different semantic wording. Here is the big challenge that we are struggling. In terms of data science, it bears witness of a well-known problem called **text similarity**. Indeed, my sparse vectors for the 2 sentences have no common words and consequently will have a cosine distance of 1. This is a terrible distance score because the 2 sentences have very similar meanings.

The first thing that is crossing any data scientists'mind would have been to use popular **document embedding methods** based on **similarity measures** such as **Doc2Vec**, **Average w2v vectors**, **Weighted average w2v vectors** (e.g. tf-idf), **RNN-based embeddings (e.g. deep LSTM networks)**, ... to cope with this text similarity challenge.

As for us, we will tackle this text similarity challenge by implementing **network graph embeddings** in light of **traditional word embeddings technics**.

We will play from two pieces of information of an unique dataset:

- We have news - written in English according many topics. This is a **text** format, that is to say a series of paragraphs, sentences and words. Here is the <font color="blue">micro level</font>. **We will implement graph of words.** Consequently, we will not consider the context of news in order to emphasise words themselves.

- In addition, we have information regarding the context of news - when they are published, where they come from, from which category they fall down, who write them, ... Here is a <font color="blue">macro level</font>. We will implement graph of context. Here, we will not consider words.

The objectives is to play with these two levels in order to classify news at best.

```{r, include=FALSE}
library(visNetwork)
library(kableExtra)
library(ggplot2)
library(dplyr)
library(tidytext)
library(janeaustenr)
library(igraph)
library(tidyr)
library(ggraph)
library(data.table)
library(scales)
library(ggplot2)
```

# Word Embeddings

In very basic terms, word embeddings turns corpus text into numerical vectors. Consequently two different words - sharing in common a same semantic similarity - are close in term of Euclidean distance into a given high dimensional space. Words that have the same meaning have a similar representation - or a very close numerical vectors.

## Pioneering paper

It has been shown (through the pionneering paper of **Abhik Jana** and **Pawan Goyal** on **<font color="blue">"Can Network Embedding of Distributional Thesaurus be Combined with Word Vectors for Better Representation?"</font>**) in which they explained that "Learning word representations is one of the basic and primary steps in understanding text and there are predominantly **two views of learning word representations**."

- "In one realm of representation,**words are vectors of distributions** obtained from analyzing their contexts in the text and two words are considered meaningfully **similar** if the vectors of those words are **close in the euclidean
space**. Here are dense representation of words such as predictive model like **Word2vec** (Mikolov et al, 2013) or **count-based model** like **GloVe** (Pennington et al., 2014)".

- Otherwise, "talks about **network like structure** where **two words are considered neighbors** if they both **occur in the same context** above a certain number of times. **The words are finally represented using these neighbors.** Distributional Thesaurus network is one such instance of this type, the notion of which was used in early work about distributional semantics (Grefenstette, 2012; Lin, 1998; Curran and Moens, 2002)."

The goal of this paper is to turn "**a Distributional Thesaurus (DT) network** into **word embeddings** by applying efficient network embedding methods and analyze how these embeddings generated from DT network can improve the representations generated from **prediction-based model** like Word2vec or dense count based semantic model like GloVe. We experiment with several combination techniques and find that <font color="blue">**DT network embedding can be combined with Word2vec and GloVe** to outperform the performances when used independently</font>."

- **<font color="green">Representation learning</font>** is when we use machine learning techniques to derive data representation

- **<font color="green">Distributed representation</font>** is different from one-hot representation because it uses dense vectors to represent data points

- **<font color="green">Embedding</font>** is mapping information entities into a low-dimensional space

## Paradigmatic Similarity

We are born with the intention of implementing some Natural Language Processing (NLP) within Graph Databases and Neo4j. First, few words when it comes to Neo4j which is a graph database management system developed by Neo4j. The underlying concept is very simple: everything is stored in the form of either an **edge**, a **node**, or an **attribute**. Each node and edge can have any number of attributes. Both the nodes and edges can be **labelled**.

Let's consider two sentences:

$S_{1}$ = Where is the nearest sushi restaurant?

$S_{2}$ = Can you please give me addresses of sushi places nearby?

- **Very basic transformation** : Remove stopwords such as "is", "the","can","you","please","me","of", ...

$S_{1}$ = {"Where","nearest","sushi","restaurant"}

$S_{2}$ = {"give","addresses","sushi","places","nearby"}

- **Each word is represented by its context**

left("sushi") = {"where","nearest","give","addresses"}

right("sushi") = {"restaurant","places","nearby"}

Here is the **Cypher Query** to put our two sentences into **neo4j graph database**:

```{}
WITH split(tolower("Where nearest sushi restaurant", "") AS text
UNWIND range(0, size(text)-2) AS i
MERGE (w1:Word {name: text[i]})
MERGE (w2: Word {name: text[i+1]})
MERGE (w1)-[:NEXT]-> (w2)
```

```{r, echo=FALSE}
nodes <- data.frame(id = c("Where","nearest","sushi","restaurant","give","addresses","places","nearby"),
                    label = c("Where","nearest","sushi","restaurant","give","addresses","places","nearby"),
                    group = c("S_1", "S_1", "S_2","S_1","S_3","S_3","S_3","S_3"))

edges <- data.frame(from = c("Where","nearest","sushi","give","addresses","sushi","places"),
                    to = c("nearest","sushi","restaurant","addresses","sushi","places","nearby"))

visNetwork(nodes, edges, main = "Representing text as a graph", height = "500px", width = "100%") %>%
  visOptions(manipulation = TRUE) %>%
  visEdges(arrows = "to") %>%
  visLayout(randomSeed = 123)
```

- Words with high context similarity likely have paradigmatic relation

Let's consider two news sentences:

$S_{1}$ = My boss eats sushi on Friday

$S_{2}$ = My brother eats pizza on Sunday evening

```{}
Sim("boss","brother") =
    Sim(left("boss"), right("boss")) +
    Sim(left("brother"), right("brother"))
```

```{r, echo=FALSE}
nodes <- data.frame(id = c("boss","eats","sushi","Friday","brother","pizza","Sunday","evening"),
                    label = c("boss","eats","sushi","Friday","brother","pizza","Sunday","evening"),
                    group = c("S_1", "S_2", "S_1","S_1","S_3","S_3","S_3","S_3"))

edges <- data.frame(from = c("boss","eats","sushi","brother","eats","pizza","Sunday"),
                    to = c("eats","sushi","Friday","eats","pizza","Sunday", "evening"))

visNetwork(nodes, edges, main = "Context similarity and Paradigmatic relationships", height = "500px", width = "100%") %>%
  visOptions(manipulation = TRUE) %>%
  visEdges(arrows = "to") %>%
  visLayout(randomSeed = 123)
```

# Building a Network graph of news

## Objective : Transforming text corpus into graph

Let's consider https://newsapi.org/ which is a simple and easy-to-use API that returns JSON metadata for headlines and articles live all over the web right now. News API indexes articles from over 30,000 worldwide sources. To be honest, I am highly interested in text - that is to say the variable of headlines.

But let's keep other columns to play with graph and maybe find some interesting things...

```{r, echo=FALSE}
setwd("C:/Users/adsieg/Desktop/part_1")
news_data<-fread("news.csv", sep = ",", header= TRUE)

news_data$category<-as.factor(news_data$category)

kable(news_data[c(1:8),c(1:5)]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

Let's switch **from** a linear and static **SQL** dataframe **to** a dynamic **NoSQL** database - i.e. from relational to non-relational database.

As you can spot, there are 9 variables within our dataframe coming from newsAPI. The job is to transform individual column into a relational data model. Let's take an instance to make a good start.

- "author" variable is turning into **<font color="red">red pen logo</font>** inside our new NoSQL database

- "description" variable is turning into **<font color="orange">orange comments logo</font>** inside our new NoSQL database

- "publishedAt" variableis turning into **<font color="green">green clock logo</font>** inside our new NoSQL database

- "source" variable is turning into **<font color="yellow">yellow compass logo</font>** inside our new NoSQL database

- "category" variable is turning into **<font color="black">black compass logo</font>** inside our new NoSQL database

- "title" variable is turning into **<font color="blue">blue folder logo</font>** inside our new NoSQL database

```{r, echo=FALSE}
nodes <- data.frame(id = 1:6, group = c("A", "B", "C","D","E","F"))
edges <- data.frame(from = c(1,1,1,1,1),
                    to = c(2,3,4,5,6))

visNetwork(nodes, edges, width = "100%") %>%
  visGroups(groupname = "A", shape = "icon",
            icon = list(code = "f07c", size = 75)) %>% # folder
  visGroups(groupname = "B", shape = "icon",
            icon = list(code = "f044", color = "red")) %>% # pen
  visGroups(groupname = "C", shape = "icon",
            icon = list(code = "f14e", color = "yellow")) %>% # direction
  visGroups(groupname = "D", shape = "icon",
            icon = list(code = "f086", color = "orange")) %>% # comments
  visGroups(groupname = "E", shape = "icon",
            icon = list(code = "f017", color = "green")) %>% # clock
  visGroups(groupname = "F", shape = "icon",
            icon = list(code = "f187", color = "black")) %>% # category
  addFontAwesome()
```

## Loading data into Neo4j

It should not be forgotten that the neo4j graph **just below** is just an isolated implementation of a given piece of information. Now the time has come to repeat a command looping over each news. In order to industrialise the building processes of our neo4j database, we are going to implement it on https://neo4j.com/

This big step will be the main topic of paper 2 [network_work_word_embeddings_part_2.html]

To give a first hint of what is neo4j, here is home screen of my database in which all news will be stocked...

<center>
![](neo4j_interface.PNG)
</center>

```{r, echo=FALSE}
nodes <- data.frame(id = 1:15, group = c("A", "B", "C","D","E","F","A", "B", "C","D","F","A","D", "E", "F"))
edges <- data.frame(from = c(1,1,1,1,1,7,7,7,7,7,12,12,12,12,12),
                    to = c(2,3,4,5,6,8,9,10,5,11,2,3,13,14,15))

visNetwork(nodes, edges, width = "100%") %>%
  visGroups(groupname = "A", shape = "icon",
            icon = list(code = "f07c", size = 75)) %>% # folder
  visGroups(groupname = "B", shape = "icon",
            icon = list(code = "f044", color = "red")) %>% # pen
  visGroups(groupname = "C", shape = "icon",
            icon = list(code = "f14e", color = "yellow")) %>% # direction
  visGroups(groupname = "D", shape = "icon",
            icon = list(code = "f086", color = "orange")) %>% # comments
  visGroups(groupname = "E", shape = "icon",
            icon = list(code = "f017", color = "green")) %>% # clock
  visGroups(groupname = "F", shape = "icon",
            icon = list(code = "f187", color = "black")) %>% # category
  addFontAwesome()
```


Just to provide some context of why we undertook this step - let's consider a concrete example to highlight how important links are to retrieve quickly and efficiently information as well as bringing to light unseeable links which stand for within a static and linear database.

So here we just drew some links between different news when they share some features in common such as author of the article, where news were coming from (BBC, Eurosport, ...), when news were published, from which category news fell, etc... In one word, the first thing to do is to partition/segment news according to features they jointly share. But here is not the final objective at all. We want a graph of words what we will call - later - as a distributional thesaurus network. The next step will be to turn it into dense word vectors and investigate the usefulness of distributional thesaurus embedding in improving overall word representation.

# Relationships between words

We have to take stock of what has happened so far:

- We have news - written in English according many topics. This is a **text** format. Here is the <font color="blue">micro level</font>. Clearly speaking, this is the most valuable piece of information we have. For the time being, we did not use it at all. This is the topic under scrunity.

- In addition, we have information regarding the context of news - when they are published, where they come from, from which category they fall down, who write them, ... Here is a <font color="blue">macro level</font>. We introduced it just above.

The objectives is to play with these two levels in order to classify news at best.

News are coming from different fields - which are they? and whith wich proportion?

```{r,fig.width = 10, fig.height = 6}
ggplot(news_data, aes(x=category, fill=category)) +  geom_bar() + theme_bw()
```

## N-grams and correlations

In a nutshell, the **DT network** contains, for each word, a list of words that are similar with respect to their bigram distribution (Riedl and Biemann, 2013). In the network, each word is a node and there is a weighted edge between a pair of words where the weight corresponds to the number of overlapping features.

Each bigram is broken into a word and a feature, where the feature consists of the bigram relation and the related word. The word pairs having number of overlapping features above a threshold are retained in the network.

### Category : General

A sample snapshot of Distributional Thesaurus network where each node represents a word and the weight of an edge between two words is defined as the number of context features that these two words share in common.


```{r}
general_news <- news_data %>%
  filter(category == "general") %>%
  select(description)

general_news_bigrams <- general_news %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  as.data.frame()

general_news_bigrams_counts<- general_news_bigrams %>%
  count(bigram, sort = TRUE)

kable(general_news_bigrams_counts[c(1:10),]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```


As we can see there are a lot of StopWords such as (*["of the", "is to", "as a", ...]*- they obviously bring no information and consequently they must be removed from the text.

```{r}
bigrams_separated <- general_news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
```

Here we keep only relevant bigrams and remove ones which have a weak occurence - we defined a theoretical threshold fixed at at least 50 times. In other words, we kept only bigrams which occurs frequently in a given text to put away those with a weak occurence.

```{r}
bigram_counts %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
```


### Category : Sport

Let's do the same but with sport news

```{r, echo=FALSE}

sport_news <- news_data %>%
  filter(category == "sport") %>%
  select(description)

sport_news_bigrams <- sport_news %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  as.data.frame()

bigrams_separated_sport <- sport_news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered_sport <- bigrams_separated_sport %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts_sport <- bigrams_filtered_sport %>%
  count(word1, word2, sort = TRUE)

bigram_counts_sport %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

```

### Category : Business

Let's do the same but with business news

```{r, echo=FALSE}

business_news <- news_data %>%
  filter(category == "business") %>%
  select(description)

business_news_bigrams <- business_news %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  as.data.frame()

bigrams_separated_business <- business_news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered_business <- bigrams_separated_business %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts_business <- bigrams_filtered_business %>%
  count(word1, word2, sort = TRUE)

bigram_counts_business %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "royalblue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

```

## Are there the same words in keeping with categories?

We have collected different news coming from different fields: **General**, **Business** and **Sport**. It is quite natural to expect different words in keeping with news. For instance, within Sport news we should find words such as tennis, football or some specific words used by a sport - whereas in business news we should find words related to trading, finance, marketing, stock exchange, ... The question here is to know how different news are each other?

If two words are close to the line - called bisector - in these plots, it means they have the same frequency in the two given text. In other words, words close to the line are used jointly in two categories of news.

```{r}
library(scales)
library(ggplot2)

frequency <- readRDS("C:/Users/adsieg/Desktop/part_1/frequency.RDS")

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `business news`, color = abs(`business news` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "business news", x = NULL) +
  theme_bw() +
  theme(legend.position="none")
```

Let's have an example:
Business news and general news share some words in common such as 'president', 'America', 'bank', 'Donald', 'ceo', 'apple', ... whereas

Words that are far from the line are words that are found more in one set of texts than another. For instance, words such as 'arsenal', 'chelsea', 'england', or 'donald', 'trump','administration', 'jobs', 'data' are closer than one given category rather than another one - either sport or business.

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between <font color="blue">business</font> and the <font color="blue">general news</font>, and between <font color="blue">sport</font> and <font color="blue">business news</font>?

```{r}
cor.test(data = frequency[frequency$author == "General news",],
         ~ proportion + `business news`)
```

```{r}
cor.test(data = frequency[frequency$author == "sport news",],
         ~ proportion + `business news`)
```

# Conclusion

From one dataset we can set up two graph databases:

- The first one is related to news' context (where they come from, when they are published, in which category they fall down, ...). The context can provide with a first classification by drawing some links between news.

- The second one is related to news themselves. Based on text, we can draw some links between words and get a classifcation of news in keeping with the meaning of news.

The big challenge of this paper is then to introduce how we are going to "analyze the effect of integrating the knowledge of Distributional Thesaurus network with the state-of-the-art word representation models to prepare a better word representation." This underlying goal will be the topic of next tutorial.

- We first prepare vector representations from Distributional Thesaurus (DT) network applying network

- Next we combine this thesaurus embedding with state-of-theart vector representations prepared using GloVe and Word2vec model for analysis.

Combined representation of GloVe and DT embeddings shows **promising performance** gain over state-of-the-art embeddings

If you are interested in what we introduced, feel free to share and go to the next page [network_work_word_embeddings_part_2.html] :)

<div id="disqus_thread"></div>
<script>

/**
*  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
*  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = PAGE_URL;  // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = 'https://adriensieg.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>