Analysis of Online Conversations to Detect Cyberpredators Using Recurrent Neural Networks; 2020; Kim et al.

**[Analysis of Online Conversations to Detect Cyberpredators Using Recurrent Neural Networks](https://dblp.org/rec/conf/lrec/KimKBH20)**
**Jinhwa Kim, Yoon Jo Kim, Mitra Behzadi, Ian G. Harris**
[STOC@LREC 2020](https://dblp.org/db/conf/lrec/stoc2020.html#KimKBH20)

**Why I chose this paper?**
It uses an RNN with dense vector representations as features. The input is the dense vector of a whole message (in practice, several consecutive messages), which is passed to the RNN. It is also fairly recent compared to other works.

**The main problem:**
Determining whether a conversation is predatory.

**The secondary problem:**
Previous works usually used bag-of-words representations with neural networks. This work investigates the effect of using dense vectors instead.

**Applications:**
(Server-based) online predator detection that takes the context of the messages into account.

**Method:**
Their system consists of two stages: **Message Labeling** and **Conversation Classification**.
According to the literature, a predatory conversation goes through different stages.

Message Labeling
![Message Labeling](https://github.com/fani-lab/Osprey/assets/124281176/4a87ed07-db07-4fda-8ea1-c362f22de422)

These stages/classes of a predatory conversation are handled in the first stage, which labels each message with one of four classes: Exchange of Personal Information, Grooming, Approach, and non-predatory.
The messages are encoded using a sentence embedding and then passed to an LSTM model, which generates the label for the given message. Notably, for every message, the four previous messages are also passed in so the model can better understand the context.
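
The context windowing described above can be sketched as follows. This is only an illustration: the function name `context_windows` is hypothetical, and letting the first few messages have shorter windows (instead of padding them) is an assumption, since the summary does not say how the paper handles the start of a conversation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def context_windows(messages: Sequence[T], n_prev: int = 4) -> List[List[T]]:
    """Pair each message with up to `n_prev` messages that precede it.

    The last element of every window is the message to be labeled.
    Early messages get shorter windows (an assumption, see lead-in).
    """
    windows = []
    for i in range(len(messages)):
        start = max(0, i - n_prev)
        windows.append(list(messages[start:i + 1]))
    return windows


# Placeholder strings stand in for the sentence embeddings of each message.
msgs = ["m0", "m1", "m2", "m3", "m4", "m5"]
windows = context_windows(msgs)
print(windows[0])  # window for the first message
print(windows[5])  # window for the last message
```

Each window would then be embedded message by message and fed to the labeling LSTM as one short sequence.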

Conversation Classification
![Conversation Classification](https://github.com/fani-lab/Osprey/assets/124281176/1c5eb485-41ec-4a31-8206-0d3483d451fc)

The sequence of labels is then passed to another LSTM, one label at a time. It seems that the outputs for all elements in the sequence are passed to a linear layer at the end. They pad the label sequences to a constant length and then apply masking so that the sizes stay consistent with the final linear layer.
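
A minimal sketch of the padding and masking step, in plain Python. The integer encoding of the four classes (0 to 3), the padding value `-1`, and the name `pad_and_mask` are assumptions made for illustration; the summary does not state which values or framework utilities the authors used.

```python
from typing import List, Sequence, Tuple

PAD = -1  # hypothetical padding id, outside the assumed label range 0..3


def pad_and_mask(
    sequences: Sequence[Sequence[int]], max_len: int, pad_value: int = PAD
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad (or truncate) label sequences to `max_len` and build a 0/1 mask.

    mask[i][j] is 1 where padded[i][j] is a real label and 0 where it is padding,
    so later layers can ignore the padded positions.
    """
    padded, mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_len]
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Two conversations of different lengths, as per-message label sequences.
padded, mask = pad_and_mask([[0, 1, 2], [3]], max_len=4)
print(padded)
print(mask)
```

With every sequence at the same length, the outputs over all time steps have a fixed size, which is what a final linear layer needs.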

_**The stage-one model is trained on the ChatCoder 2 dataset, and this pre-trained model is then used in the conversation classification stage.**_

**Input and Output:**
Stage 1:
input: Text embedded using the Universal Sentence Encoder (USE)
output: A sequence of labels giving the class of each message, drawn from the four categories above.
The stage 1 model was trained on ChatCoder 2.

Stage 2:
input: Stage 1's output (the sequence of message labels)
output: Whether the conversation is predatory or not


**Gaps:**
The conversation classification model was trained and tested on a dataset that differs from the PAN12 test set. According to their Table 2, the total number of conversations is 480 (128 predatory), while the PAN12 test set has 222055 conversations in total. They compare their results to those of the PAN12 competition and state that their result is better. I assume this is a false claim, because the dataset was altered and the results are therefore not directly comparable.


**Results:**
F0.5 = 0.9058
F1 = 0.9148
F2 = 0.9295

![image](https://github.com/fani-lab/Osprey/assets/124281176/cc0be5fe-42ab-4ed6-a88a-47a881baf494)
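
For reference, the three scores above are instances of the F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). Because the reported scores increase with beta (F0.5 < F1 < F2), recall must have been higher than precision. The sketch below uses made-up precision and recall values purely to illustrate the formula; they are not the paper's numbers.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (NOT taken from the paper): recall > precision.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta} = {f_beta(p, r, beta):.4f}")
```

With recall above precision, the printed scores rise as beta grows, matching the ordering of the reported results.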
