
Analysis of Online Conversations to Detect Cyberpredators Using Recurrent Neural Networks; 2020; Kim et al. #36

@hamedwaezi01

Analysis of Online Conversations to Detect Cyberpredators Using Recurrent Neural Networks
Jinhwa Kim, Yoon Jo Kim, Mitra Behzadi, Ian G. Harris
STOC@LREC 2020

Why I chose this paper:
It uses an RNN with dense vector representations as features. The input is the dense vector of a whole message (actually several messages), which is passed to the RNN. It is also fairly recent compared to other works.

The main problem:
Determining whether a conversation is predatory.

The secondary problem:
Previous works usually used bag-of-words representations with neural networks. This work investigates the effect of using dense vectors instead.

Applications:
(Server-based) online predator detection that takes the context of the messages into account.

Method:
Their system consists of two stages: Message Labeling and Conversation Classification.
According to the literature, a predatory conversation goes through different stages.

Message Labeling

These stages/classes of a predatory conversation are handled in the first stage, which labels each message with one of four classes: Exchange of Personal Information, Grooming, Approach, and non-predatory.
The messages are encoded using a sentence embedding and then passed to an LSTM model, which generates the label for a given message. Notably, for every message being labeled, the four previous messages are also passed in so the model can better understand the context.
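
To make the "four previous messages" idea concrete, here is a minimal Python sketch of how the context windows could be built before embedding. The function name `context_windows` and the handling of the first few messages (shorter windows instead of padding) are my own assumptions; the paper summary above does not say how the start of a conversation is treated. The embedding and the LSTM are left out.

```python
from typing import List


def context_windows(messages: List[str], n_prev: int = 4) -> List[List[str]]:
    """Return, for each message, the message plus up to n_prev preceding ones.

    Assumption: early messages simply get a shorter window.
    """
    windows = []
    for i in range(len(messages)):
        start = max(0, i - n_prev)
        # The last element of each window is the message to be labeled.
        windows.append(messages[start:i + 1])
    return windows


if __name__ == "__main__":
    chat = ["hi", "hello", "how old are you", "13", "where do you live", "near the park"]
    for window in context_windows(chat):
        print(len(window), "messages in window, labeling:", window[-1])
```

Each window would then be embedded message by message and fed to the stage-one LSTM as a short sequence.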

Conversation Classification

Then the sequence of labels is passed to another LSTM, one label at a time. It seems that the outputs for all elements in the sequence are passed to a linear layer at the end. They pad the label sequences to a constant length and then apply masking, so that sizes stay consistent with the final linear layer.
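
The padding and masking step can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the padding value `PAD = -1`, the integer class ids 0..3, and the truncation of sequences longer than `max_len` are hypothetical choices, not details taken from the paper.

```python
from typing import List, Tuple

PAD = -1  # hypothetical padding id, distinct from the four class ids 0..3


def pad_and_mask(seqs: List[List[int]], max_len: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every label sequence to max_len and build a 0/1 mask.

    mask[i][j] is 1 for a real label and 0 for padding.
    Assumption: sequences longer than max_len are truncated.
    """
    padded, mask = [], []
    for seq in seqs:
        seq = seq[:max_len]
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


if __name__ == "__main__":
    labels = [[0, 1], [2, 3, 1, 0]]
    padded, mask = pad_and_mask(labels, max_len=3)
    print(padded)
    print(mask)
```

With the mask, the padded positions can be ignored downstream, so every conversation presents the same fixed size to the final linear layer.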

Stage one is trained on the ChatCoder 2 dataset, and the resulting pre-trained model is then used in the conversation classification stage.

Input and Output:
Stage 1:
input: Text embedded using Universal Sentence Encoder (USE)
output: A sequence of labels indicating the class of each message in the aforementioned categories.
The dataset used to train the stage 1 model was ChatCoder 2.

Stage 2:
input: Stage 1's output (the sequence of message labels)
output: whether the conversation is predatory or not

Gaps:
The conversation classification model was trained and tested on a dataset that differs from the PAN12 test set. According to their Table 2, the total number of conversations is 480 (128 predatory), while the PAN12 test set has 222055 conversations in total. They say they compare their results to those of the PAN12 competition and state that their result is better. I assume this claim does not hold, because the dataset was altered and the results are therefore not directly comparable.

Results:
F0.5 = 0.9058
F1 = 0.9148
F2 = 0.9295
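
For reference, the three scores above are instances of the F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The small Python sketch below just evaluates that formula. The precision and recall values in the example are hypothetical placeholders, since the paper's precision and recall are not quoted in this summary.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more, beta > 1 weights recall more.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not taken from the paper.
    p, r = 0.8, 0.9
    for beta in (0.5, 1.0, 2.0):
        print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")
```

Note that F0.5 < F1 < F2, as reported above, is the ordering one gets when recall is higher than precision.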
