Files:
cmv_triples_*.json
OP/PC/Explanation triples, in the form of a list of json dictionaries, each dictionary with the following keys:
- op_selftext: the text of the OP.
- deltaed_comment: what we call the "persuasive comment (PC)" in the paper.
- explanation: the explanation.
The values of the dictionary are all strings.
The train split runs from January 1 2013 through April 1 2018, the valid split from April 1 2018 through September 1 2018, and the test split from September 1 2018 through January 31 2019.
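As a rough illustration of the format described above, the following sketch loads one of the json files and checks that every dictionary carries the three expected keys. The function name load_triples and the constant EXPECTED_KEYS are hypothetical helpers introduced here, not part of the dataset release, and the exact file name passed in (the part matched by the * in cmv_triples_*.json) is left to the caller.

```python
import json

# Keys documented for each OP/PC/explanation dictionary.
EXPECTED_KEYS = {"op_selftext", "deltaed_comment", "explanation"}


def load_triples(path):
    """Load a list of triple dictionaries from a json file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        triples = json.load(f)
    for i, triple in enumerate(triples):
        # Fail early if a dictionary does not match the documented layout.
        missing = EXPECTED_KEYS - set(triple)
        if missing:
            raise ValueError(f"triple {i} is missing keys: {sorted(missing)}")
    return triples
```

Calling load_triples on the path of one of the extracted json files should return a list in which each element maps the three keys to strings.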
cmv_triples_*_token.jsonlist.gz*
Tokenized versions of the corresponding json files. When extracted, each of these will contain a newline-separated list of dictionaries, with the same keys as above. However, rather than strings, the values will be lists of tokens, where each token is itself represented as a 6-element list.
For example:
{"op_selftext": [["Even", "even", "ADV", "", "RB", "advmod"], ["if", "if", "ADP", "", "IN", "mark"], ["love", "love", "NOUN", "", "NN", "nsubj"]],
"deltaed_comment": [["From", "from", "ADP", "", "IN", "prep"], ["a", "a", "DET", "", "DT", "det"], ["microbiological", "microbiolog", "ADJ", "", "JJ", "amod"], ["perspective", "perspect", "NOUN", "", "NN", "pobj"]],
"explanation": [["I", "i", "PRON", "", "PRP", "nsubj"], ["'m", "'m", "VERB", "", "VBP", "ROOT"], ["not", "not", "ADV", "", "RB", "neg"], ["certain", "certain", "ADJ", "", "JJ", "acomp"]]}
Each token consists of 6 strings, representing, respectively:
- the word
- the stemmed word
- the spaCy coarse-grained part of speech tag, corresponding to the pos_ property.
- the named entity type, if present.
- the spaCy fine-grained part of speech tag, corresponding to the tag_ property.
- the spaCy dependency label.
See https://spacy.io/api/annotation for descriptions of the POS, dependency, and named entity values.
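The tokenized files can be read without extracting them first. The sketch below is one possible way to do that, assuming each line of the file holds exactly one json dictionary. The names Token, read_tokenized, and words are hypothetical helpers introduced here for illustration; the field names on Token simply follow the order of the six strings listed above.

```python
import gzip
import json
from collections import namedtuple

# Field order mirrors the documented 6-element token layout:
# word, stemmed word, pos_ tag, named entity type, tag_ tag, dependency label.
Token = namedtuple("Token", ["word", "stem", "pos", "ent_type", "tag", "dep"])


def read_tokenized(path):
    """Yield one dictionary per line, wrapping each token list in a Token."""
    # "rt" makes gzip decompress and decode to text on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. a trailing newline
            record = json.loads(line)
            yield {
                key: [Token(*tok) for tok in tokens]
                for key, tokens in record.items()
            }


def words(tokens):
    """Return just the surface words of a token list."""
    return [t.word for t in tokens]
```

Because read_tokenized is a generator, records are decoded one at a time, which keeps memory use low on the larger splits. An empty ent_type string means no named entity type was assigned to that token.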