First release (with data files)

diatkinson released this 31 Oct 14:16 · 2 commits to master since this release

Files:

cmv_triples_*.json

OP/PC/explanation triples, stored as a list of JSON dictionaries. Each dictionary has the following keys:

  • op_selftext: the text of the OP.
  • deltaed_comment: what we call the "persuasive comment (PC)" in the paper.
  • explanation: the explanation.

The values of the dictionary are all strings.

The train split runs from January 1, 2013 through April 1, 2018; valid from April 1, 2018 through September 1, 2018; and test from September 1, 2018 through January 31, 2019.
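As a minimal sketch, one of these files could be loaded and checked against the structure described above as follows. `load_triples` is a hypothetical helper, not part of this release, and the exact file names are an assumption: the release only gives the pattern `cmv_triples_*.json`.

```python
import json

# The three keys documented for every triple.
EXPECTED_KEYS = {"op_selftext", "deltaed_comment", "explanation"}


def load_triples(path):
    """Load a list of OP/PC/explanation dictionaries from a JSON file.

    Raises ValueError or TypeError if an entry does not match the
    documented layout (three keys, all values strings).
    """
    with open(path, encoding="utf-8") as f:
        triples = json.load(f)

    for i, triple in enumerate(triples):
        missing = EXPECTED_KEYS - triple.keys()
        if missing:
            raise ValueError(f"triple {i} is missing keys: {sorted(missing)}")
        for key in EXPECTED_KEYS:
            if not isinstance(triple[key], str):
                raise TypeError(f"triple {i}: value of {key!r} is not a string")
    return triples
```

For example, `load_triples("cmv_triples_train.json")` would return the training triples, assuming the train file follows that naming.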

cmv_triples_*_token.jsonlist.gz*

Tokenized versions of the corresponding json files. When extracted, each of these will contain a newline-separated list of dictionaries, with the same keys as above. However, rather than strings, the values will be lists of tokens, where each token is itself represented as a 6-element list.

For example:

{"op_selftext": [["Even", "even", "ADV", "", "RB", "advmod"], ["if", "if", "ADP", "", "IN", "mark"], ["love", "love", "NOUN", "", "NN", "nsubj"]],
"deltaed_comment": [["From", "from", "ADP", "", "IN", "prep"], ["a", "a", "DET", "", "DT", "det"], ["microbiological", "microbiolog", "ADJ", "", "JJ", "amod"], ["perspective", "perspect", "NOUN", "", "NN", "pobj"]],
"explanation": [["I", "i", "PRON", "", "PRP", "nsubj"], ["'m", "'m", "VERB", "", "VBP", "ROOT"], ["not", "not", "ADV", "", "RB", "neg"], ["certain", "certain", "ADJ", "", "JJ", "acomp"]]}

Each token consists of 6 strings, representing, respectively:

  1. the word
  2. the stemmed word
  3. the coarse-grained spaCy part-of-speech tag, corresponding to the pos_ property.
  4. the named entity type, if present (otherwise an empty string).
  5. the fine-grained spaCy part-of-speech tag, corresponding to the tag_ property.
  6. the spaCy dependency label.

See https://spacy.io/api/annotation for descriptions of the POS, dependency, and named entity values.
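The tokenized files could be read along these lines. This is a sketch under stated assumptions: it assumes each archive is a plain gzip stream of one JSON dictionary per line, so it can be read without extracting it first. `Token` and `read_tokenized` are hypothetical names introduced for illustration; only the field order comes from the list above.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """One token, with fields in the documented order."""
    word: str
    stem: str
    pos: str       # coarse-grained POS (spaCy pos_)
    ent_type: str  # named entity type, "" if none
    tag: str       # fine-grained POS (spaCy tag_)
    dep: str       # dependency label


FIELDS = ("op_selftext", "deltaed_comment", "explanation")


def read_tokenized(path) -> Iterator[dict]:
    """Yield one dictionary per line, mapping each key to a list of Tokens."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield {key: [Token(*tok) for tok in record[key]] for key in FIELDS}
```

Reading lazily, one line at a time, avoids holding a whole split in memory. The original words of a post can then be recovered with, for example, `[t.word for t in record["op_selftext"]]`.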