
How is the text preprocessing done? #36

@iTomxy

Description

Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.

The paper says the articles are tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two .py files, I guess that each article is squashed into a single line in the *_docs.txt files, but those files are already processed, so I can't tell how they were produced.
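For example, my guess from the toy data (not the confirmed format) is that each line of a *_docs.txt file is one article as lowercased, space-separated tokens, something like:

a man is riding a bike down a street . two dogs are playing in the grass .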

Now I've installed Stanford CoreNLP and can call it from the command line. For each COCO image, I concatenate its 5 captions (separated by a single space), treat the result as one article, and write it to input.txt.
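Concretely, I build input.txt with something like the script below (this is my own setup, not from the repo; it assumes the standard COCO captions annotation JSON):

import json
from collections import defaultdict

# Group COCO captions by image; the annotation path is my own setup.
with open("annotations/captions_train2014.json") as f:
    annotations = json.load(f)["annotations"]

captions = defaultdict(list)
for ann in annotations:
    captions[ann["image_id"]].append(ann["caption"].strip())

with open("input.txt", "w") as f:
    for image_id, sents in sorted(captions.items()):
        # one image's ~5 captions, joined by spaces -> one "article" per line
        f.write(" ".join(sents) + "\n")

With input.txt written, I call CoreNLP like this: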

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt

However, the output is not lowercased. How should I modify the command to enable lowercasing?
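In case there is no tokenizer option for this, I may just lowercase the tokens myself in a post-processing step, along these lines (assuming CoreNLP wrote its CoNLL output next to the input as input.txt.conll, one token per line with blank lines between sentences; adjust the filename if yours differs):

# Collect the tokens CoreNLP emitted and lowercase them.
tokens = []
with open("input.txt.conll") as f:
    for line in f:
        word = line.strip()
        if word:  # skip the blank sentence-separator lines
            tokens.append(word.lower())

# Write the article back as one line of space-separated tokens.
with open("article.txt", "w") as f:
    f.write(" ".join(tokens) + "\n")

But I'd prefer to reproduce your exact pipeline, hence the question.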

By the way, there are other tokenization options shown here, like americanize. Did you use any of them when training the doc2vec model? If possible, could you share the details of your preprocessing method?

Thanks
