Description
Hi, I want to extract the doc2vec features of the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.
The paper says the articles are tokenized and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two .py files, I guess that each article is squashed into a single line in the *_docs.txt files, but those files are already preprocessed.
Now I've installed Stanford CoreNLP and can call it from the command line. After concatenating the 5 sentences for a COCO image (separated by a space), treating the result as an article, and writing it to input.txt, my call looks like:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt
However, the output is not lowercased. How should I modify the command to enable lowercasing?
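In case it helps, here is a rough sketch of what I'm doing to build input.txt and then lowercase the CoNLL output myself as a workaround. The annotation file name (captions_train2014.json), the output file name (input.txt.conll), and the assumption that plain lowercasing of the token column is acceptable are my own guesses, not something taken from this repo:

```python
import json
from collections import defaultdict

# Group the captions by image id and join each image's 5 captions
# into one "article" per line (my guess at the *_docs.txt format).
with open('captions_train2014.json') as f:
    annotations = json.load(f)['annotations']

captions = defaultdict(list)
for ann in annotations:
    captions[ann['image_id']].append(ann['caption'].strip())

with open('input.txt', 'w') as f:
    for image_id in sorted(captions):
        f.write(' '.join(captions[image_id]) + '\n')

# After running the CoreNLP command above (which, as far as I can tell,
# writes input.txt.conll when -outputFormat conll is used), lowercase
# the tokens myself.
with open('input.txt.conll') as f_in, open('input_lower.txt', 'w') as f_out:
    for line in f_in:
        f_out.write(line.lower())
```

I'm not sure whether lowercasing after tokenization like this matches your setup, so please correct me if your pipeline lowercases at a different stage.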
By the way, there are other tokenization options shown here, like americanize. Did you use them when training the doc2vec model? If possible, I hope you can provide the details of your preprocessing method.
Thanks