
How is the text preprocessing done? #36

@iTomxy

Description

Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing is performed.

The paper says the articles are tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two .py files, I guess that each article is squashed into a single line in the *_docs.txt files, but those files are already processed, so I can't tell how they were produced.
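For example, my guess from the toy data (not the confirmed format) is that each line of a *_docs.txt file is one article as lowercased, space-separated tokens, something like:

a man is riding a bike down a street . two dogs are playing in the grass .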

Now I've installed Stanford CoreNLP and can call it from the command line. For each COCO image, I concatenate its 5 captions (separated by a single space), treat the result as one article, and write it to input.txt.
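Concretely, I build input.txt with something like the script below (this is my own setup, not from the repo; it assumes the standard COCO captions annotation JSON):

import json
from collections import defaultdict

# Group COCO captions by image; the annotation path is my own setup.
with open("annotations/captions_train2014.json") as f:
    annotations = json.load(f)["annotations"]

captions = defaultdict(list)
for ann in annotations:
    captions[ann["image_id"]].append(ann["caption"].strip())

with open("input.txt", "w") as f:
    for image_id, sents in sorted(captions.items()):
        # one image's ~5 captions, joined by spaces -> one "article" per line
        f.write(" ".join(sents) + "\n")

With input.txt written, I call CoreNLP like this: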

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt

However, the output is not lowercased. How should I modify the command to enable lowercasing?
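In case there is no tokenizer option for this, I may just lowercase the tokens myself in a post-processing step, along these lines (assuming CoreNLP wrote its CoNLL output next to the input as input.txt.conll, one token per line with blank lines between sentences; adjust the filename if yours differs):

# Collect the tokens CoreNLP emitted and lowercase them.
tokens = []
with open("input.txt.conll") as f:
    for line in f:
        word = line.strip()
        if word:  # skip the blank sentence-separator lines
            tokens.append(word.lower())

# Write the article back as one line of space-separated tokens.
with open("article.txt", "w") as f:
    f.write(" ".join(tokens) + "\n")

But I'd prefer to reproduce your exact pipeline, hence the question.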

By the way, there are other tokenization options shown here, like americanize. Did you use any of them when training the doc2vec model? If possible, could you share the details of your preprocessing method?

Thanks
