attardi edited this page Dec 28, 2014 · 4 revisions

Training a POS tagger requires the following data:

- an annotated corpus in TSV format, with one token per line consisting of two tab-separated fields: the word form and its POS tag. Sentences are separated by an empty line.
- word embeddings created using nlpnet
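
The corpus layout described above can be checked before training with a small reader. This is only a sketch of the format as described here, not code taken from nlpnet: the function name `read_tagged_sentences` is hypothetical, and the sample sentences and their tags are invented for illustration.

```python
import io

# Illustrative corpus: one "form<TAB>tag" pair per line,
# with an empty line between sentences. The tags are examples only.
SAMPLE = "The\tDT\ncat\tNN\nsleeps\tVBZ\n\nDogs\tNNS\nbark\tVBP\n"


def read_tagged_sentences(stream):
    """Read a two-column TSV corpus into a list of sentences.

    Each sentence is a list of (form, tag) tuples.
    """
    sentences = []
    current = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, tag = line.split("\t")
        current.append((form, tag))
    # Keep the last sentence if the file does not end with an empty line.
    if current:
        sentences.append(current)
    return sentences


sentences = read_tagged_sentences(io.StringIO(SAMPLE))
```

A line that does not contain exactly two tab-separated fields raises a `ValueError` at the `split` unpacking, which makes malformed lines easy to spot.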

To train the tagger, use the following command:

 nlpnet-train.py pos  [-h] [-w WINDOW] [-f NUM_FEATURES]
                           [--load_features] [--load_network] [-e ITERATIONS]
                           [-l LEARNING_RATE] [--lf LEARNING_RATE_FEATURES]
                           [--lt LEARNING_RATE_TRANSITIONS] [-a ACCURACY]
                           [-n HIDDEN] [-v] --gold GOLD --data DATA
                           [--variant VARIANT] [--caps [CAPS]]
                           [--suffix [SUFFIX]] [--suffix_size SUFFIX_SIZE]
                           [--prefix [PREFIX]] [--prefix_size PREFIX_SIZE]
 optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW, --window WINDOW
                        Size of the word window (default 5)
  -f NUM_FEATURES, --num_features NUM_FEATURES
                        Number of features per word (default 50)
  --load_features       Load previously saved word type features (overrides -f
                        and must also load a dictionary file)
  --load_network        Load previously saved network
  -e ITERATIONS, --epochs ITERATIONS
                        Number of training epochs (default 100)
  -l LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate for network weights (default 0.001)
  --lf LEARNING_RATE_FEATURES
                        Learning rate for features (default 0.01)
  --lt LEARNING_RATE_TRANSITIONS
                        Learning rate for transitions (default 0.01)
  -a ACCURACY, --accuracy ACCURACY
                        Desired accuracy per tag.
  -n HIDDEN, --hidden HIDDEN
                        Number of hidden neurons (default 200)
  -v, --verbose         Verbose mode
  --gold GOLD           File with annotated data for training.
  --data DATA           Directory to save new models and load partially
                        trained ones
  --variant VARIANT     If "polyglot" use Polyglot case conventions; if
                        "senna" use SENNA conventions.
  --caps [CAPS]         Include capitalization features. Optionally, supply
                        the number of features (default 5)
  --suffix [SUFFIX]     Include suffix features. Optionally, supply the number
                        of features (default 5)
  --suffix_size SUFFIX_SIZE
                        Use suffixes up to this size (in characters, default
                        5). Only used if --suffix is supplied
  --prefix [PREFIX]     Include prefix features. Optionally, supply the number
                        of features (default 2)
  --prefix_size PREFIX_SIZE
                        Use prefixes up to this size (in characters, default
                        5). Only used if --prefix is supplied
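
As a rough illustration of how the options above combine, the following sketch assembles a training command line from the documented flags. The helper `build_train_command` is hypothetical, and `train.tsv` and `models/` are placeholder paths, not files shipped with nlpnet. The values passed for `-w`, `-n` and `-e` simply repeat the defaults listed in the help text.

```python
import shlex


def build_train_command(gold, data, window=5, hidden=200, epochs=100,
                        caps=False, suffix=False):
    """Assemble an nlpnet-train.py invocation for the POS task.

    Only flags listed in the help text are used; the paths are supplied
    by the caller.
    """
    cmd = [
        "nlpnet-train.py", "pos",
        "--gold", gold,       # annotated training corpus
        "--data", data,       # directory where models are saved
        "-w", str(window),
        "-n", str(hidden),
        "-e", str(epochs),
    ]
    if caps:
        cmd.append("--caps")      # capitalization features, default size
    if suffix:
        cmd.append("--suffix")    # suffix features, default size
    return cmd


# Placeholder paths, for illustration only.
command = build_train_command("train.tsv", "models/", caps=True, suffix=True)
command_line = " ".join(shlex.quote(arg) for arg in command)
print(command_line)
```

Keeping the command as a list of arguments means it can be passed directly to `subprocess.run` without going through a shell.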
