-
Notifications
You must be signed in to change notification settings - Fork 0
source files to install the Treetagger POS library
License
Klen900/TreeTaggerInstall
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
/***************************************************************************/
/* How to use the TreeTagger */
/* Author: Helmut Schmid, University of Stuttgart, Germany */
/***************************************************************************/
The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a handtagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.
If either of the programs is called without arguments, it will print
information about its usage.
Tagging
-------
Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.
tree-tagger {-options-} <parameter file> {<input file> {<output file>}}
Description of the command line arguments:
* <parameter file>: Name of a parameter file which was created with the
train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
file has to be on a separate line. Tokens may contain blanks. It is possible
to override the lexical information contained in the parameter file of the
tagger by specifying a list of possible tags after a token. This list has
to be preceded by a tab character and the elements are separated by tab
characters. This pretagging feature could be used e.g. to ensure that
certain text-specific expressions are tagged properly.
Punctuation marks must be on separate lines as well. Clitics (like "'s",
"'re", and "'d" in English or "-la" and "-t-elle" in French) should be
separated if they were separated in the training data. (The French and
English parameter files available by ftp expect separation of clitics).
Sample input file:
He
moved
to
New York City NP
.
* <output file>: Name of the file to which the tagger should write its output.
Further optional command line arguments:
* -token: tells the tagger to print the words also.
* -lemma: tells the tagger to print the lemmas of the words also.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
with '>' (SGML tags).
- -no-unknown: If an unknown word is encountered, emit the word form
as lemma. This was previously the default behaviour. Now, the default
behaviour is to print "<unknown>" as lemma.
- -threshold <p>: This option tells the tagger to print all tags of a
word with a probability higher than <p> times the largest probability.
(The tagger will use a different algorithm in this case and the set of
best tags might be different from the tags generated without this
option.)
- -prob: Print tag probabilities (in combination with option -threshold)
- -pt-with-prob: If this option is specified, then each pretagging tag
(see above) has to be followed by a whitespace and a tag probability
value.
- -pt-with-lemma: If this option is specified, then each pretagging tag
(see above) has to be followed by a whitespace and a lemma. Lemmas may
contain blanks.
If both -pt-with-prob and -pt-with-lemma have been specified, then each
pretagging tag is followed by a probability and a lemma in that order.
The options below are for advanced users. Please, read the papers on the
TreeTagger to fully understand their meaning.
* -proto: If this option is specified, the tagger creates a file named
"lexicon-protocol.txt", which contains information about the degree of
ambiguity and about the other possible tags of a word form. The part of
the lexicon in which the word form has been found is also indicated. 'f'
means fullform lexicon and 's' means affix lexicon. 'h' means that the
word contains a hyphen and that the part of the word following the
hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
This is the case if a word/tag pair is contained in the lexicon but not
in the training corpus. The choice of this parameter has only minor
influence on the tagging accuracy.
* -base: If this option is specified, only lexical information is used
for tagging but no contextual information about the preceding tags.
This option is only useful in order to obtain a baseline result
to which to compare the actual tagger output.
Training
--------
Training is done with the *train-tree-tagger* program. It expects at least
four command line arguments which are described below.
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>
Description of the command line arguments:
* <lexicon>: name of a file which contains the fullform lexicon. Each line
of the lexicon corresponds to one word form and contains the word form
and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
and each lemma is preceded by a blank or tab character.
Example:
aback RB aback
abacuses NNS abacus
abandon VB abandon VBP abandon
abandoned JJ abandoned VBD abandon VBN abandon
abandoning VBG abandon
Attention: Ordinal and cardinal numbers which consist of digits
(like 1, 13, 1278 or 2. and 75.) should not be included in the
lexicon. Otherwise, the tagger will not be able to learn how to tag
numbers which are not listed in the lexicon. Numbers with unusual
tags should be added to the lexicon, however. If the training
program reports an error because the POS tag used for numbers is
unknown, you should add a lexicon entry for one number.
Remark: The tagger doesn't need the lemmata for tagging actually. If
you do not have the lemma information or if you do not plan to
annotate corpora with lemmas, you can replace the lemma with a dummy
value, e.g. "-".
* <open class file>: name of a file which contains a list of open class tags
i.e. possible tags of unknown word forms separated by whitespace.
The tagger will use this information when it encounters unknown words,
i.e. words which are not contained in the lexicon.
Example: (for Penn Treebank tagset)
FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ
* <input file>: name of a file which contains tagged training data. The data
must be in one-word-per-line format. This means that each line contains
one token and one tag in that order separated by a tabulator.
Punctuation marks are considered as tokens and must be tagged as well.
The file should neither contain empty lines nor untagged SGML markup.
Example:
Pierre NP
Vinken NP
, ,
61 CD
years NNS
* <output file>: name of the file in which the resulting tagger parameters
are stored.
The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.
* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
is assigned to sentence punctuation like ".", "!", "?".
Default is "SENT". It is important to set this option properly, if your
tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
context. The default is 2 which corresponds to a trigram context. For
small training corpora and/or large tagsets, it could be useful to reduce
this parameter to 1.
* -dtg <min. decision tree gain>: Threshold - If the information gain at a
leaf node of the decision tree is below this threshold, the node is deleted.
* -sw <weight>: A smoothing parameter, which determines how much the
probability distribution of some decision tree node is smoothed with the
probability distribution of the parent node.
* -ecw <eq. class weight>: weight of the equivalence class based probability
estimates.
* -atg <affix tree gain> Threshold - If the information gain at a leaf of an
affix tree is below this threshold, it is deleted. The default is 1.2.
The accuracy of the TreeTagger usually improves, if different settings
of the above parameters are tested and the best combination is chosen.
Caveat: Make sure that the lexicon and the training corpus contain no
extra blanks. If the word form, for instance, is followed by a blank
and a tab character, the blank will be considered part of the word.
About
source files to install the Treetagger POS library
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published