20110422 MALLET-EVAL PROJECT
GENERAL
This is a project for evaluating MALLET (MAchine Learning for
LanguagE Toolkit). MALLET's binaries and source code are not included;
you can obtain them from this site:
http://mallet.cs.umass.edu/
This distribution contains only sample annotation data and scripts for
converting, importing, and evaluating. The articles in the two corpora
are not included here for copyright reasons, which is why you need
their CDs to build the complete data sets.
We provide two sample corpora: the Penn Treebank Sample (a 5% fragment
of the Penn Treebank) and the HIT CIR LTP Corpora Sample (a 10% fragment
of the whole Corpora):
http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
BUILDING THE TRAIN AND TEST DATA FILES
In order to obtain the data files and results, perform the following steps:
1. Get a local copy of the mallet-eval repository with this command:
hg clone https://mallet-eval.googlecode.com/hg/ mallet-eval
2. Set up the $MALLET_HOME environment variable: export MALLET_HOME=/path/to/mallet/
3. Train and test with the provided Chunking, POS Tagging and Named Entity
Recognition data (chunking/, pos-tagging/, ner/)
4a. (Chunking) ./conlleval < chunking/conlleval.out
4b. (POS-Tagging) cd pos-tagging && ./verify.py
4c. (Named Entity Recognition) ./chn-conlleval < ner/conlleval.out
The results (for chunking) are:
processed 47377 tokens with 23852 phrases; found: 23682 phrases; correct: 21441.
accuracy: 93.97%; precision: 90.54%; recall: 89.89%; FB1: 90.21
ADJP: precision: 72.35%; recall: 63.93%; FB1: 67.88 387
ADVP: precision: 78.61%; recall: 75.98%; FB1: 77.28 837
CONJP: precision: 40.00%; recall: 44.44%; FB1: 42.11 10
INTJ: precision: 50.00%; recall: 50.00%; FB1: 50.00 2
LST: precision: 0.00%; recall: 0.00%; FB1: 0.00 2
NP: precision: 90.05%; recall: 89.57%; FB1: 89.81 12355
PP: precision: 94.97%; recall: 96.88%; FB1: 95.92 4908
PRT: precision: 71.84%; recall: 69.81%; FB1: 70.81 103
SBAR: precision: 89.01%; recall: 78.69%; FB1: 83.53 473
VP: precision: 91.55%; recall: 90.51%; FB1: 91.03 4605
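The summary line above can be reproduced directly from the phrase counts: precision is correct/found, recall is correct/gold, and FB1 is their harmonic mean. A minimal sketch (the counts are taken from the output above):

```python
# Recomputing the conlleval summary line from its raw phrase counts.
# gold    = phrases in the reference data
# found   = phrases predicted by the tagger
# correct = predicted phrases that exactly match a reference phrase
gold, found, correct = 23852, 23682, 21441

precision = 100.0 * correct / found          # 90.54
recall = 100.0 * correct / gold              # 89.89
fb1 = 2 * precision * recall / (precision + recall)  # 90.21

print(f"precision: {precision:.2f}%; recall: {recall:.2f}%; FB1: {fb1:.2f}")
```

Note that the token-level accuracy (93.97%) cannot be derived from these phrase counts; conlleval computes it separately from per-token tag matches.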
DATA FORMAT
The data files contain one word per line. Empty lines have been used
for marking sentence boundaries and a line containing the keyword
-DOCSTART- has been added to the beginning of each article in order
to mark article boundaries. Each non-empty line contains the following
tokens:
1. the current word
2. the lemma of the word (German only)
3. the part-of-speech (POS) tag generated by a tagger
4. the chunk tag generated by a text chunker
5. the named entity tag given by human annotators
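Reading this format is straightforward: split on blank lines for sentences and on whitespace for columns. A minimal sketch, assuming the four-column English layout described above (word, POS, chunk, NE; the German-only lemma column is omitted):

```python
def read_sentences(lines):
    """Yield sentences as lists of (word, pos, chunk, ne) tuples."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:                        # empty line = sentence boundary
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("-DOCSTART-"):   # article boundary marker
            continue
        word, pos, chunk, ne = line.split()
        sentence.append((word, pos, chunk, ne))
    if sentence:
        yield sentence

# Illustrative fragment in the format described above.
sample = """-DOCSTART- -X- -X- O

U.N. NNP I-NP I-ORG
official NN I-NP O
"""
sents = list(read_sentences(sample.splitlines()))
```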
The tagger and chunker for English are roughly similar to the
ones used in the memory-based shallow parser demo available at
http://ilk.uvt.nl/. German POS and chunk information has been
generated by the TreeTagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In order to simulate a real natural language processing
environment, the POS tags and chunk tags have not been checked.
This means that they will contain errors. If you have access to
annotation software with superior performance, you may replace
these tags with your own.
The chunk tags and the named entity tags use the IOB1 format. This
means that, in general, words inside an entity receive the tag I-TYPE
to denote that they are Inside an entity of type TYPE. Whenever
two entities of the same type immediately follow each other, the
first word of the second entity receives the tag B-TYPE rather than
I-TYPE, to show that a new entity starts at that word.
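The IOB1 rule above can be sketched as a small decoder that turns a tag sequence into (start, end, type) entity spans (the function name and sample tags are illustrative):

```python
def iob1_spans(tags):
    """Extract (start, end, type) entity spans from an IOB1 tag sequence.

    In IOB1, I-TYPE starts a new entity unless it continues the previous
    token's entity of the same type; B-TYPE always starts a new entity
    (used only where two same-type entities are adjacent).
    """
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            if cur is not None:
                spans.append((start, i, cur))   # close the open entity
                start = cur = None
            continue
        prefix, etype = tag.split("-", 1)
        if cur == etype and prefix == "I":
            continue                            # same entity continues
        if cur is not None:
            spans.append((start, i, cur))       # close previous entity
        start, cur = i, etype                   # open a new entity
    if cur is not None:
        spans.append((start, len(tags), cur))
    return spans
```

For example, the sequence I-PER I-PER O I-ORG B-ORG I-ORG yields one PER span and two adjacent ORG spans, with B-ORG marking the boundary between them.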
The raw data has the same format as the training and test material,
but the final column has been omitted. There are word lists for
English (extracted from the training data), German (extracted from
the training data), and Dutch in the lists directory. You can
probably use the Dutch person names (PER) for English data as well.
Feel free to use any other external data sources that you might have
access to.
Max Lv <lch@fudan.edu.cn>
About
Automatically exported from code.google.com/p/mallet-eval