MerkOrExtraction

A package to extract semantic information from texts and create a semantic database from that information.

Preprocessing steps are included for Icelandic.

Preprocessing

Prerequisites:

  1. text(s) tagged with the IceNLP IceTagger (http://sourceforge.net/projects/icenlp/)
  2. Beygingarlýsing íslensks nútímamáls (BÍN, the Database of Modern Icelandic Inflection) in a PostgreSQL database (http://bin.arnastofnun.is)

Create a database table of wordforms with mapping between IceNLP format and BÍN format:

Run:

java -jar MerkorExtraction.jar -bin_mapping -input <input_file_or_directory>

If -input is a directory, all files in that directory will be processed. The program produces the following output files:

  • wordforms_nouns.sql
  • wordforms_l_adjectives.sql
  • wordforms_s_verbs.sql
  • nonValidWords_nouns.txt
  • nonValidWords_l_adjectives.txt
  • nonValidWords_s_verbs.txt

The .sql files contain SQL statements to insert into the database. The statements have the following format:

INSERT INTO wordforms_nouns VALUES ('fólki', 'nheþ', 'ÞGFET')

where the first value is the wordform, the second the IceNLP tag, and the third the BÍN tag.
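
For illustration, generating such a statement can be sketched as follows (a hypothetical Python helper, not part of the package; escaping is limited to doubling single quotes, which suffices for this data):

```python
# Hypothetical helper (not part of MerkOrExtraction): build an INSERT
# statement of the format shown above from a (wordform, IceNLP tag,
# BÍN tag) triple. Single quotes are doubled for SQL escaping.
def to_insert(table, wordform, ice_tag, bin_tag):
    esc = lambda s: s.replace("'", "''")
    return "INSERT INTO {} VALUES ('{}', '{}', '{}')".format(
        table, esc(wordform), esc(ice_tag), esc(bin_tag))

# to_insert('wordforms_nouns', 'fólki', 'nheþ', 'ÞGFET')
# yields exactly the statement shown above.
```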

During the process above, all so-called non-valid words tagged as nouns, adjectives or verbs are sorted out and written to the files 'nonValidWords_<wordclass>.txt'. Example:

fl.
iii.
sl.
ungrúísland.is
floor
...
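
Conceptually, a word counts as non-valid when its form has no entry in BÍN for the tagged word class. A minimal Python sketch of that idea (the toy wordform set stands in for the BÍN table; the package's actual criteria may differ):

```python
# Toy stand-in for the BÍN wordform table; the real check runs against
# the PostgreSQL database set up in the prerequisites.
bin_wordforms = {'fólki', 'komu', 'markmiðið'}

def is_valid(wordform, known=bin_wordforms):
    # A tagged token is "non-valid" if its wordform is unknown to BÍN.
    return wordform in known

# is_valid('fólki') -> True; is_valid('ungrúísland.is') -> False
```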

Processing time on a MacBook Pro for a 53.7 MB input file: 6:14 minutes.

Now perform steps 1 and 2 of the database construction; see data/merkorDBconstruction.sql.

Populate database from file:

Run:

java -jar MerkorExtraction.jar -fill_db -db_name <databasename> -db_conn <databaseconnection> -password <yourpassword> -input <sql-file>

You can also set the default database connection in the class is.merkor.util.database.DBConnection.java.

Processing time on a MacBook Pro for an 89 MB input file: 13:22 minutes. Processing times on a MacBook Pro for the real data: nouns (4.67 GB): 13 h 10 min; adjectives (1.22 GB): 2 h 51 min; verbs (3.15 GB): 8 h 36 min.

Improvement suggestion: use the PostgreSQL 'COPY' command instead of 'INSERT INTO' (see the PostgreSQL documentation); it is much faster.
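
One way to do this, sketched in Python under the assumption that the generated .sql files keep the three-value format shown above: convert the INSERT lines into a tab-separated file and load that with COPY ... FROM.

```python
import re

# Assumed input format: the three-value INSERT statements shown above.
INSERT_RE = re.compile(r"VALUES\s*\('(.*?)',\s*'(.*?)',\s*'(.*?)'\)")

def insert_to_copy_row(line):
    """Turn one INSERT statement into a tab-separated row for COPY."""
    m = INSERT_RE.search(line)
    return '\t'.join(m.groups()) if m else None

# The resulting rows can be written to a file and loaded with, e.g.:
#   COPY wordforms_nouns FROM '/path/to/rows.tsv';
```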

Lemmatizing

The MerkorLemmatizer is a standalone application, but also accessible from within MerkOrExtraction as BINLemmatizer.

TODO: define command line parameters for lemmatizer in MerkOrExtraction.

Pattern Extraction

If you already have patterns for relation extraction, skip this step!

Define the rules for the extraction of patterns:

One can define various rules for the initial extraction of patterns, depending on which kinds of relations are to be extracted later. In the following, the rules for the extraction of noun phrases and prepositional phrases are described. The aim is to extract relations between nouns (using enumerations of nouns, genitive constructions, and prepositions), as well as relations between adjectives and between adjectives and nouns. Patterns containing verbs are excluded to narrow down the problem (although such patterns would most likely be very useful).

The extraction in MerkOrExtraction is based on the IceNLP tagset. If you want to use MerkOr to extract other kinds of patterns in another format, have a look at the classes is.merkor.patternextraction.PatternExtraction.java and is.merkor.patternextraction.PatternInfo.java. Unfortunately, many rules are hard-coded, so if many changes are needed you might be better off writing your own pattern extraction class.

Rules:

Allowed phrases in a pattern:
    a noun phrase ([NP), a prepositional phrase ([PP), or an adjective phrase ([AP).
    Conjunctions are allowed, but a pattern may not end with a conjunction phrase ([CC).

A pattern may not end with an NP or a PP that does not contain a noun.
A pattern has to contain at least two nouns, two adjectives, or one noun and one adjective.
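
The content rule (at least two nouns, two adjectives, or one of each) can be sketched as follows. In the IceNLP tagset, noun tags start with 'n' and adjective tags with 'l'; this is an illustrative Python check, not the package's actual implementation:

```python
def has_required_content(tags):
    # Count nouns ('n...') and adjectives ('l...') among the IceNLP
    # tags of a candidate pattern.
    nouns = sum(1 for t in tags if t.startswith('n'))
    adjectives = sum(1 for t in tags if t.startswith('l'))
    return nouns >= 2 or adjectives >= 2 or (nouns >= 1 and adjectives >= 1)

# has_required_content(['nxeng', 'aþ', 'nxeþ']) -> True (two nouns)
# has_required_content(['nxeng', 'aþ']) -> False (only one noun)
```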

For each extracted pattern, an abstract version and its realisation are returned.

Example:
Input: ...
[SCP að c SCP]
[NP markmiðið nheng NP]
[PP með aþ [NP komu nveþ sinni feveþ NP] PP]
[AdvP hingað aa AdvP]
...

Output: [NP nxeng ] [PP með  aþ [NP nxeþ  fexeþ ]] =>   
[NP markmiðið nheng NP] [PP með aþ [NP komu nveþ sinni feveþ NP] PP]

The 'x' in the pattern stands for the gender tag which has been neutralized.
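
For noun tags, the neutralization can be sketched like this (assuming, as in the examples above, that the second character of an IceNLP noun tag encodes gender; other word classes encode it elsewhere, e.g. 'feveþ' becomes 'fexeþ'):

```python
def neutralize_noun_gender(tag):
    # Replace the gender character (second position) of a noun tag
    # with 'x'; tags of other word classes are returned unchanged here.
    if tag.startswith('n') and len(tag) > 1:
        return tag[0] + 'x' + tag[2:]
    return tag

# neutralize_noun_gender('nheng') -> 'nxeng'
# neutralize_noun_gender('nveþ')  -> 'nxeþ'
```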

To run the MerkOr pattern extraction as is:
java -jar MerkorExtraction.jar -extract_patterns -input <inputfile_or_dir> -output <output.csv>

Patterns to database

The patterns should be written to a database of the form:

Table pattern:  
id (pk) - pattern - nr_of_occurrences - relation  
Table pattern_realisation:  
id (pk) - pattern_id (fk) - text  

In the present version there is only one database table, storing all pattern realisations as an array of texts together with the respective pattern. This has to be reorganized, and the classes (in package is.merkor.patternextraction.patterns_to_db) that write the patterns into the database have to be completely rewritten.

Choose promising patterns (manual work!)

With all patterns and their realisations in a database, the 'PatternVerificationTool' can be used to scan quickly over the patterns with their realisations and assess whether they are likely to represent a relation. To have a look at this, run is.merkor.patternextraction.VerificationFrameTabbed.java.

After this step, all classified patterns can be extracted for further processing. For MerkOr, all patterns occurring at least 10 times in the corpus (about 5,300 patterns) were manually scanned, resulting in about 2,300 potentially useful patterns.

Merging patterns - edit distance

All patterns classified as indicating some relation are merged using the Levenshtein edit distance measure. Example:

Original patterns:
[PP í  aþ [NP nxeþ ][NP nxee-ö ]]
[PP Í  aþ [NP nxeþ ][NP nxeeg ]]
Normalized patterns:
[pp í aþ [np nþ][np ne-s]]
[pp í aþ [np nþ][np neg]] 
Merged pattern:
[pp í  aþ [np nþ ][np ne-s|neg ]]
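
The distance measure itself is standard; a minimal Python version follows, applied to the two normalized patterns above (the merge threshold used by PatternMerger is not documented here):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

p1 = '[pp í aþ [np nþ][np ne-s]]'
p2 = '[pp í aþ [np nþ][np neg]]'
# levenshtein(p1, p2) == 2: the patterns differ only in 'ne-s' vs 'neg',
# so they are close merge candidates.
```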

If you have your patterns in a database, set the name of your database in is.merkor.patternextraction.PatternMerger#getPatternsWithRelation().

Then run:

java -jar MerkorExtraction.jar -merge_patterns -output <output_file>  
-relation <relation_of_patterns_to_merge> -password <your_db_passwd>

Running this should show something like (results are written to output_file):

connected to database: patterns
--- merged 388 patterns for relation 'genitive'
--- nr. of merged patterns: 114

This significantly reduces the number of patterns to handle in the relation extraction. Additionally, further manual merging using more complicated regular expressions was performed for MerkOr.
NOTE: It is also possible to leave this step out completely; that way, more exact data about the reliability of individual patterns can be collected later on.

Relation Extraction

Package is.merkor.relationextraction

UIMA

For the relation extraction a UIMA pipeline was implemented (see Apache UIMA). The pipeline uses external descriptors that describe the annotation task for a given analysis process, along with resources such as the patterns generated in the previous step. In the following, the UIMA pipeline for relation extraction is described.

Annotating words and part-of-speech tags

UIMA is a framework that allows developers to define annotation tasks and provides, for example, a mechanism for defining the start and end positions of annotations in the text.
The first task is to annotate all valid words and POS tags. Example:

....  
[VP Eiga sfg3fn VP]  
[NP rúm nhfn 50% tp NP]  
[PP í aþ [NP Keflavíkurverktökumundirfyrirsögn nkeþ-m NP] PP]  
...  

In this text there are three Word objects (begin and end positions refer to a larger text):

Word ("Eiga"), begin = 227, end = 231, word_string = Eiga  
Word ("rúm"), begin = 247, end = 250, word_string = rúm  
Word ("í"), begin = 271, end = 272, word_string = í   

The other two potential Word objects, '50%' and 'Keflavíkurverktökumundirfyrirsögn', are not valid lexical items and are hence ignored. All POS tags are analysed:

POS ("sfg3fn"), begin = 232, end = 238, word_class = verb, casus = "", TreeTaggerTag = VV  
POS ("nhfn"), begin = 251, end = 255, word_class = noun, casus = nominative, TreeTaggerTag = NNS  
POS ("tp"), begin = 260, end = 262, word_class = number, casus = "", TreeTaggerTag = CD  
POS ("aþ"), begin = 273, end = 275, word_class = adverb, casus = "", TreeTaggerTag = IN  
POS ("nkeþ-m"), begin = 314, end = 320, word_class = noun_proper, casus = dative, TreeTaggerTag = NP

Note that a mapping to a common international tagset is included, marked here as 'TreeTaggerTag' and referring to the TreeTagger tagset.
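
As a toy excerpt of what a mapping resource like posTagMapping.txt contains, using only the pairs from the example above (the real resource covers the full tagset):

```python
# Excerpt only; pairs taken from the example annotations above.
ICE_TO_TREETAGGER = {
    'sfg3fn': 'VV',   # verb
    'nhfn': 'NNS',    # noun, nominative
    'tp': 'CD',       # number
    'aþ': 'IN',       # adverb
    'nkeþ-m': 'NP',   # proper noun, dative
}

def map_tag(ice_tag):
    # Unknown tags fall through unchanged in this sketch.
    return ICE_TO_TREETAGGER.get(ice_tag, ice_tag)
```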

Every valid Word-POS pair is finally combined into a PairWordPOS object (in these examples no valid lemma was found for the words):

PairWordPOS ("Eiga sfg3fn"), begin = 227, end = 238, pos = POS ("sfg3fn"), word = Word ("Eiga"), lemma = null  
PairWordPOS ("rúm nhfn"), begin = 247, end = 255, pos = POS ("nhfn"), word = Word ("rúm"), lemma = null    
PairWordPOS ("í aþ"), begin = 271, end = 275, pos = POS ("aþ"), word = Word ("í"), lemma = null

The descriptor for this annotation task is PairInOneDetector.xml; the necessary resource files are nonWordRegEx.txt and posTagMapping.txt. To view the annotations in the UIMA Annotation Viewer (available as a .launch file in the UIMA run configuration), the PairInOneDetectorStyleMap.xml descriptor is also needed. This annotation can of course be run on its own, but for MerkOr its main purpose is to prepare for relation extraction.

Annotating semantic relations

The descriptors for the relation extraction are RelationDetector.xml and RelationAnnotator.xml, and the resource file containing the regular expressions for relation extraction is ruleMapping.txt. The files needed for the annotation of words and part-of-speech tags are needed here as well, but that annotation process is included in the pipeline, so the relation extraction runs directly on IceNLP-parsed text.
An example of annotated relations:

...  
[SCP að c SCP]  
[NP breytingar nvfn NP]  
[PP á aþ [NP eignarhaldi nheþ NP] PP]  
[VP verði svg3en VP]  
...

The relation found in this text:

SemRelation("[NP breytingar nvfn NP] [PP á aþ [NP eignarhaldi nheþ NP] PP]")  
begin = 4923, end = 4984, relation = áDat, word1 = breyting_11402, word2 = eignarhald_427818

Since the relation annotator has access to the annotations of the word-pos annotator, it is possible to detect lemmata for the relations. The numbers attached to the lemmata are the ids from the BIN database (see Preprocessing above).

Running annotators

At the moment it is not possible to run the annotators from the command line (coming up!); they have to be run from is.merkor.cli.MainAnnotators. Running from Eclipse: set the arguments in Run As -> Run Configurations ... to descriptors/RelationDetector.xml <directory-of-the-files-to-analyze>.
The command-line interface will also get a flag option for the output: the output for the UIMA Annotation Viewer (or some other .xmi-consuming program) is very large and should be disabled for large input. At the moment this has to be done manually in is.merkor.relationextraction.MerkorEngine#processBuffer: comment out the line writeAnnotationsForFile().
The results needed for further processing in MerkOr are written to a folder relationDetectorResults, each relation to its own file, e.g. coordNouns.csv. Note that the writing mode is appending: each run of the annotator appends to the already created relation files. An option to choose 'append' or 'overwrite' will be included in the command-line interface when it is ready.

The format of the result files, example from coordNouns.csv:

ýsa_14811		steinbítur_102562	[NPs [NP ýsu nveþ NP] , , [NP ufsa nkeþ NP] [CP og c CP] [NP steinbít nkeþ NP] NPs]  
steinbítur_102562	ufsi_7690	[NPs [NP ýsu nveþ NP] , , [NP ufsa nkeþ NP] [CP og c CP] [NP steinbít nkeþ NP] NPs]  
ýsa_14811		ufsi_7690	[NPs [NP ýsu nveþ NP] , , [NP ufsa nkeþ NP] [CP og c CP] [NP steinbít nkeþ NP] NPs]  

Relations are always binary, so patterns containing relations between more than two words are split into the necessary number of binary relations. Each line contains the first word of the relation, the second word, and the realisation of the pattern it was extracted from, all tab-separated.
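
The splitting into binary relations can be sketched with itertools.combinations: a coordination of three nouns, as in the example above, yields three pairs, each carrying the full pattern realisation (the pair ordering here is illustrative):

```python
from itertools import combinations

nouns = ['ýsa_14811', 'ufsi_7690', 'steinbítur_102562']
realisation = ('[NPs [NP ýsu nveþ NP] , , [NP ufsa nkeþ NP] '
               '[CP og c CP] [NP steinbít nkeþ NP] NPs]')

# Every unordered pair of coordinated nouns becomes one binary relation.
pairs = [(a, b, realisation) for a, b in combinations(nouns, 2)]
# len(pairs) == 3, matching the three coordNouns.csv lines above.
```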

Storing Relations in a database

All extracted relations are to be stored in a database. Running

java -jar MerkorExtraction.jar -relations2dbstatements

results in a file insertToLexRel.sql containing SQL insert statements for all relations from the relation extraction where both words were lemmatized and connected to an id from the BÍN database. The insert statements contain only ids: ids of lexical items and ids of relation types. The insert statement for the 'ýsa - steinbítur' example above would be (the relation-type id for coordinated nouns is 7):

INSERT INTO lex_relations_complete (rel_id, from_lex_unit, to_lex_unit) VALUES (7, 14811, 102562);

After the lex_relations_complete table has been created, the -fill_db option for MerkorExtraction can be used to execute the insert statements.
