GitHub - devalapr/IE_LegalText_RUTA: Information Extraction From Legal Text using Eclipse UIMA RUTA

Repository contents:
.settings
RefExtractor_lib
bin
descriptor
input
output
resources
script
src/main
.buildpath
.classpath
.gitignore
.project
EurLexExtractor.jar
InputFile.txt
Lengthy_Annotations.py
README.txt
RefExtractor.jar
json-simple-1.1.1.jar
pom.xml


*************** INSTALLING RUTA AND RUNNING THE PROJECT ********************

1. Install Eclipse and add UIMA and UIMA RUTA. See https://uima.apache.org/ruta.html
2. Select the RUTA perspective.
3. Import the project as a Maven project.

*************** EXECUTION FILES FOR UIMA RUTA ********************

Files to be run in the following sequence:
1. DBPedia.java stored at /src/main/java -- Get the DBPedia Links
2. Main.ruta stored at /script -- Run and View the Annotations
3. ConfigurableExporter.java stored at /src/main/java -- Generate the output in .txt format

Output Files:
1. 2_Geschäftsbeziehung_und_Bankvertrag_raw.txt.xmi stored at /output -- Output in .xmi format
2. toc.txt stored at /output -- Output in .txt format
3. dboutput.txt stored at /resources -- Output for DBpedia Annotations

*************** GETTING STATISTICS IN UIMA RUTA ********************

Evaluation setup:
UIMA(statistics view)

To open the RUTA statistics view in Eclipse: Window -> Show View -> Other -> UIMA RUTA -> Statistics view

By default the "statistics" configuration parameter is set to false in RUTA. It must be set to true for the statistics view to be populated.

To change it, open BasicEngine.xml in the resources of the project and add the following:

<configurationParameterSettings>
  <nameValuePair>
    <name>statistics</name>
    <value>
      <boolean>true</boolean>
    </value>
  </nameValuePair>
</configurationParameterSettings>

Add the above name-value pair to the configuration parameter settings and then run the RUTA script. The statistics view then shows the time spent on each step of executing the script.

********* DBPEDIA TAGGER WORK AROUND *************

For the DBpedia API we used the following endpoint and parameters:
Endpoint: http://model.dbpedia-spotlight.org/de/annotate
Input parameters:
1. Text
2. Confidence
3. Support
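
A minimal sketch of how such a request could be built in Python, in case it helps to test the endpoint outside of DBPedia.java. Only the endpoint and the three parameter names come from the list above. The function names, the default values for confidence and support, and the JSON Accept header are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as listed above.
SPOTLIGHT_URL = "http://model.dbpedia-spotlight.org/de/annotate"


def build_request(text, confidence=0.5, support=0):
    """Build a GET request for the annotate endpoint.

    The default confidence and support values are assumptions,
    not the values used in this project.
    """
    query = urllib.parse.urlencode(
        {"text": text, "confidence": confidence, "support": support}
    )
    # Assumption: ask for a JSON response via the Accept header.
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def annotate(text, confidence=0.5, support=0, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_request(text, confidence, support)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling annotate("...") needs network access and a reachable endpoint. build_request can be inspected offline to check the generated URL.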

****** ADDING MORE INFORMATION ABOUT THE PYTHON SCRIPT AND THE JAR FILE ******
Lengthy_Annotations.py is a Python script that finds lengthy annotations so that false positives can be identified manually.
The script takes two arguments, the input folder and the output folder:
Usage: python3 Lengthy_Annotations.py inputfolder outputfolder

The program asks for a number of standard deviations. For example, suppose there are 10 Gericht annotations with a mean length of 10 and 20 Gesetz annotations with a mean length of 20. If 2 is entered, the program outputs only those references whose length is more than 2 standard deviations away from the mean.

The program produces two files in the output folder: outputfile and outputfilewhole.csv

outputfile contains the annotations for which the above calculation is done at the level of each file.
outputfilewhole.csv contains the annotations for which the above calculation is done at the level of the entire folder.
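
The calculation described above can be sketched roughly as follows. This is not the code of Lengthy_Annotations.py. The function name, the (type, text) input format, the use of the population standard deviation, and the choice to compute the mean per annotation type and to count deviations in both directions are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def lengthy_annotations(annotations, num_std):
    """Return the (type, text) pairs whose text length lies more than
    num_std standard deviations away from the mean length of their type.

    annotations: iterable of (annotation_type, annotation_text) pairs.
    """
    # Group the annotation texts by type, e.g. "Gericht" or "Gesetz".
    by_type = defaultdict(list)
    for ann_type, text in annotations:
        by_type[ann_type].append(text)

    outliers = []
    for ann_type, texts in by_type.items():
        lengths = [len(t) for t in texts]
        mu = mean(lengths)
        sigma = pstdev(lengths)  # assumption: population standard deviation
        if sigma == 0:
            continue  # all lengths are identical, nothing stands out
        for t in texts:
            if abs(len(t) - mu) > num_std * sigma:
                outliers.append((ann_type, t))
    return outliers
```

Running this on the annotations of one file corresponds to the file-level output, and running it on the annotations of all files together corresponds to the folder-level output.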

----------------------------------------------------------------------------------------
----------- END OF INFORMATION ABOUT THE PYTHON SCRIPT. NOW THE JAR FILE ---------
----------------------------------------------------------------------------------------

After importing the folder as a Maven project:
1. Export the project to a runnable jar.
1.1 Choose the option: "Package the required libraries into a new folder".
1.2 Give a name to the exported jar file. In this case, I named it "RefExtractor.jar"
2. This creates the jar file on the local file system.

P.S.: If the rules are modified, repeat steps 1 and 2 to create a new JAR file.

Now, to use this JAR file in a pipeline, issue the following command:

java -cp <<path_to_the_library_folder>> -jar RefExtractor.jar <<input_folder_path>> <<output_folder_path>>

Eg: java -cp C:\RefExtractor_lib -jar C:\RefExtractor.jar C:\Rakesh\input C:\Rakesh\output

This will create REF annotations for the files present in the input folder and write those annotations to respective files in the output folder.
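
If the pipeline is driven from Python, the same command can be wrapped as in the sketch below. The function names are hypothetical. The argument order simply mirrors the command shown above, and java is assumed to be on the PATH.

```python
import subprocess


def build_command(lib_dir, jar_path, input_dir, output_dir):
    """Return the java command from above as an argument list."""
    return ["java", "-cp", lib_dir, "-jar", jar_path, input_dir, output_dir]


def run_ref_extractor(lib_dir, jar_path, input_dir, output_dir):
    """Run RefExtractor.jar and raise CalledProcessError if it fails."""
    subprocess.run(
        build_command(lib_dir, jar_path, input_dir, output_dir),
        check=True,
    )
```

Passing the arguments as a list avoids quoting problems with folder paths that contain spaces.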

NOTE: For input files which do not contain any annotations, an empty file is still created in the output folder.
These zero-sized files can be deleted by running the following command in the output folder:

find . -type f -size 0 -delete
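
The find command is only available on Unix-like systems, while the example paths above are Windows paths. A rough Python equivalent is sketched below. The function name is hypothetical and the script is not part of this repository.

```python
from pathlib import Path


def delete_empty_files(folder):
    """Delete all zero-sized files below folder and return their paths."""
    removed = []
    # Collect the paths first so that deleting does not disturb the walk.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.stat().st_size == 0:
            path.unlink()
            removed.append(path)
    return removed
```

Like find with -type f, this walks all subfolders of the given folder and leaves empty folders in place.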

Update: 25-10-2019: Just found out that the RefExtractor_lib folder is missing one jar file, which is larger than 25MB.
This is the jar name: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-upstream-tagger-de-hgc-20140827.jar
This jar is probably not required for the pipeline. However, if it is required, I suggest that you build the Maven project and create the jar file yourself.
