# PDF to NER Update
The files recorded in the table below correspond to key stages of the process described subsequently; the stage numbers refer to the numbered steps below.
| Stage | File | Notes |
|---|---|---|
| 1 | SAED100.pdf.zip | Original PDF files |
| 2 | SAED100.txt.zip | Text files after conversion from PDF |
| 4 | SAED100.out.zip | Annotated files with named entities and all sentences |
| 13 | baseline_non_entities.csv | List of non-entities |
| 14 | SAED100.conll.uncorrected | Single annotated file in CoNLL format with only the sentences needing correction |
| 14 | SAED100.conll | The same file with the actual corrections applied, for retraining |
The steps below are needed to update Named Entity Recognition (NER) after either new data arrives from an updated collection of PDFs (i.e., data changes) or the procedures for extracting named entities change (i.e., code changes):
1. Collect the PDFs and place them in `../corpora/SAED100/pdf`, for example.
2. Convert them to text with ScienceParseApp from the pdf2txt project. The most recent update was performed using code from commit 70e559. The arguments used were `org.clulab.pdf2txt.apps.ScienceParseApp -in ../corpora/SAED100/pdf -out ../corpora/SAED100/txt -case false`. Case is not corrected here.
3. On the text files, run ConditionalCaseMultipleFileApp from this habitus project (e.g., start `sbt` and use the `runMain` command with the name of the main class and any other arguments following), possibly from commit 5b843b*: `org.clulab.habitus.apps.ConditionalCaseMultipleFileApp ../corpora/SAED100/txt`. This converts each text file into two files, one with a name ending in `.txt.preserved` and another in `.txt.restored`. We're interested in the latter. So far these are all still text files, but they have been tokenized. *This commit made use of processors 8.5.2-SNAPSHOT, which was locally built (`sbt publishLocal`) from c546da9. To get the same results as in the next step, one would need to use the same version of processors.
4. From this same project, run LexiconNerMultipleFile: `org.clulab.habitus.entitynetworks.LexiconNerMultipleFile ../corpora/SAED100/txt .txt.restored`. This annotates all the sentences from the case-corrected text files and outputs them in tab-separated files containing the words and entities (in BIO notation). Sentences are separated by blank lines in files with the extension `.txt.restored.out`. (See the first sketch after this list.)
5. The next steps use code from the Habitus-SRE project, which can be downloaded with `git clone https://github.com/picsolab/Habitus-SRE`. Move or copy the folder `../corpora/SAED100/txt` to `./report_v5` because the programs expect files to be in a subdirectory of the project and some names in scripts are hard-coded. The project includes some files that will be regenerated by these instructions, so it is best to either move or remove them so that you know they have been reproduced by the end of the procedure:
   - `rm data/*_report_v5.csv`
   - `rm -r metrics_graph_json_report_v5`
   - `rm error_analysis/*.csv`
   - `rm coranking/*.csv`
6. You will probably need to install an executable program called `mallet`, a few Python packages with `pip`, and R libraries with RStudio when they are found to be missing. Note that Python may be installed as `python3` on your system.
   - Install mallet.
     - Download and unzip it. These instructions assume that the executables are available at `./mallet-2.0.8/bin`.
     - You may need to set the environment variable `MALLET_HOME`.
   - Use `pip install` for these Python libraries:
     - networkx
     - nltk
     - pyvis
     - spacy
     - gensim (use `pip install gensim==3.8.3`)
     - matplotlib
     - pandas
     - tqdm
     - sklearn
   - From within Python, download additional parts of nltk: `import nltk; nltk.download('popular'); exit()`
   - From the command line, download additional parts of spacy: `python -m spacy download en_core_web_lg`
   - RStudio should offer to install packages automatically when you open these files manually:
     - animate_importance.Rmd
     - R/compute_metrics.R
     - R/get_wrong_pred.R
7. Run get_sent_ner.py: `python get_sent_ner.py --folder report_v5`. It converts the individual `.txt.restored.out` files to a single file, `data/sent_ner_report_v5.csv`, with these changes: a header line is added; the tabs are converted to commas appropriate for a csv file; sentences are no longer separated by blank lines; and the specific named entity labels PER, ORG, and ACRONYM are replaced with a generic ANIMATE. (See the second sketch after this list.)
8. The next step involves get_graph.py. Run `python get_graph.py --file data/sent_ner_report_v5.csv`. This will produce the files `data/entityID_text_report_v5.csv` and `data/sentID_text_report_v5.csv` as well as the directories `graph_json_report_v5` and `graph_viz_report_v5`, containing 12 files each.
9. topic_model.py trains a model with the command `python topic_model.py --file data/sentID_text_report_v5.csv --mallet ./mallet-2.0.8/bin/mallet`. This step may not work under Windows. The file `data/topic_model_report_v5.csv` is produced as well as the image `elbow_chart.png`.
10. Next run hetero-network-embedding.py with the command `python hetero-network-embedding.py --folder report_v5` to produce `coranking/weight_Wvs_report_v5.csv` and `coranking/Wvs_report_v5.csv`.
11. Run get_coranking_graph.py with the command `python get_coranking_graph.py --folder report_v5` to produce four reports in the directory `graph_json_report_v5` with the `.json` extension and four more in the directory `graph_viz_report_v5` with the `.html` extension: `lda_ordered_top10_all`, `lda_ordered_top1_all`, `lda_top10_all`, and `lda_top1_all`.
12. Metrics are calculated with get_metrics.py. The command is `python get_metrics.py --folder graph_json_report_v5`. However, first increase the value for `max_iter` in one line of code to 5000, like `eig_cen = pd.DataFrame.from_dict(nx.eigenvector_centrality(G, max_iter=5000).items())`. The command creates a new directory, `metrics_graph_json_report_v5`, containing a large number of files (96). (See the third sketch after this list.)
13. In RStudio, open animate_importance.Rmd and run all cells. An error analysis should be produced in the files `error_analysis/baseline_non_entities.csv`, `error_analysis/baseline_pred_all.csv`, `error_analysis/coranking_non_entities.csv`, and `error_analysis/coranking_pred_all.csv`. For us, the first is the pertinent one.
14. Finally, switch back to this habitus project and run ExportNamedEntities2App on the file `error_analysis/baseline_non_entities.csv` and the directory of `.txt.restored.out` files to produce `SAED100.conll` for training. Depending on your directory structure, the command arguments to the App may be `../corpora/SAED100/baseline_non_entities.csv ../corpora/SAED100/txt ../corpora/SAED100/SAED100.conll`. The files `SAED100.conll` and `SAED100.conll.uncorrected` should be produced.
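The sketches below illustrate three of the steps above. First, a minimal reader for the tab-separated BIO output of step 4. It assumes two columns per line (word, then tag) with a blank line between sentences; the column order and the example file name are assumptions, not part of the original tooling.

```python
# A minimal reader for the .txt.restored.out files produced in step 4.
# Assumes two tab-separated columns per line (word, BIO tag) and a blank
# line between sentences; the file name below is a hypothetical example.
from pathlib import Path

def read_bio_file(path):
    """Yield each sentence as a list of (word, tag) pairs."""
    sentence = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():        # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        word, tag = line.split("\t")[:2]
        sentence.append((word, tag))
    if sentence:                    # handle a file with no trailing blank line
        yield sentence

for sentence in read_bio_file("report_v5/example.txt.restored.out"):
    print(sentence)
```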
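Second, a hedged sketch of the transformation get_sent_ner.py performs in step 7: merging the `.txt.restored.out` files into one CSV with a header, dropping the blank separator lines, and collapsing PER, ORG, and ACRONYM into the generic ANIMATE label. The header column names and the handling of B-/I- prefixes are assumptions; the actual script may differ in detail.

```python
# Hedged sketch of the conversion described in step 7: one CSV with a
# header, no blank separator lines, and PER/ORG/ACRONYM collapsed into
# ANIMATE. The header names ("word", "entity") and the B-/I- prefix
# handling are assumptions about the real get_sent_ner.py.
import csv
from pathlib import Path

ANIMATE = {"PER", "ORG", "ACRONYM"}

def remap(tag):
    """Map, e.g., B-PER to B-ANIMATE while leaving O and other tags alone."""
    if "-" in tag:
        prefix, label = tag.split("-", 1)
        if label in ANIMATE:
            return f"{prefix}-ANIMATE"
    return tag

with open("data/sent_ner_report_v5.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["word", "entity"])               # assumed header names
    for path in sorted(Path("report_v5").glob("*.txt.restored.out")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue                               # no blank lines in the CSV
            word, tag = line.split("\t")[:2]
            writer.writerow([word, remap(tag)])
```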
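Third, a note on the `max_iter` edit in step 12: networkx computes eigenvector centrality by power iteration and raises `PowerIterationFailedConvergence` when it does not converge within `max_iter` steps, so raising the limit to 5000 gives large or slowly mixing graphs room to converge. A minimal illustration on a stand-in toy graph (the real script loads graphs from `graph_json_report_v5`):

```python
# Toy demonstration of eigenvector centrality with an increased max_iter,
# as required in step 12. The karate club graph stands in for one of the
# graphs loaded from graph_json_report_v5.
import networkx as nx

G = nx.karate_club_graph()
eig_cen = nx.eigenvector_centrality(G, max_iter=5000)
print(sorted(eig_cen.items(), key=lambda kv: -kv[1])[:5])  # five most central nodes
```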