Final project for 601.615 Databases
A database for English words: lexicon, complexity, topics and frequency
- Download corpora and other data from here
- The generated SQL files for populating the database can be downloaded here:
- Use the CreateDB.sql file to construct the tables in your database. (Uncomment the first three lines in the file if you want to use the default database name.)
.
├── data
├── sql-populatedb
├── LemmaAndSyn.py
├── ProcessCommonWords.py
- Create the directory for the generated .sql files, then run the Python script that generates the SQL file for loading the tables LEMMA, SYN, MEANS, HYPONYM, DERIVED, ANTONYM and MORPH. MORPH exceptions (*.exc) should be under the data/ path unless otherwise defined in the script.
mkdir sql-populatedb
python LemmaAndSyn.py
- Run sql-populatedb/LoadLemmaAndSyn.sql to populate the database with word and concept information.
- Process common word flags with an easy/common word list of your choice under the data/ path.
- Run the generated sql-populatedb/LoadCommonWords.sql to set the isCommon field in the LEMMA table.
- Use the bash and Python scripts under corpus_preprocess to preprocess the corpora provided in the corpus/ folder, or skip to the next step.
- Run the (generated) sql-populatedb/add-corpus*.sql files to load the corpora and their corresponding word counts into the database.
- Run LoadTopics.sql for the topic information.
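
The load order described above can be checked before anything is run against the database. The following is a minimal sketch, not part of the project: the helper names `ordered_sql_files` and `missing_files` are hypothetical, and it assumes that CreateDB.sql and LoadTopics.sql sit in the repository root, which the steps above do not state. It only lists the files in order and reports which ones are not present yet; it does not connect to a database.

```python
from pathlib import Path


def ordered_sql_files(root="."):
    """Return the SQL files in the order the steps above run them.

    Assumption: CreateDB.sql and LoadTopics.sql are in the repository
    root; adjust the paths if your checkout differs.
    """
    root = Path(root)
    gen = root / "sql-populatedb"
    files = [
        root / "CreateDB.sql",        # construct the tables
        gen / "LoadLemmaAndSyn.sql",  # words and concepts
        gen / "LoadCommonWords.sql",  # isCommon field in LEMMA
    ]
    # The add-corpus*.sql files are generated, so pick up whatever exists.
    files.extend(sorted(gen.glob("add-corpus*.sql")))
    files.append(root / "LoadTopics.sql")  # topic information
    return files


def missing_files(files):
    """List the files that have not been downloaded or generated yet."""
    return [f for f in files if not f.is_file()]


if __name__ == "__main__":
    plan = ordered_sql_files()
    for step, path in enumerate(plan, start=1):
        print(f"{step}. {path}")
    for path in missing_files(plan):
        print(f"missing: {path}")
```

Running the script from the repository root prints the numbered plan followed by any missing files, which can then be fed to whichever SQL client you use, one file at a time.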