Final project for the Natural Language Processing and Text Mining course (521158S). The project compares Old English and modern English corpora to analyze vocabulary patterns and stopword behavior, testing Heaps' and Zipf's Laws and evaluating stopword detection methods against the NLTK English stopword list.
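The Heaps'- and Zipf's-law checks described above can be sketched in a few lines (an illustrative sketch only, not the notebook's actual code; the sample token list is made up):

```python
from collections import Counter

def zipf_products(tokens, top_n=5):
    """Under Zipf's law, rank * frequency is roughly constant,
    so these products should stay in the same ballpark."""
    ranked = Counter(tokens).most_common(top_n)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

def heaps_curve(tokens):
    """Vocabulary size after each token; Heaps' law predicts growth
    roughly proportional to N**beta for some 0 < beta < 1."""
    seen, sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(zipf_products(tokens, top_n=3))
print(heaps_curve(tokens))
```

Fitting the actual exponents (as the notebook does on the real corpora) would then be a log-log regression over these counts.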

📊 Datasets

This project uses multiple datasets for its corpus analysis. Below is a detailed description of each dataset and its purpose within the project.

1. Old English Corpora

  • Name: Old English Corpora (Kielipankki / ORACC 2017-09)
  • Source: [Link to the website]
  • Description:
    Contains:
    • Corpus of Ancient Mesopotamian Scholarship (CAMS)
    • Digital Corpus of Cuneiform Lexical Texts (DCCLT)
    • Royal Inscriptions of Babylonia Online (RIBO)
    • Royal Inscriptions of the Neo-Assyrian Period (RINAP)
    • State Archives of Assyria Online (SAAO)
  • Format: VRT
  • Size: 23 MB
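VRT ("verticalized text") files carry one token per line as tab-separated annotation fields, with XML-style structure tags on lines of their own. A minimal reader under that assumption might look like this (a sketch, not the project's code; the exact field layout of this corpus may differ):

```python
def read_vrt_tokens(lines):
    """Yield the word form (first tab-separated field) from VRT lines,
    skipping XML-style structural tags such as <text> or <sentence>."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("<"):
            continue
        yield line.split("\t")[0]

sample = ["<text id='1'>", "<sentence>", "šarru\tNOUN", "rabû\tADJ",
          "</sentence>", "</text>"]
print(list(read_vrt_tokens(sample)))  # ['šarru', 'rabû']
```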

2. Modern English Corpus

  • Name: BBC News Dataset (Kaggle)
  • Source: [Link to the website]
  • Description:
A self-updating dataset that collects RSS feeds from BBC News via a Kaggle kernel (https://www.kaggle.com/gpreda/bbc-news-rss-feeds). The kernel runs at a fixed frequency, and the dataset is updated from its output. The data contains the following columns:
    • title
    • pubDate
    • guid
    • link
    • description
  • Format: CSV
  • Size: 14 MB
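Given the columns above, the CSV can be loaded and reduced to a single text field along these lines (a sketch; pandas is one reasonable choice, and the filename is an assumption, not part of the dataset):

```python
import pandas as pd

def load_news_texts(path):
    """Read the BBC News CSV and join headline and summary into one
    text series for tokenization (title and description are two of
    the columns listed above)."""
    df = pd.read_csv(path)
    return df["title"].fillna("") + " " + df["description"].fillna("")

# Hypothetical filename; use the path of the downloaded Kaggle file.
# texts = load_news_texts("bbc_news.csv")
```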

💻 Installation and Setup

Follow these steps in your terminal or command prompt:

  1. Clone the Repository: Get a local copy of the project files using Git.

    git clone https://github.com/Sahe00/NLPTM_finalproject.git
    cd NLPTM_finalproject
  2. Create and Activate a Virtual Environment (Recommended): Use a virtual environment to manage dependencies and avoid conflicts.

    python -m venv venv
    # On Windows (Command Prompt):
    .\venv\Scripts\activate
    # On macOS/Linux:
    source venv/bin/activate
  3. Install Dependencies: Install all required libraries listed in requirements.txt:

    pip install -r requirements.txt

🏃 Running the Project

The core analysis is performed within the primary Jupyter Notebook.

Option A: Running via Terminal/Browser (Standard)

  1. Launch Jupyter Lab/Notebook: Start the local server from the project root directory:

    jupyter lab
    # OR
    jupyter notebook
  2. Execute the Code: A web browser will open. Navigate to and click on project.ipynb, then run the cells sequentially to reproduce the results.

Option B: Running via Visual Studio Code (IDE)

  1. Open the Project in VS Code: Open the NLPTM_finalproject folder in VS Code.

  2. Select the Python Kernel

    • Open project.ipynb.
    • In the top right corner of the notebook interface, select the Python kernel associated with your newly created venv environment.
    • Note: Ensure you have the Python and Jupyter extensions installed in VS Code.
  3. Execute the Code: Run the cells one by one, or select "Run All" in the notebook interface to perform the analysis.
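One simple way to frame the stopword evaluation the notebook performs: detect candidate stopwords by frequency and score them against a reference list such as NLTK's (an illustrative sketch using a toy reference set; the project's actual detection methods and thresholds may differ):

```python
from collections import Counter

def detect_stopwords(tokens, top_k=3):
    """Naive frequency-based detector: flag the top_k most common words."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}

def precision_recall(predicted, reference):
    """Score detected stopwords against a reference stopword set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall

tokens = "the a the of cat the of a dog".split()
reference = {"the", "a", "of", "in"}  # stand-in for the NLTK English list
detected = detect_stopwords(tokens)
print(precision_recall(detected, reference))  # (1.0, 0.75)
```

In the real comparison, `reference` would be `set(nltk.corpus.stopwords.words("english"))`.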

