A lightweight, end-to-end data mining pipeline that fetches news articles from the NYTimes API, preprocesses text using Python (Pandas, NLTK, spaCy), and uncovers hidden topics with NMF and KMeans clustering, visualized via insightful word clouds.
- NYTimes API Integration: Retrieve news articles with custom queries and date ranges.
- Secure API Key Management: Uses a .env file to keep your API key private.
- Text Preprocessing: Lowercases, cleans, and removes stopwords from article text.
- NER & Topic Modeling: Extracts named entities and discovers latent topics.
- Clustering: Groups similar articles for deeper analysis.
Run the command below in your terminal:
git clone https://github.com/shoibolina/NYTimes-mining.git
Create a .env file in the project directory from the terminal to securely store the API key:
touch .env
Create an account at The New York Times and get your Article Search API key. Open the .env file you just created and enter your API key as follows:
NYTIMES_API_KEY=your_nytimes_api_key
This file is included in .gitignore to prevent the API key from being shared or committed.
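To show how the key stored in .env might be read and used, here is a minimal standard-library sketch. The helper names `read_env_file` and `build_search_url` are hypothetical and not taken from the notebook, which may well use a package such as python-dotenv instead. The endpoint and the `q`, `begin_date`, `end_date`, `page`, and `api-key` parameters follow the public Article Search API as I understand it, so check them against the official documentation. No network request is made here.

```python
from pathlib import Path
from urllib.parse import urlencode

# Article Search endpoint (assumed from the public NYTimes API docs).
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_search_url(api_key, query, begin_date, end_date, page=0):
    """Build an Article Search URL; dates use the YYYYMMDD format."""
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    return f"{ARTICLE_SEARCH_URL}?{urlencode(params)}"


# Only runs when a .env file exists in the current directory.
if Path(".env").exists():
    api_key = read_env_file(".env").get("NYTIMES_API_KEY", "")
    url = build_search_url(api_key, "climate change", "20240101", "20240131")
    # Avoid printing the full URL, since it contains the API key.
    print(url.split("api-key=")[0] + "api-key=<hidden>")
```

The resulting URL can be passed to any HTTP client. Keeping the key out of logs and printed output matters as much as keeping it out of version control.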
Install the Python libraries listed in Language & Libraries.
Follow the comments in the notebook and have fun exploring!
Shoibolina Kaushik
Master of Science, Computer Science (25G)
Emory University

