This project includes two main components: a data gathering script that collects content and comments from various online sources (Reddit, Guardian [not fully tested yet], and Lemmy) and a Natural Language Processing (NLP) analysis script that analyzes the gathered data. The analysis includes sentiment analysis, emotion classification, keyword extraction, and visualizations of trends and distributions.
Data Gathering:
- Collects posts and comments from Reddit and Lemmy.
- Fetches articles from the Guardian API [not fully tested yet].
- Supports keyword-based searches with time filters.
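To illustrate what a keyword-based search with a time filter amounts to, here is a minimal sketch that filters already-fetched posts. The function name `filter_posts` and the dictionary fields `title`, `body`, and `created_utc` (epoch seconds) are assumptions made for this example and may not match what `data_gathering.py` actually does.

```python
from datetime import datetime, timezone


def filter_posts(posts, keyword, since, until=None):
    """Return posts that mention `keyword` and were created within [since, until].

    `posts` is assumed to be a list of dicts with hypothetical fields
    `title`, `body`, and `created_utc` (seconds since the Unix epoch).
    """
    until = until or datetime.now(timezone.utc)
    needle = keyword.casefold()
    matches = []
    for post in posts:
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        text = f"{post.get('title', '')} {post.get('body', '')}".casefold()
        # Keep the post only if both the keyword and the time window match.
        if needle in text and since <= created <= until:
            matches.append(post)
    return matches


# Example data: two posts mention "solar", but only one is inside the window.
sample_posts = [
    {"title": "Solar power boom", "body": "", "created_utc": 1704067200},  # 2024-01-01 UTC
    {"title": "Old solar news", "body": "", "created_utc": 1609459200},    # 2021-01-01 UTC
    {"title": "Wind farms", "body": "offshore", "created_utc": 1704067200},
]
window_start = datetime(2023, 1, 1, tzinfo=timezone.utc)
window_end = datetime(2024, 12, 31, tzinfo=timezone.utc)
recent_solar = filter_posts(sample_posts, "solar", window_start, window_end)
```

In practice the source APIs may apply such filters on the server side; this sketch only shows the same idea applied locally.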
NLP Analysis:
- Performs sentiment analysis using NLTK and TextBlob.
- Extracts keywords and summarizes text using LSA.
- Classifies emotions using a pre-trained transformer model.
- Generates visualizations for sentiment trends, emotion distributions, and more.
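As a rough illustration of how scores from two sentiment tools might be combined, the sketch below averages the output of several scoring functions and maps the result to a label. Averaging, the helper names `average_polarity` and `polarity_label`, and the 0.05 threshold are all assumptions for this example; the actual logic in `nlp_analysis.py` may differ.

```python
from statistics import mean


def average_polarity(text, scorers):
    """Average the polarity returned by each scorer (each assumed to be in [-1, 1])."""
    return mean(scorer(text) for scorer in scorers)


def polarity_label(polarity, threshold=0.05):
    """Map a polarity score to a coarse label; the threshold is an assumed default."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


# With the real libraries installed, the scorers could plausibly be built like this
# (assuming NLTK's VADER analyzer is the NLTK component in use):
#   import nltk
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   from textblob import TextBlob
#   nltk.download("vader_lexicon")
#   vader = SentimentIntensityAnalyzer()
#   scorers = [
#       lambda t: vader.polarity_scores(t)["compound"],
#       lambda t: TextBlob(t).sentiment.polarity,
#   ]

# Stand-in scorers so the sketch runs without any third-party package.
stub_scorers = [lambda t: 0.75, lambda t: 0.25]
combined = average_polarity("I love this project", stub_scorers)
```

Passing the scorers in as plain callables keeps the combination step independent of any particular library.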
- Python 3.11
- Required Python packages:
  - praw
  - requests
  - nltk
  - pandas
  - textblob
  - sumy
  - transformers
  - tqdm
  - matplotlib
  - seaborn
  - wordcloud
  - joypy
  - python-dotenv
- Clone the repository:
  git clone https://github.com/MasoudMiM/internet-lens.git
  cd internet-lens
- Create and activate a virtual environment (optional):
  conda env create -f environment.yml
  conda activate data-gathering-nlp
- Set up environment variables: Create a .env file in the root directory of the project and add the following variables:
  CLIENT_ID=your_reddit_client_id
  CLIENT_SECRET=your_reddit_client_secret
  USER_AGENT=your_user_agent
  LEMMY_USERNAME=your_lemmy_username
  LEMMY_PASSWORD=your_lemmy_password
  GUARDIAN_API_KEY=your_guardian_api_key
- Run the Data Gathering Script: Execute the first script to gather data from the specified sources:
  python data_gathering.py
- Run the NLP Analysis Script: After gathering data, run the second script to perform NLP analysis:
  python nlp_analysis.py
- Output:
  - The gathered data will be saved in the data directory.
  - The analysis results and visualizations will be saved in the outputs directory.
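Before running the scripts, it can help to confirm that the environment variables from the setup step are actually defined. The following is a minimal sketch; the helper name `missing_variables` is hypothetical, and treating all six variables as mandatory is an assumption (the Guardian key, for instance, may only matter if that source is used).

```python
import os

# Variable names taken from the .env example in the setup step.
REQUIRED_VARS = (
    "CLIENT_ID",
    "CLIENT_SECRET",
    "USER_AGENT",
    "LEMMY_USERNAME",
    "LEMMY_PASSWORD",
    "GUARDIAN_API_KEY",
)


def missing_variables(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# With python-dotenv installed, the .env file would typically be loaded first:
#   from dotenv import load_dotenv
#   load_dotenv()
#   print(missing_variables())

# Example with an explicit mapping instead of the real environment.
example_env = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USER_AGENT": "demo"}
example_missing = missing_variables(example_env)
```

An empty list means every variable in `REQUIRED_VARS` has a non-empty value.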
The project generates various visualizations, including:
- Trends of posts and comments over time.
- Average sentiment of posts and comments.
- Distribution of emotion scores.
- Word clouds of keywords.
- Stance distribution and discourse complexity.
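The trend and average-sentiment plots listed above could be built from a per-day aggregate such as the one sketched here. The function name `daily_summary` and the record fields `created_utc` (epoch seconds) and `sentiment` are assumptions for illustration, not the project's actual data schema.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def daily_summary(records):
    """Group records by UTC calendar day and report count and mean sentiment."""
    scores_by_day = defaultdict(list)
    for record in records:
        day = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc).date()
        scores_by_day[day].append(record["sentiment"])
    # Sort by date so the result can be plotted directly as a time series.
    return {
        day.isoformat(): {"count": len(scores), "mean_sentiment": mean(scores)}
        for day, scores in sorted(scores_by_day.items())
    }


# Example records with hypothetical sentiment scores.
sample_records = [
    {"created_utc": 1704153600, "sentiment": 0.25},  # 2024-01-02 00:00 UTC
    {"created_utc": 1704067200, "sentiment": 0.5},   # 2024-01-01 00:00 UTC
    {"created_utc": 1704070800, "sentiment": -0.5},  # 2024-01-01 01:00 UTC
]
summary = daily_summary(sample_records)
```

The resulting dictionary can be turned into a pandas DataFrame and passed to matplotlib or seaborn for plotting.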
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or inquiries, please contact [masoumi.masoud@gmail.com].