This repository contains several scripts developed to process Twitter data to investigate how soil organic content and health are related.
Two different data sources are used:
- Data collected directly from the Twitter API
- Data from the Twitter archives
For the archive data:
- Main: raw_data_processing.R:
- read raw twitter datasets from different sources (Json or csv format)
- clean and standardize to enable a merge
- simple analysis of what the data looks like
- to correct parsing errors found in the csv files derived from the API (cell overlap) use fixed_tweet.sh !!! This script needs to be edited from the command line and NOT from R, as it is dealing with hidden characters !!!
For the collected data via Twitter API:
- Main: automate.R
- runs every week collecting the last 6-9 days of twitter data based on query words from tag_list.csv
- cleans and standarize to enable merge
-
Inititial data exploration:
- Data_viz_script.R: Data visualization and exploration
- Sentiment_test.R: Used to explore text mining options with Archived/json data. Reproducible for the larger merged dataset.
-
More specific exploration and visualizations can be found in the following folders (see their respective README's for more detailed information about specific analyses):
- various way of visualizing the content of tweets by different categoriestweet_content
- attempts to identify what type of content appeals to different user groups influencers
- each of these ^ rely on the functions within text_analysis_functions.R
translation folder contains scripts for translating hindi using google translate via webinterface
pre_processing folder contains scripts for specific tasks (usually run once).