Course: Data visualizations - Telling stories with R & ggplot
Lectured by Jan Zilinsky, PhD @ Technical University of Munich [1]
Resources:
- Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Springer. https://doi.org/10.1007/0-387-28695-0
- Data Visualization with R and ggplot https://zilinskyjan.github.io/DataViz/
Charts don't need to be ugly. But they don't need to bend the truth to be attractive, either. For critical readers, being exposed to unprofessional and, at times, misleading data visualizations in journalism and on social media has become the norm. We took this course to get better at communicating analytical insights effectively and appropriately in our work and research: staying truthful to the underlying data while presenting something insightful that might not have been obvious before.
Deciding on a data source turned out to be one of the more challenging parts. Our first attempt, geo-locating Google Timeline data (drawing heatmaps etc.) using library(sf) [2], turned out less fruitful than we had hoped. Correlating time-series tracking data (#quantified self) with information on the weather or spending behavior is notoriously difficult to implement and didn't show enough promise of yielding novel insights to justify going forward. If you're interested in doing something similar yourself, see Google Takeout and this project.
Next, we stumbled upon data on our very own travel behavior. Booking confirmations by "Deutsche Bahn", a German rail company, deliver adequately consistent data points on many aspects of a journey and could let us explore relationships between locations, pricing, and individual booking habits that are impossible to gather otherwise. The official API, though extensive, has rather little potential for application on the personal level, so we brainstormed new performance indicators and statistical calculations that might help us better understand and adjust our travel behavior. Unfortunately, making sense of both HTML emails and PDFs is easiest using an LLM. While we were able to give it a try (GPT-4o, using a Jupyter notebook), we ultimately decided to go with a case project that doesn't demand our attention in both R and Python at the same time.
In the end, we chose to analyze a large data set of X (Twitter) posts on security issues that we recently gathered in our research. Thanks to its time-series form and high number of observations, it lets us produce insights that are more relevant to the broader public. It may even prove useful for predicting the sentiment of posts on similar topics.
We included tweets about domestic and international security from January 2023 to November 2024. The list of queries is in twikit_query.md. To collect our data, we used the Twikit library [3] in Python as a free alternative to the official X API. We collected all tweets matching our queries month by month until the whole period was covered.
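The month-by-month collection can be pictured as one search string per calendar month. The following is a minimal sketch, not our actual collection script: it assumes the `since:` and `until:` search operators are used to bound each window, and the function names and placeholder query are hypothetical. The Twikit calls themselves are left out.

```python
from datetime import date


def month_windows(start, end):
    """Yield (first_day, first_day_of_next_month) for every month from start to end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        # Roll over to January of the next year after December.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield first, date(next_year, next_month, 1)
        year, month = next_year, next_month


def monthly_queries(base_query, start, end):
    """Build one search string per month (assumption: since:/until: bound the window)."""
    return [
        f"{base_query} since:{a.isoformat()} until:{b.isoformat()}"
        for a, b in month_windows(start, end)
    ]


# Placeholder query; the real queries are listed in twikit_query.md.
queries = monthly_queries("(hypothetical query)", date(2023, 1, 1), date(2024, 11, 1))
print(len(queries))
print(queries[0])
```

Each string would then be passed to the search client in turn, so a failed month can be retried without repeating the others.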
We preprocessed our data in multiple ways.
- First, we anonymized all data obtained, ensuring no personal data remained in our data set.
- Second, we removed all hashtags, links, and mentions, as they would interfere with our sentiment analysis.
- Lastly, we used the multilingual XLM-roBERTa-base sentiment model [4][5] to classify all our tweets into three categories: positive, neutral, or negative.
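The text-cleaning step can be sketched with a few regular expressions. This is an illustrative version under our own assumptions, not the exact code we ran: the patterns and the `clean_tweet` name are hypothetical, and hashtags are dropped together with their word. The classification step is omitted because it requires downloading the model.

```python
import re

# Assumed patterns; the real preprocessing may have used different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove links, mentions, and hashtags, then collapse leftover whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up example tweet.
example = "@user Neue Debatte zur #Sicherheit in Berlin https://t.co/abc123 heute"
print(clean_tweet(example))
```

Links are removed first so that a `#fragment` or `@handle` inside a URL does not leave half a link behind.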
We aim to better understand the relationships between posts and sentiment, network effects, and discourse intensity on select issues. Calculating margins and relative shares gives us a more nuanced understanding of the distributions.
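As an illustration of what relative shares mean here, the sketch below turns per-tweet sentiment labels into per-month proportions. The rows are made-up toy data and the `sentiment_shares` helper is hypothetical; our charts were built in R, so this only mirrors the arithmetic.

```python
from collections import Counter, defaultdict

LABELS = ("negative", "neutral", "positive")


def sentiment_shares(rows):
    """Map (month, label) pairs to {month: {label: share of that month's tweets}}."""
    counts = defaultdict(Counter)
    for month, label in rows:
        counts[month][label] += 1
    shares = {}
    for month, counter in counts.items():
        total = sum(counter.values())
        # Labels that never occur in a month get a share of 0.0.
        shares[month] = {label: counter[label] / total for label in LABELS}
    return shares


# Toy data, not taken from the real data set.
rows = [
    ("2023-10", "negative"), ("2023-10", "negative"),
    ("2023-10", "neutral"), ("2023-10", "positive"),
    ("2023-11", "neutral"), ("2023-11", "negative"),
]
shares = sentiment_shares(rows)
print(shares["2023-10"])
```

Comparing shares rather than raw counts keeps a high-volume month from looking more negative simply because it contains more tweets.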
After gathering our data, we expected to find:
- The discussion on X around security issues to be mostly negative due to the current geopolitical landscape and recent terrorist attacks within Germany. [6][7]
- Considerable spikes in activity and negative tweets after significant events (e.g. Hamas attacks on Israel) [8], as the data set captures primarily security issues.
- Engagement with tweets on security remaining relatively low (discourse shaped most significantly by a small group).
Important
Results are preliminary. Please do your own calculations before publishing any insights elsewhere.
This data set is highly selective due to the specific research design. Do not generalize.
More charts:
- Macro time-series
- Micro time-series
- Engagement analytics
The value of the guiding literature and of thinking hard about the data-ink ratio cannot be overstated ("Grammar" of graphics) [9]. We learned to
- define a visualization's success criteria,
- focus on the Why of a chart,
- make aesthetics a domain of our regular charting work in R,
- make use of the full potential of ggplot2 [10] and tidyverse [11],
- use Jupyter and GPT-4o to analyze HTML documents and output CSV tables.
Jannis Haendke, Sebastián Aguilar (2025)
Liked our work? Drop us a message & let's chat about the next data project!
Footnotes
1. Jan Zilinsky, https://www.janzilinsky.com/
2. Simple Features for R, https://r-spatial.github.io/sf/
3. Twikit, https://github.com/d60/twikit
4. Barbieri, F., Anke, L. E., & Camacho-Collados, J. (2021). XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2104.12250
5. Huggingface model card, https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment
6. May 31st 2024 (Mannheim), https://en.wikipedia.org/wiki/2024_Mannheim_stabbing
7. August 23rd 2024 (Solingen), https://en.wikipedia.org/wiki/2024_Solingen_stabbings
8. October 7th 2023 (Israel), https://en.wikipedia.org/wiki/7_October_Hamas-led_attack_on_Israel
9. Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Springer. https://doi.org/10.1007/0-387-28695-0
10. ggplot2 declarative graphics, https://ggplot2.tidyverse.org/
11. Tidyverse packages for data science, https://www.tidyverse.org/