This repository contains the scripts used in the article "Mapping urban linguistic diversity with social media and population register data" published in Computers, Environment and Urban Systems.

Data requirements

  • Twitter and Instagram data from the Helsinki Metropolitan Area from the year 2015.
    • The Instagram data is legacy data that we collected in 2016, before the API was closed down. The Twitter data can be collected with tweetsearcher.
  • Statistics Finland 250 m grid database from the year 2015
  • Individual-level first language information from 2015, aggregated into a 250 m grid
  • Dynamic population data from Helsinki Metropolitan Area by Bergroth et al. (2022) from here

Pre-analysis steps

  • Language detection for the social media data was done with fastText, using scripts from Hiippala et al. (2020).
  • Linguistic diversity of the register data was calculated in FIONA, the Statistics Finland secure environment, with a script similar to neighborhood_diversities.py.
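
As a rough illustration of what a per-cell linguistic diversity calculation can look like, the sketch below computes Shannon entropy over the detected languages in each grid cell. This is an assumption for illustration only: the README does not state which diversity metrics neighborhood_diversities.py computes, and the function name, cell identifiers, and language codes here are hypothetical.

```python
import math
from collections import Counter


def shannon_diversity(languages):
    """Shannon entropy (natural log) of a sequence of language labels.

    Hypothetical helper; not taken from the repository's scripts.
    """
    counts = Counter(languages)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum -p * ln(p) over the share p of each language in the cell.
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


# Hypothetical input: detected language codes of posts, grouped by grid cell.
cell_posts = {
    "cell_a": ["fi", "fi", "sv", "en"],
    "cell_b": ["fi", "fi", "fi"],
}

diversities = {cell: shannon_diversity(langs) for cell, langs in cell_posts.items()}
```

A cell where every post is in the same language gets a diversity of zero, and the value grows as languages become more numerous and more evenly represented.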

Suggested running order of scripts

| Step | Script | Description | Input | Output |
| --- | --- | --- | --- | --- |
| 1 | combine_twitter_insta.py | Combines Twitter and Instagram data | Instagram and Twitter point features geopackage | Twitter-Instagram combined point features geopackage |
| 2 | neighborhood_diversities.py | Calculates linguistic diversities across times of day | Output from step 1 and grid database | Grid database with diversity metrics |
| 3 | neighborhood_diversities_no_timeofday.py | Calculates linguistic diversities for social media as a whole | Output from step 1 and grid database | Grid database with diversity metrics |
| 4 | clean_socioeco.py | Cleans socio-economic grid database | Raw RTK database file | Cleaned grid database |
| 5 | join_dynpop.py | Joins dynamic population to output from step 4 | Dynamic population data and output from step 4 | Grid database with dynamic population |
| 6 | join_regdiv_to_socioeco.py | Joins register-based diversity metrics to the grid database from step 5 | Register data with linguistic diversity metrics and output from step 5 | Grid database |
| 7 | user_langprofiles.py | Calculates social media user linguistic profiles | Output from step 1 | LaTeX-formatted table |
| 8 | moran_cluster.py | Calculates clusters in register and social media grid data | Outputs from steps 6 and 3 | Geopackage with clusters |
| 9 | stability_socioeco.py | Classifies social media clusters based on temporal stability | Output from step 8 | Geopackage with stability classifications |
| 10 | extract_high_clusters.py | Extracts significant high linguistic diversity clusters | Output from step 9 | Geopackage with high diversity clusters |
| 11 | kde_plot.py | Plots linguistic diversity across times of day and the register data | Outputs from steps 6 and 3 | PNG file |
| 12 | regression_ols_timeofday.py | Performs the OLS regression analysis | Outputs from steps 6 and 3 | Model files, VIF dataframes, error plots |
| 13 | SLM regression in GeoDa | Runs the SLM regression | Outputs from steps 6 and 3 | SLM model summaries |
| 14 | OLS_summaries.py | Prints OLS summaries | OLS model files from step 12 | LaTeX-formatted OLS summaries |
| 15 | get_coef_table.py | Prints coefficient table from regression analyses | Output from step 14 | LaTeX-formatted table |
| 16 | read_vif_files.py | Prints VIF test scores | VIF dataframes from step 12 | LaTeX-formatted table |
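
The running order can also be written down as a dependency map, which makes it easy to check that each script runs only after the scripts whose output it reads. The sketch below is not part of the repository; the helper name is hypothetical, the dependencies are transcribed from the Input column of the table, and the manual GeoDa step is left out because it is not a script.

```python
from graphlib import TopologicalSorter

# script -> scripts whose output it reads (transcribed from the table).
# Insertion order follows the suggested running order.
DEPENDS_ON = {
    "combine_twitter_insta.py": [],
    "neighborhood_diversities.py": ["combine_twitter_insta.py"],
    "neighborhood_diversities_no_timeofday.py": ["combine_twitter_insta.py"],
    "clean_socioeco.py": [],
    "join_dynpop.py": ["clean_socioeco.py"],
    "join_regdiv_to_socioeco.py": ["join_dynpop.py"],
    "user_langprofiles.py": ["combine_twitter_insta.py"],
    "moran_cluster.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "stability_socioeco.py": ["moran_cluster.py"],
    "extract_high_clusters.py": ["stability_socioeco.py"],
    "kde_plot.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "regression_ols_timeofday.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "OLS_summaries.py": ["regression_ols_timeofday.py"],
    "get_coef_table.py": ["OLS_summaries.py"],
    "read_vif_files.py": ["regression_ols_timeofday.py"],
}

# Dicts preserve insertion order, so this is the order given in the table.
SUGGESTED_ORDER = list(DEPENDS_ON)


def respects_dependencies(order, depends_on):
    """Return True if every script appears after all of its inputs."""
    position = {name: i for i, name in enumerate(order)}
    return all(
        position[dep] < position[script]
        for script, deps in depends_on.items()
        for dep in deps
    )


# Any topological order of the map is also a valid running order.
derived_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Calling `respects_dependencies(SUGGESTED_ORDER, DEPENDS_ON)` confirms that the suggested order is consistent with the listed inputs; independent branches (for example, steps 1 to 3 and steps 4 to 5) could be run in either order.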

Note on analysis

The SLM analysis was conducted in GeoDa with first-order queen contiguity neighborhoods.
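
For readers unfamiliar with the term, first-order queen contiguity treats two cells as neighbors when they share an edge or a corner. The sketch below illustrates this for a regular grid whose cells are identified by hypothetical (row, column) indices. It is only an illustration of the concept: GeoDa derives contiguity from the polygon geometries, and the function name here is an assumption.

```python
def queen_neighbors(cells):
    """Map each (row, col) cell to its first-order queen neighbors.

    Only cells present in the input are returned as neighbors, so gaps
    in the grid (cells without data) are handled naturally.
    """
    cell_set = set(cells)
    # The eight surrounding offsets: shared edges and shared corners.
    offsets = [
        (dr, dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    return {
        (r, c): sorted(
            (r + dr, c + dc)
            for dr, dc in offsets
            if (r + dr, c + dc) in cell_set
        )
        for (r, c) in cell_set
    }


# Hypothetical 3 x 3 block of grid cells.
grid = [(r, c) for r in range(3) for c in range(3)]
neighbors = queen_neighbors(grid)
```

In this 3 x 3 example the center cell has eight neighbors, an edge cell has five, and a corner cell has three.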