Commit a21c727

Merge pull request #3 from DigitalGeographyLab/develop
Merging new features into main branch. The only conflict was in README.md, which has been resolved.
2 parents f6f257d + 0fc80dc commit a21c727

5 files changed

Lines changed: 660 additions & 17 deletions

File tree

README.md

Lines changed: 23 additions & 3 deletions
@@ -68,12 +68,14 @@ output_params:

For example: the above search config file would search for tweets mentioning snow or rain that have been geotagged in Finland and which are NOT retweets. The time window from which these tweets are searched is defined when giving the command (see below). The parameters would return a maximum of 500 results per call and a maximum of 100,000 tweets overall. The resulting file would be saved with the prefix `my_weather_search`, and one file would contain a maximum of 1,000,000 tweets. If you want to set up a daily collection, remove `start_time` and `end_time` from the config; the script will then collect tweets from yesterday (i.e. the day before the current day).
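For reference, a config along these lines would produce the search described above. This is an illustrative sketch only: the section and key names follow the usual `searchtweets-v2` config conventions, and anything not stated in the text above (in particular the query itself) is a placeholder to adapt:

```
search_rules:
    query: (snow OR rain) place_country:FI -is:retweet  # hypothetical query
search_params:
    results-per-call: 500
    max-tweets: 100000
output_params:
    filename_prefix: my_weather_search
    results_per_file: 1000000
```

Add `start_time` and `end_time` entries for a fixed time window, or leave them out for the daily-collection behaviour described above.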

## Usage

#### Time period collecting

Then just navigate to the cloned repository directory on your local machine and type:
```
python v2_tweets_to_file.py -sd 2020-04-28 -ed 2020-05-29 -o pkl -w 45 -s iterative
```
and after a while you should start accumulating pickled dataframes (`.pkl` files), one per date, so if you're requesting a full year you'll get 365 files. The `-w` flag indicates the wait time in seconds if the rate limit is reached. The `iterative` style (the `-s` flag) is good for queries returning large amounts of tweets for each day (e.g. all geotagged tweets within Finland). The resulting files can be combined into one file with the `combine_tweets.py` script. For queries returning small per-day tweet amounts, use the `bulk` style by typing:

```
python v2_tweets_to_file.py -sd 2020-05-27 -ed 2020-05-29 -o pkl -s bulk
```
@@ -82,6 +84,24 @@ and you will get just one `.pkl` file. Please note that this `bulk` option is su

Output files are by default pickled pandas dataframes (`.pkl`). They can be read into Python with the [Pandas](https://pandas.pydata.org/) library for further processing. Saving to `.csv` files is also supported, but some fields containing data types like `list` and `dict` objects will be converted to plaintext. The flags stand for `sd` = start date, `ed` = end date, `o` = output file format, `w` = wait time in seconds (only for the `iterative` style), and `s` = style. The wait time is there to be used if you think you're going to hit the Twitter rate limits when downloading tweets with `iterative`, for example when downloading a full year of geotagged tweets from Finland. *Please note that the end date **IS NOT** collected; the collection stops at 23:59:59 on the previous date, in the example case on the 28th of May at 23:59:59.*
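As a quick illustration of the difference between the two output formats, here is a minimal sketch (with made-up column names and values) showing that pickled dataframes keep `list`-typed fields intact while a CSV round-trip flattens them to plain text:

```python
import pandas as pd

# tiny stand-in for a collected tweet dataframe (hypothetical columns)
df = pd.DataFrame({
    "id": [1, 2],
    "text": ["light snow in Helsinki", "rain in Turku"],
    "geo.bbox": [[24.8, 60.1, 25.1, 60.3], [21.0, 60.0, 23.0, 61.0]],
})

# a pickle round-trip keeps Python data types intact
df.to_pickle("sample_tweets.pkl")
restored = pd.read_pickle("sample_tweets.pkl")
print(type(restored.loc[0, "geo.bbox"]))  # <class 'list'>

# a csv round-trip flattens the same field to plain text
df.to_csv("sample_tweets.csv", index=False)
from_csv = pd.read_csv("sample_tweets.csv")
print(type(from_csv.loc[0, "geo.bbox"]))  # <class 'str'>
```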

#### Timeline collecting

Use the following command to collect all tweets by users from a specific time period. This requires a csv file with all user ids under a column named `usr_id`. The chunk flag (`-c`) indicates how many users' tweets should be in one `.csv` or `.pkl` file. The default is 20.

```
python timeline_tweets_to_file.py -ul /path/to/list.csv -sd YEAR-MO-DA -ed YEAR-MO-DA -o pkl -op ~/path/to/folder/ -c 50
```

Please note that this uses the `all` endpoint of the Twitter API v2 and not the `user timeline` endpoint, which only allows collecting the 3,200 most recent tweets.
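The user list can be assembled with pandas, for instance. A minimal sketch, where the user ids and the filename are made up and only the `usr_id` column name is required by the script:

```python
import pandas as pd

# hypothetical user ids whose timelines should be collected
users = pd.DataFrame({"usr_id": [12345678, 87654321, 11223344]})

# write the list in the format timeline_tweets_to_file.py expects
users.to_csv("list.csv", index=False)
print(pd.read_csv("list.csv").columns.tolist())  # ['usr_id']
```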
#### Bounding box collecting

This collection method requires a bounding box geopackage file that has been generated with the `mmqgis` plugin in QGIS. Please note that the bounding box cannot be larger than 25 miles by 25 miles. Run this collection method with the following command:

```
python bbox_tweets_to_file.py -sd YEAR-MO-DA -ed YEAR-MO-DA -w 15 -in 20 -b /path/to/bbox.gpkg -o path/to/results/
```
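Under the hood, the script reads the corner coordinates of each grid cell in the geopackage and formats them into a Twitter API v2 `bounding_box` search operator, with the southwest corner first and the northeast corner second. A self-contained sketch of that query construction (the coordinates here are made up):

```python
# corner coordinates of one hypothetical grid cell (WGS-84 decimal degrees)
west, south, east, north = 24.78, 60.10, 25.25, 60.31

# southwest corner first, then northeast; retweets, quotes and replies excluded
search_q = (f'bounding_box:[{west:.5f} {south:.5f} {east:.5f} {north:.5f}] '
            '-is:retweet -is:quote -is:reply')
print(search_q)
```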
If you are collecting a longer time period from a popular place (like NYC, London, Sydney, etc.), please use a larger interval number (`-in`). This makes your collection run faster, hit rate limits less often, and reduces the chance of running out of memory. For instance, a 25 by 25 mile box from a popular place during Twitter's heyday (2014-2017) will easily return more than 150,000 tweets per month.
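The `-in` flag divides the requested date range into equal sub-periods that are collected one after another. A quick sketch of the arithmetic the script uses (the example dates are arbitrary):

```python
from datetime import date

# an example year-long collection split into 20 sub-periods (-in 20)
start, end = date(2014, 1, 1), date(2014, 12, 31)
interval = 20

# length of one sub-period, as computed in bbox_tweets_to_file.py
diff = (end - start) / interval

# start/end boundaries of each collected sub-period
periods = [(start + diff * i, start + diff * (i + 1)) for i in range(interval)]
print(periods[0], periods[-1])
```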
#### Converting to geopackage
If you downloaded with the `iterative` style, you might want to combine the pickled dataframes into one big file. You can do this with `combine_tweets.py`. It supports saving to a [GeoPackage](https://www.geopackage.org/) file (a common spatial file format, like a shapefile), a pickled Pandas dataframe and a plain csv file. Combining tweets from `.csv` files hasn't been implemented yet, as `csv` files do not retain data types. To combine tweets, run the following command in the directory where you have the `.pkl` files:

```
@@ -103,7 +123,7 @@ If you're not interested in what bots have to say, then you have to do the clean
# Known issues

This tool is in very early stages of development, and issues can arise when downloading very small datasets. Use the `bulk` option for small datasets.

This tool has been tested on Linux (specifically Ubuntu 18.04, 20.04, 22.04, and Manjaro 21.0.3). It is confirmed to work on macOS Catalina+ and Windows 10.

Please report further issues and/or submit pull requests with fixes.

bbox_tweets_to_file.py

Lines changed: 314 additions & 0 deletions
@@ -0,0 +1,314 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 1 14:44:24 2023

INFO
####

This script downloads tweets from the full archive of Twitter using the
academic access API. It downloads geotagged tweets based on a list of bounding
boxes. The outputs are saved as pickled dataframes (.pkl).

The bounding boxes cannot be larger than 25 miles by 25 miles or the Twitter
API will not process the request. Please only use WGS-84 coordinates.

The query matches against the place.geo.coordinates object of the tweet when
present, and otherwise against a place geo polygon, where the place polygon is
fully contained within the defined region. So, you are more likely to get
coordinate-tagged than place-tagged content with this approach.

To construct a GIS file out of the results, run tweets_to_gpkg.py in the
folder containing the pickled dataframe files (.pkl files).

REQUIREMENTS
############

Files:
    .twitter_keys.yaml in the script directory
    premsearch_config.yaml in the script directory
    a bounding box grid geopackage file created with the mmqgis plugin in QGIS

Installed:
    Python 3.8 or newer

Python packages:
    searchtweetsv2
    pandas
    geopandas

USAGE
#####

Run the script by typing:

    python bbox_tweets_to_file.py -sd YEAR-MO-DA -ed YEAR-MO-DA -w 15 -in 20 -b /path/to/bbox.gpkg -o path/to/results/

Replace YEAR with the year you want, MO with the month you want and DA with the
day of the month you want.

The interval represents the number of divisions of the time period: for popular
areas it makes no sense to collect all tweets from 2009 to 2023 in one go, as
rate limits are exceeded instantly. Collect 1, 2 or 3 weeks/months/years at a
time depending on the local context. For NYC, collect using 2-week intervals,
but for an unpopulated wilderness or rural area collect 9 months to 2 years at
a time.

NOTE
####

The collector collects tweets starting from 00:00 hours on the starting day and
ends the collection at 23:59:59 on the day before the end date. In the example
above the last collected day would be 2019-06-14.

The bounding box coordinates in the query require only two (2) coordinate pairs,
the first representing the southwest corner of the bounding box, then the
northeast corner. The CRS is basic WGS-84 in decimal degrees.

This script assumes you have created the bounding box with the MMQGIS plugin
in QGIS.


### SYDNEY SPECIFIC ###
No geotagged tweets before 01.09.2010

Collection ends February 1st 2023
12 years 5 months =

@author: Tuomas Väisänen & Seija Sirkiä
"""
from util_functions import v2parser, daterange
from searchtweets import ResultStream, gen_request_parameters, load_credentials, read_config
from datetime import datetime
import geopandas as gpd
import time
import argparse
import gc

# set up the argument parser
ap = argparse.ArgumentParser()

# get starting date
ap.add_argument("-sd", "--startdate", required=True,
                type=lambda s: datetime.strptime(s, '%Y-%m-%d'),
                help="Start date of the collection in the following form: "
                     "YEAR-MO-DA, for example 2018-01-28")

# get end date
ap.add_argument("-ed", "--enddate", required=True,
                type=lambda s: datetime.strptime(s, '%Y-%m-%d'),
                help="End date of the collection in the following form: "
                     "YEAR-MO-DA, for example 2018-02-18")

# get wait time
ap.add_argument("-w", "--wait", required=False, default=15,
                help="Set wait time between requests to avoid Twitter rate limits. "
                     "Default: 15")

# get interval
ap.add_argument("-in", "--interval", required=True, default=1,
                help="Set date intervals to avoid rate limits with popular areas. "
                     "Default: 1, which is no intervals but everything all at once.")

# get bounding box geopackage
ap.add_argument("-b", "--bbox", required=True,
                help="Path to bounding box geopackage. For example: "
                     "~/Data/project/bbox.gpkg")

# get output folder
ap.add_argument("-o", "--output", required=True,
                help="Path to output folder. For example: "
                     "~/Data/project/results/")

# parse arguments
args = vars(ap.parse_args())
# get wait time and interval
waittime = int(args['wait'])
interval = int(args['interval'])

# get output path
outpath = args['output']

# load bounding box grid
bbox_df = gpd.read_file(args['bbox'], driver='GPKG')

# record bounding box order
bbox_df = bbox_df.assign(row_number=range(len(bbox_df)))

# load twitter keys
twitter_creds = load_credentials('.twitter_keys.yaml',
                                 yaml_key='search_tweets_v2',
                                 env_overwrite=False)

# load configuration for the search query
search_config = read_config('search_config.yaml')

# fields for the v2 api
tweetfields = ",".join(["attachments", "author_id", "conversation_id", "created_at",
                        "entities", "geo", "id", "in_reply_to_user_id", "lang",
                        "public_metrics", "possibly_sensitive", "referenced_tweets",
                        "reply_settings", "text", "withheld"])
userfields = ",".join(["created_at", "description", "entities", "location",
                       "name", "profile_image_url", "protected", "public_metrics",
                       "url", "username", "verified", "withheld"])
mediafields = ",".join(["media_key", "type", "url"])
placefields = ",".join(["contained_within", "country", "country_code", "full_name",
                        "geo", "id", "name", "place_type"])
expansions = ",".join(["attachments.media_keys", "author_id", "entities.mentions.username",
                       "geo.place_id", "in_reply_to_user_id", "referenced_tweets.id",
                       "referenced_tweets.id.author_id"])
# get the collection date range
start_date = args['startdate'].date()
end_date = args['enddate'].date()

# get the amount of time per date interval for looping
diff = (end_date - start_date) / interval

# loop over date intervals
for intv in range(interval):

    # get interval start date
    intstart = start_date + diff * intv

    # get interval end index
    intend_ix = intv + 1

    # get interval end date
    intend = start_date + diff * intend_ix

    # print a message about which interval is being collected
    print('[INFO] - Starting tweet collection between ' + str(intstart) + ' - ' + str(intend))

    # empty tweet list for the current interval
    tweets_interval = []

    # loop over bounding boxes
    for i, bbox in bbox_df.iterrows():

        # extract southwest corner coordinates
        west = bbox['left']
        south = bbox['bottom']

        # extract northeast corner coordinates
        north = bbox['top']
        east = bbox['right']

        # form the search query from the bounding box southwest and northeast corner coordinates
        search_q = f'bounding_box:[{west:.5f} {south:.5f} {east:.5f} {north:.5f}] -is:retweet -is:quote -is:reply'
        # generate payload rule for the v2 api
        rule = gen_request_parameters(query=search_q,
                                      results_per_call=search_config['results_per_call'],
                                      start_time=intstart.isoformat(),
                                      end_time=intend.isoformat(),
                                      tweet_fields=tweetfields,
                                      user_fields=userfields,
                                      media_fields=mediafields,
                                      place_fields=placefields,
                                      expansions=expansions,
                                      stringify=False)

        # initiate a result stream from the twitter v2 api
        rs = ResultStream(request_parameters=rule,
                          max_results=100000,
                          max_pages=1,
                          max_tweets=search_config['max_tweets'],
                          **twitter_creds)

        # number of reconnection tries
        tries = 10

        # while loop to protect against 104 errors
        while True:
            tries -= 1

            # attempt to retrieve tweets
            try:
                # indicate which interval and bounding box is being retrieved
                print('[INFO] - Searching for tweets between ' + str(intstart) + ' and ' + str(intend) + ' from bounding box ' + str(i))

                # stream the json response into a list
                tweets = list(rs.stream())

                # print the response size
                print('[INFO] - Got {} tweets from bounding box {}'.format(str(len(tweets)), str(i)))

                # check if the result size warrants a shorter or longer wait time
                if len(tweets) < 500:

                    # wait 18 seconds to avoid request bombing in case of zero or only a few tweets
                    time.sleep(18)

                else:

                    # wait the user-specified time
                    time.sleep(waittime)

                # break free from the while loop
                break

            # catch exceptions
            except Exception as err:
                if tries == 0:
                    raise err
                else:
                    print('[INFO] - Got connection error, waiting ' + str(waittime) + ' seconds and trying again. ' + str(tries) + ' tries left.')
                    time.sleep(waittime)

        # extend the current interval tweet list with tweets from the current bounding box
        tweets_interval.extend(tweets)

        # run the garbage collector to free some memory
        gc.collect()
    # collect loose garbage
    gc.collect()

    # check if there are results
    if len(tweets_interval) != 0:

        # parse results into a dataframe
        print('[INFO] - Parsing collected tweets from ' + str(intstart) + ' to ' + str(intend))
        tweetdf = v2parser(tweets_interval, search_config['results_per_call'])

        # try to order columns semantically
        try:
            tweetdf = tweetdf[['id', 'author_id', 'created_at', 'conversation_id',
                               'in_reply_to_user_id', 'text', 'lang',
                               'public_metrics.retweet_count',
                               'public_metrics.reply_count', 'public_metrics.like_count',
                               'public_metrics.quote_count', 'user.location',
                               'user.created_at', 'user.username',
                               'user.public_metrics.followers_count',
                               'user.public_metrics.following_count',
                               'user.public_metrics.tweet_count',
                               'geo.place_id', 'geo.coordinates.type',
                               'geo.coordinates.coordinates',
                               'geo.coordinates.x', 'geo.coordinates.y', 'geo.full_name',
                               'geo.name', 'geo.place_type', 'geo.country',
                               'geo.country_code', 'geo.type', 'geo.bbox',
                               'geo.centroid', 'geo.centroid.x', 'geo.centroid.y']]
        except KeyError:
            # keep the original column order if some columns are missing
            pass

        # set up the file prefix from the config
        file_prefix_w_date = search_config['filename_prefix'] + '_' + str(intstart) + '---' + str(intend)
        outpickle = file_prefix_w_date + '_part' + str(intv) + '.pkl'

        # save to pickle
        tweetdf.to_pickle(outpath + outpickle)
        print('[INFO] - Dataframe saved.')

        # collect loose garbage to free memory
        gc.collect()

    else:
        # print a message and move on to the next interval
        print('[INFO] - No geotagged tweets in bounding boxes between {} and {}. Moving on...'.format(str(intstart), str(intend)))
        gc.collect()

print('[INFO] - ... done!')
