For example: the above search config file would search for tweets mentioning snow or rain that have been geotagged in Finland and that are NOT retweets. The time window from which these tweets are searched is defined when giving the command (see below). The parameters would return a maximum of 500 results per call and a maximum of 100 000 tweets in total. The resulting files would be saved with the prefix `my_weather_search`, and one file would contain a maximum of 1 000 000 tweets. If you want to set up a daily collection, remove `start_time` and `end_time` from the config; the script will then collect tweets from yesterday (i.e. the day before the current day).
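As an illustration, such a config might look roughly like the sketch below. The exact schema is not shown in this section, so the field names under `output_params` are assumptions based on the description above; the query string uses standard Twitter API v2 search syntax:

```yaml
# Hypothetical sketch of a search config -- field names are assumed
# from the prose above, not taken from the actual schema
query: "(snow OR rain) place_country:FI -is:retweet"
start_time: "2021-05-01"
end_time: "2021-05-28"
output_params:
  max_results: 500           # results per API call
  total_tweets: 100000       # overall cap for the whole search
  file_prefix: my_weather_search
  tweets_per_file: 1000000   # tweets per output file
```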
## Usage
#### Time period collecting
Then just navigate to the cloned repository directory on your local machine and run the collection command. After a while you should start accumulating pickled dataframes (`.pkl` files), one per date, so if you're requesting a full year you'll get 365 files. The `-w` flag indicates the wait time in seconds if the rate limit is reached. The `iterative` style (set with the `-s` flag) is good for queries returning large amounts of tweets for each day (e.g. all geotagged tweets within Finland); the resulting files can be combined into one file with the `combine_tweets.py` script. For queries returning small per-day tweet amounts, use the `bulk` style instead, and you will get just one `.pkl` file.
Output files are by default pickled pandas dataframes (`.pkl`). They can be read back into Python with the [Pandas](https://pandas.pydata.org/) library for further processing. Saving to `.csv` files is also supported, but fields containing data types like `list` and `dict` objects will be converted to plain text. The flags stand for `sd` = start date, `ed` = end date, `o` = output file format, `w` = wait time in seconds (only for the `iterative` style), and `s` = style. The wait time is there to be used if you think you're going to hit the Twitter rate limits when downloading tweets with `iterative`, for example when downloading a full year of geotagged tweets from Finland. *Please note that the end date **IS NOT** collected: the collection stops at 23:59:59 on the previous date, in the example case on the 28th of May at 23:59:59.*
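Reading a collected file back for further processing is a one-liner with Pandas. A minimal sketch (the file name below is invented for illustration, not produced by the tool):

```python
import pandas as pd

# Stand-in for a collected tweet file -- in practice this would be
# one of the .pkl files produced by a collection run
sample = pd.DataFrame({"id": [1, 2], "text": ["snow today", "rain later"]})
sample.to_pickle("my_weather_search_example.pkl")

# Read the pickled dataframe back; complex data types (lists, dicts)
# are preserved, unlike with the .csv output option
tweets = pd.read_pickle("my_weather_search_example.pkl")
print(tweets.shape[0])  # 2
```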
#### Timeline collecting
Use the timeline collection command to collect all tweets by a set of users from a specific time period. This requires a CSV file with all user ids under a column named `usr_id`. The chunk flag (`-c`) indicates how many users' tweets should go into one `.csv` or `.pkl` file; the default is 20.
Please note that this uses the `all` endpoint of the Twitter API v2 rather than the `user timeline` endpoint, which only allows collecting the 3 200 most recent tweets.
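The required input is just a CSV file with a `usr_id` column. A minimal sketch of creating one with Pandas (the ids below are made up):

```python
import pandas as pd

# Hypothetical user ids -- replace with the accounts you want to collect
user_ids = pd.DataFrame({"usr_id": [123456789, 987654321]})
user_ids.to_csv("user_ids.csv", index=False)

# Verify the file has the column name the tool expects
check = pd.read_csv("user_ids.csv")
print(list(check.columns))  # ['usr_id']
```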
#### Bounding box collecting
This collection method requires a bounding box geopackage file, generated with the `mmqgis` plugin in QGIS. Please note that the bounding box can *not* be larger than 25 miles by 25 miles.
If you are collecting a longer time period from a popular place (like NYC, London, Sydney, etc.), please use a larger interval number (`-in`). This makes your collection run faster, hit rate limits less often, and be less likely to run out of memory. For instance, a 25 by 25 mile box over a popular place during Twitter's heyday (2014–2017) will easily return more than 150 000 tweets per month.
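A quick way to sanity-check that your bounding box stays under the 25-mile limit is to measure its sides with the haversine formula. This is just a standalone sketch, not part of the tool, and the example coordinates (roughly central Helsinki) are made up:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Bounding box corners (min_lon, min_lat, max_lon, max_lat)
min_lon, min_lat, max_lon, max_lat = 24.90, 60.15, 25.00, 60.20
width = haversine_miles(min_lat, min_lon, min_lat, max_lon)   # east-west side
height = haversine_miles(min_lat, min_lon, max_lat, min_lon)  # north-south side
print(width < 25 and height < 25)  # True for this small box
```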
#### Converting to geopackage
If you downloaded with the `iterative` style, you might want to combine the pickled dataframes into one big file. You can do this with `combine_tweets.py`, run in the directory where you have the `.pkl` files. It supports saving to a [GeoPackage](https://www.geopackage.org/) file (a common spatial file format, like a shapefile), a pickled Pandas dataframe, and a plain CSV file. Combining tweets from `.csv` files hasn't been implemented yet, as `.csv` files do not retain data types.
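Conceptually, combining the per-day pickles amounts to concatenating them into one dataframe. A rough sketch of that idea with Pandas (this is not the actual `combine_tweets.py` implementation, and the file names are invented):

```python
import glob

import pandas as pd

# Two toy per-day files standing in for iterative-style output
pd.DataFrame({"id": [1], "text": ["snow"]}).to_pickle("tweets_2021-05-01.pkl")
pd.DataFrame({"id": [2], "text": ["rain"]}).to_pickle("tweets_2021-05-02.pkl")

# Concatenate every per-day pickle into a single dataframe and save it
frames = [pd.read_pickle(path) for path in sorted(glob.glob("tweets_*.pkl"))]
combined = pd.concat(frames, ignore_index=True)
combined.to_pickle("combined_tweets.pkl")
print(len(combined))  # 2
```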
# Known issues
This tool is in the very early stages of development, and issues can arise when downloading very small datasets. Use the `bulk` option for small datasets.
This tool has been tested on Linux (specifically Ubuntu 18.04, 20.04, 22.04, and Manjaro 21.0.3). It is confirmed to work on macOS Catalina+ and Windows 10.
Please report further issues and/or submit pull requests with fixes.