File Structure and Debugging
The current file structure of a fully set up STACK instance looks as follows:

```
/stack
    /app
        /facebook
            __init__.py
            collect.py
            facebook.py
            insert.py
            process.py
        /static
            ...
        /templates
            ...
        /twitter
            __init__.py
            mongoBatchInsert.py
            platform.ini
            preprocess.py
            ThreadedCollector.py
            tweetprocessing.py
            tweetstream.py (deprecated)
        __init__.py
        controller.py
        decorators.py
        forms.py
        models.py
        processes.py
        tasks.py
        views.py
    /data
        /[project-name]-[project-id]
            /facebook
                /archive
                    ...
                /error
                    ...
                /queue
                    ...
                /raw
                    ...
            /twitter
                /archive
                    ...
                /insert_queue
                    ...
                /raw
                    ...
    /out
        /[project-name]-[project-id]
            /logs
                ...
            /std
                ...
    /pid
        ...
    .gitignore
    __main__.py
    config.py
    install
    INSTALL.md
    LICENSE
    README.md
    requirements.txt
    run.py
    run.wsgi
    license.txt
```
The directories marked with '...' contain important log files that researchers can reference. See below for details on each.
The top-level /stack directory simply contains the setup.py and command-line wrapper script __main__.py.
The /app directory contains the Controller and DB wrappers that interface between the user on the command line and the backend processes that collect and process data. Learn more in Interacting w/ STACK.
The 'out' directory contains out, in, and error output text files. Since STACK processes run as daemons, anything 'printed' to the console is logged here. You can review console output and error messages for collector, processor, and inserter processes via these files.
Collector Outfile Naming Conventions:
[project_name]-[collector_name]-collector-[network]-[api]-in|out|err-[collector_id].txt
Ex.:
test-testcollector-collector-twitter-track-in-1234.txt
WHERE
- project_name - test
- collector_name - testcollector
- network - twitter
- api - track
- collector_id - 1234
Processor/Inserter Outfile Naming Conventions:
[project_name]-processor|inserter-[network]-in|out|err-[project_id].txt
Ex.:
test-processor-twitter-out-1234.txt
WHERE
- project_name - test
- network - twitter
- project_id - 1234
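As a rough illustration of how these outfile names break apart, here is a minimal sketch (not part of STACK's codebase) that splits a collector outfile name into its fields. It assumes the project and collector names contain no hyphens; processor/inserter names could be parsed analogously.

```python
# Hypothetical helper: split a collector outfile name into its fields.
# Assumes project_name and collector_name contain no hyphens themselves.
def parse_collector_outfile(filename):
    stem = filename.rsplit('.', 1)[0]          # drop the .txt extension
    parts = stem.split('-')
    project_name, collector_name, _, network, api, stream, collector_id = parts
    return {
        'project_name': project_name,
        'collector_name': collector_name,
        'network': network,
        'api': api,
        'stream': stream,                      # one of: in, out, err
        'collector_id': collector_id,
    }

info = parse_collector_outfile('test-testcollector-collector-twitter-track-in-1234.txt')
# info['network'] == 'twitter', info['api'] == 'track', info['collector_id'] == '1234'
```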
The /twitter directory contains the scripts called by collectors, processors, and inserters for the Twitter module when scraping from the Streaming API.
The 'logs' directory contains log files for all Twitter collection, processing, and insertion processes. The information logged here, and the naming conventions used, closely mirror those of the '/stack/stack/out' directory detailed above.
Collector Log File Naming Conventions:
[project_name]-[collector_name]-[api]-collector-log-[collector_id].log
Ex.:
test-testcollector-track-collector-log-1234.log
WHERE
- project_name - test
- collector_name - testcollector
- api - track
- collector_id - 1234
Processor/Inserter Log File Naming Conventions:
[project_name]-processor|inserter-log-[project_id].log
Ex.:
test-processor-log-1234.log
WHERE
- project_name - test
- project_id - 1234
These three directories store raw data files during collection, after collection, and prior to insertion. Each account and/or collector has its own unique directory, so, much like the unique file naming conventions, unique directory naming conventions apply here as well (see below).
Stores all raw data files during the collection process. Once processed, these files are moved to the '/tweet_archive' directory.
Directory Naming Convention:
/raw_tweets_[project_id]
Ex.:
/raw_tweets_1234
Raw Tweet File Naming Convention:
[yyyymmdd]-[hour]-[collector_name]-[project_id]-[collector_id]-tweets_out.json
Ex.:
20150101-15-test-1234-4567-tweets_out.json
WHERE
- date - 1/1/2015
- hour - 3 PM (15h)
- collector_name - test
- project_id - 1234
- collector_id - 4567
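The timestamped filename above can be assembled directly from a datetime. The following is a sketch (not STACK's actual code) of a hypothetical helper that builds a raw tweet filename for a given hour:

```python
from datetime import datetime

# Hypothetical helper: build a raw tweet filename following the
# [yyyymmdd]-[hour]-[collector_name]-[project_id]-[collector_id] convention.
def raw_tweet_filename(collector_name, project_id, collector_id, now=None):
    now = now or datetime.utcnow()
    stamp = now.strftime('%Y%m%d-%H')          # [yyyymmdd]-[hour]
    return f'{stamp}-{collector_name}-{project_id}-{collector_id}-tweets_out.json'

raw_tweet_filename('test', '1234', '4567', datetime(2015, 1, 1, 15))
# → '20150101-15-test-1234-4567-tweets_out.json'
```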
The '/tweet_archive' directory contains raw data files that have been processed. These files are stored permanently, even after insertion. The '/insert_queue' directory, by contrast, holds files only while they await insertion; they are deleted upon insertion. The naming conventions for both are the same.
Directory Naming Convention:
/tweet_archive_[project_id]
OR
/insert_queue_[project_id]
Processed Tweet File Naming Convention:
[yyyymmdd]-[hour]-[collector_name]-[project_id]-[collector_id]-tweets_out_processed.json
Ex.:
20150101-15-test-1234-4567-tweets_out_processed.json
WHERE
- date - 1/1/2015
- hour - 3 PM (15h)
- collector_name - test
- project_id - 1234
- collector_id - 4567
These two directories are used to store any tweets that could not be processed ('/error_tweets') or inserted into the MongoDB data storage database ('/error_inserted_tweets').
Error File Naming Convention:
error_tweet-[project_name]-[project_id].txt
OR
error_inserted_tweet-[project_name]-[project_id].txt
Ex.:
error_tweet-test-1234.txt
WHERE
- project_name - test
- project_id - 1234
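When debugging a specific project, it can help to compute where its error files should live. This is a minimal sketch (a hypothetical helper, not part of STACK), assuming the /stack/stack/twitter install path referenced throughout this page:

```python
import os

# Hypothetical helper: compute the expected error-file paths for a project,
# following the naming conventions above.
def error_file_paths(project_name, project_id, base='/stack/stack/twitter'):
    return [
        os.path.join(base, 'error_tweets',
                     f'error_tweet-{project_name}-{project_id}.txt'),
        os.path.join(base, 'error_inserted_tweets',
                     f'error_inserted_tweet-{project_name}-{project_id}.txt'),
    ]
```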
Many types of errors can occur when collecting social data, whether with an API connection or during processing and insertion. Below we detail the errors we have encountered, how our application handles them, and how to debug each.
In short, we maintain and adapt our connection for the following errors:
- SSL Errors
- Processing Exceptions
- Stream Rate Limiting
- 420 Rate Limits & 503 Service Unavailable error codes
For all other errors, we are required to disconnect in full:
- Timeout Errors
- Non-420/503 error codes
- Twitter disconnect messages
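The recover-versus-disconnect policy above can be summarized as a simple lookup. The sketch below is illustrative only; the error labels are hypothetical names, not STACK identifiers:

```python
# Illustrative policy table: which error classes we recover from, and
# which force a full disconnect. Labels are hypothetical, not STACK APIs.
RECOVERABLE = {'ssl_error', 'processing_exception', 'stream_rate_limit',
               'http_420', 'http_503'}

def next_action(error):
    if error in RECOVERABLE:
        return 'reconnect'        # keep or re-establish the stream
    return 'disconnect'           # timeouts, other HTTP codes, disconnect msgs

next_action('http_503')   # → 'reconnect'
next_action('http_401')   # → 'disconnect'
```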
Stream Connection Errors - Errors that sometimes occur when maintaining a persistent HTTP connection to a social API.
- Timeout Errors: Occur when our HTTP connection times out for an unknown reason
  - Solution - We need to disconnect and shut down
  - Debugging - Timeout errors are logged in the log files in the /stack/stack/twitter/logs directory
- SSL Errors: A security issue with our connection
  - Solution - Decrementally attempt to reconnect until successful
  - Debugging - SSL errors are logged in the /stack/stack/twitter/logs directory
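A retry loop of this shape, waiting longer after each failed attempt, could be sketched as follows (a minimal illustration, not STACK's implementation; `connect` stands in for whatever opens the stream connection):

```python
import time

# Illustrative reconnect loop: retry on SSL/socket errors, waiting longer
# after each failure, up to a ceiling. `connect` is a placeholder callable.
def reconnect_with_backoff(connect, start=0.25, cap=16.0):
    delay = start
    while True:
        try:
            return connect()               # success: hand back the stream
        except OSError:                    # ssl.SSLError is a subclass
            time.sleep(delay)
            delay = min(delay * 2, cap)    # back off, capped at `cap` seconds
```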
Processing Exceptions: Raw data cannot be processed successfully for some reason, often due to an encoding glitch.
- Raw Data: Raw data from the social API cannot be written to a raw data file
  - Solution - The raw data is logged in the /stack/stack/twitter/logs directory
- Processing/Insertion Error: An item in a raw data file cannot be processed and/or inserted into the database.
  - Solution - The item is written to an error file (/stack/stack/twitter/error_tweets|error_inserted_tweets)
Twitter Error Messages: An error occurs with our connection to the Twitter Streaming API. You can review Twitter's error codes here.
- Stream Rate Limiting: Occurs when a given collection exceeds the maximum amount of data that can be retrieved through free use of the Streaming API (1% of total Twitter traffic).
  - Solution - Continue as normal, and log the number of tweets missed
  - Debugging - Logged counts are in the MongoDB project config database
- Rate Limiting / Service Unavailable: Occurs when our connection is temporarily paused (rate limited) or unavailable.
  - Solution - Decrementally retry our connection until successful
  - Debugging - Each retry is logged in both the log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)
- All Other Error Codes: All other error codes are unfortunately fatal and require a full disconnect
  - Solution - Disconnect in full from the Streaming API
  - Debugging - The disconnect process is logged in both the log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)
Twitter Disconnect Messages: Sometimes Twitter explicitly asks us to disconnect from the API. Disconnect messages are listed here.
- Solution - Disconnect in full from the Streaming API
- Debugging - The disconnect process is logged in both the log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)