
File Structure and Debugging

Billy Ceskavich edited this page Dec 28, 2015 · 3 revisions

File Structure

The current file structure of a fully set-up STACK instance looks as follows:

/stack
  /app
    /facebook
      __init__.py
      collect.py
      facebook.py
      insert.py
      process.py
    /static
      ...
    /templates
      ...
    /twitter
      __init__.py
      mongoBatchInsert.py
      platform.ini
      preprocess.py
      ThreadedCollector.py
      tweetprocessing.py
      tweetstream.py (deprecated)
    __init__.py
    controller.py
    decorators.py
    forms.py
    models.py
    processes.py
    tasks.py
    views.py
  /data
    /[project-name]-[project-id]
      /facebook
        /archive
          ...
        /error
          ...
        /queue
          ...
        /raw
          ...
      /twitter
        /archive
          ...
        /insert_queue
          ...
        /raw
          ...
  /out
    /[project-name]-[project-id]
      /logs
        ...
      /std
        ...
      /pid
        ...
  .gitignore
  __main__.py
  config.py
  install
  INSTALL.md
  LICENSE
  README.md
  requirements.txt
  run.py
  run.wsgi
  license.txt

The directories marked with '...' contain important log files that researchers can reference. See below for details on each.

stack/

This directory contains setup.py and the command-line wrapper script __main__.py.

stack/stack/

Contains the Controller and DB wrappers that interface between the user on the command line and the backend processes that collect and process data. Learn more in Interacting w/ STACK.

stack/stack/out

The 'out' directory contains the out, in, and error output text files. Since STACK processes run as daemons, any information 'printed' to the console is logged here. You can review console output and error messages for collector, processor, and inserter processes via these files.

Collector Outfile Naming Conventions:

[project_name]-[collector_name]-collector-[network]-[api]-in|out|err-[collector_id].txt

Ex.:

test-testcollector-collector-twitter-track-in-1234.txt

WHERE

  • project_name - test
  • collector_name - testcollector
  • network - twitter
  • api - track
  • collector_id - 1234
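
As an illustration of the convention above, here is a minimal sketch that assembles a collector outfile name from its parts. The function name `collector_outfile_name` and its parameters are hypothetical and are not taken from the STACK codebase.

```python
def collector_outfile_name(project_name, collector_name, network, api,
                           stream, collector_id):
    """Build a collector outfile name following the documented convention.

    `stream` is the in|out|err part of the name. This helper is a
    hypothetical illustration, not a function from STACK itself.
    """
    if stream not in ("in", "out", "err"):
        raise ValueError("stream must be 'in', 'out', or 'err'")
    return "{0}-{1}-collector-{2}-{3}-{4}-{5}.txt".format(
        project_name, collector_name, network, api, stream, collector_id)


# Reproduces the example file name shown above.
print(collector_outfile_name("test", "testcollector", "twitter", "track",
                             "in", 1234))
```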

Processor/Inserter Outfile Naming Conventions:

[project_name]-processor|inserter-[network]-in|out|err-[project_id].txt

Ex.:

test-processor-twitter-out-1234.txt

WHERE

  • project_name - test
  • network - twitter
  • project_id - 1234
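
The processor and inserter convention can be sketched the same way. The helper below is hypothetical (its name and parameters are assumptions for illustration only) and simply joins the documented fields in order.

```python
def process_outfile_name(project_name, process, network, stream, project_id):
    """Build a processor or inserter outfile name.

    `process` is "processor" or "inserter"; `stream` is "in", "out", or
    "err". Hypothetical helper, not part of STACK itself.
    """
    if process not in ("processor", "inserter"):
        raise ValueError("process must be 'processor' or 'inserter'")
    if stream not in ("in", "out", "err"):
        raise ValueError("stream must be 'in', 'out', or 'err'")
    return "{0}-{1}-{2}-{3}-{4}.txt".format(
        project_name, process, network, stream, project_id)


# Reproduces the example file name shown above.
print(process_outfile_name("test", "processor", "twitter", "out", 1234))
```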

stack/stack/twitter

This directory contains the scripts called by collectors, processors, and inserters for the Twitter module when scraping from the Streaming API.

stack/stack/twitter/logs

The 'logs' directory contains log files for all Twitter collection, processing, and insertion processes. The information contained here and naming conventions are very similar to the information logged in the '/stack/stack/out' directory detailed above.

Collector Log File Naming Conventions:

[project_name]-[collector_name]-[api]-collector-log-[collector_id].log

Ex.:

test-testcollector-track-collector-log-1234.log

WHERE

  • project_name - test
  • collector_name - testcollector
  • api - track
  • collector_id - 1234
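
When debugging a single collector, it can help to pick its log files out of a directory listing. The sketch below filters a list of file names by the documented suffix using the standard library's `fnmatch`. The function name `collector_logs` is hypothetical, and the file names are only the examples from this page plus one made-up entry for contrast.

```python
import fnmatch


def collector_logs(filenames, collector_id):
    """Return the names that look like log files for one collector.

    Matches the documented suffix "-collector-log-[collector_id].log".
    Hypothetical helper for illustration only.
    """
    pattern = "*-collector-log-{0}.log".format(collector_id)
    return [name for name in filenames if fnmatch.fnmatch(name, pattern)]


names = [
    "test-testcollector-track-collector-log-1234.log",
    "test-processor-log-1234.log",
    "test-othercollector-track-collector-log-99.log",  # made-up entry
]
print(collector_logs(names, 1234))
```

In practice the list of names could come from `os.listdir` on the logs directory.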

Processor/Inserter Log File Naming Conventions:

[project_name]-processor|inserter-log-[project_id].log

Ex.:

test-processor-log-1234.log

WHERE

  • project_name - test
  • project_id - 1234

stack/stack/twitter/insert_queue, /raw_tweets, /tweet_archive

These three directories store raw data files during collection, after collection, and prior to insertion. Each account and/or collector has its own unique directory. Thus, similar to the unique file naming conventions, there are unique directory naming conventions in place here as well (see below).

/raw_tweets

Stores all raw data files during the collection process. Once processed, these files are removed from this directory and archived in the '/tweet_archive' directory.

Directory Naming Convention:

/raw_tweets_[project_id]

Ex.:

/raw_tweets_1234

Raw Tweet File Naming Convention:

[yyyymmdd]-[hour]-[collector_name]-[project_id]-[collector_id]-tweets_out.json

Ex.:

20150101-15-test-1234-4567-tweets_out.json

WHERE

  • date - 1/1/2015
  • hour - 3 PM (15h)
  • collector_name - test
  • project_id - 1234
  • collector_id - 4567
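
Going the other direction, a raw tweet file name can be split back into its fields. The sketch below is a hypothetical helper, not STACK code. It assumes that the project and collector IDs contain no hyphens, and it allows hyphens in the collector name by anchoring on the fixed date, hour, and suffix.

```python
import re
from datetime import datetime

# date, hour, collector name, project id, collector id, fixed suffix
RAW_NAME = re.compile(
    r"^(\d{8})-(\d{1,2})-(.+)-([^-]+)-([^-]+)-tweets_out\.json$")


def parse_raw_tweet_filename(name):
    """Split a raw tweet file name into its documented fields.

    Assumes the IDs contain no hyphens (an assumption of this sketch).
    """
    match = RAW_NAME.match(name)
    if match is None:
        raise ValueError("not a raw tweet file name: " + name)
    date, hour, collector_name, project_id, collector_id = match.groups()
    return {
        "timestamp": datetime.strptime(date, "%Y%m%d").replace(hour=int(hour)),
        "collector_name": collector_name,
        "project_id": project_id,
        "collector_id": collector_id,
    }


info = parse_raw_tweet_filename("20150101-15-test-1234-4567-tweets_out.json")
print(info["timestamp"], info["collector_name"])
```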

/tweet_archive & /insert_queue

The '/tweet_archive' directory contains raw data files that have been processed. These files are stored permanently, even after insertion. In turn, the '/insert_queue' directory contains files only when waiting to be inserted; they are deleted upon insertion. The naming conventions for both are the same.

Directory Naming Convention:

/tweet_archive_[project_id]
OR
/insert_queue_[project_id]

Processed Tweet File Naming Convention:

[yyyymmdd]-[hour]-[collector_name]-[project_id]-[collector_id]-tweets_out_processed.json

Ex.:

20150101-15-test-1234-4567-tweets_out_processed.json

WHERE

  • date - 1/1/2015
  • hour - 3 PM (15h)
  • collector_name - test
  • project_id - 1234
  • collector_id - 4567
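
Since the raw and processed names share every field and differ only in their suffix, one can be mapped to the other. The sketch below shows that mapping along with the two directory names. All three helpers are hypothetical, and whether STACK derives the processed name this way internally is an assumption.

```python
RAW_SUFFIX = "-tweets_out.json"
PROCESSED_SUFFIX = "-tweets_out_processed.json"


def processed_filename(raw_name):
    """Map a raw tweet file name to its processed counterpart."""
    if not raw_name.endswith(RAW_SUFFIX):
        raise ValueError("not a raw tweet file name: " + raw_name)
    return raw_name[:-len(RAW_SUFFIX)] + PROCESSED_SUFFIX


def archive_dir(project_id):
    """Directory name for processed files kept permanently."""
    return "tweet_archive_{0}".format(project_id)


def insert_queue_dir(project_id):
    """Directory name for processed files waiting to be inserted."""
    return "insert_queue_{0}".format(project_id)


print(processed_filename("20150101-15-test-1234-4567-tweets_out.json"))
print(archive_dir(1234), insert_queue_dir(1234))
```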

stack/stack/twitter/error_inserted_tweets, /error_tweets

These two directories are used to store any tweets that could not be processed ('/error_tweets') or inserted into the MongoDB data storage database ('/error_inserted_tweets').

Error File Naming Convention:

error_tweet-[project_name]-[project_id].txt
OR
error_inserted_tweet-[project_name]-[project_id].txt

Ex.:

error_tweet-test-1234.txt

WHERE

  • project_name - test
  • project_id - 1234
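
The error file names follow the same pattern and can be built with a small helper. The function below is a hypothetical illustration of the convention, not STACK's own code.

```python
def error_file_name(kind, project_name, project_id):
    """Build an error file name.

    `kind` is "error_tweet" (processing failures) or
    "error_inserted_tweet" (insertion failures). Hypothetical helper.
    """
    if kind not in ("error_tweet", "error_inserted_tweet"):
        raise ValueError("unknown error file kind: " + kind)
    return "{0}-{1}-{2}.txt".format(kind, project_name, project_id)


# Reproduces the example file name shown above.
print(error_file_name("error_tweet", "test", 1234))
```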

Debugging

Many types of errors can occur when collecting social data, whether with an API connection or during processing and insertion. Below we detail the errors we have encountered, how our application handles them, and how to debug each.

In short, we maintain and adapt our connection for the following errors:

  • SSL Errors
  • Processing Exceptions
  • Stream Rate Limiting
  • 420 Rate Limits & 503 Service Unavailable error codes

For all other errors, we are required to disconnect in full:

  • Timeout Errors
  • Non-420/503 error codes
  • Twitter disconnect messages
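
For HTTP status codes, the split between the two lists above comes down to a simple membership check. The sketch below illustrates that rule only. The names `RETRYABLE_STATUS_CODES` and `should_stay_connected` are hypothetical and do not come from the STACK codebase.

```python
# Status codes the collector retries on, per the lists above.
RETRYABLE_STATUS_CODES = {420, 503}


def should_stay_connected(status_code):
    """Return True for a retryable error code, False for a fatal one.

    Only meaningful for error codes: 420 (rate limited) and 503
    (service unavailable) are retried; every other error code leads
    to a full disconnect.
    """
    return status_code in RETRYABLE_STATUS_CODES


for code in (420, 503, 401):
    action = "retry" if should_stay_connected(code) else "disconnect"
    print(code, action)
```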

Stream Connection Errors - Errors that sometimes occur when maintaining a persistent HTTP connection to a social API.

  • Timeout Errors: Occurs when our HTTP connection times out for an unknown reason

    • Solution - We need to disconnect and shut down
    • Debugging - Timeout errors are logged in the log files in the /stack/stack/twitter/logs directory
  • SSL Errors: A security issue with our connection

    • Solution - Decrementally attempt to reconnect until successful
    • Debugging - SSL errors are logged in the /stack/stack/twitter/logs directory

Processing Exceptions: Raw data cannot be processed successfully for some reason, often due to an encoding glitch.

  • Raw Data: Raw data from the social API cannot be written to a raw data file

    • Solution - Raw data is logged in the /stack/stack/twitter/logs directory
  • Processing/Insertion Error: An item in a raw data file cannot be processed and/or inserted into the database.

    • Solution - The item is written to an error file (/stack/stack/twitter/error_tweets|error_inserted_tweets)

Twitter Error Messages: An error occurs with our connection to the Twitter Streaming API. You can review Twitter's connection error codes here.

  • Stream Rate Limiting: Occurs when a given collection exceeds the maximum amount of data that can be retrieved through free use of the Streaming API (1% of total Twitter traffic).

    • Solution - Continue as normal, and log the number of tweets missed
    • Debugging - Logged counts are in the MongoDB project config database
  • Rate Limiting / Service Unavailable: Occurs when our connection is temporarily paused (rate limited) or unavailable.

    • Solution - Decrementally retry our connection until successful
    • Debugging - Each retry is logged in both log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)
  • All Other Error Codes: All other error codes are unfortunately fatal and require a full disconnect

    • Solution - Disconnect in full from the Streaming API
    • Debugging - Disconnect process is logged in both log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)
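
The retry-or-disconnect behavior for status codes can be sketched as a small loop. This is not STACK's implementation: the `connect` callable, the delay schedule, and the cap on the number of retries are all assumptions made for illustration, since this page does not specify the retry intervals.

```python
import time

RETRYABLE_STATUS_CODES = {420, 503}


def connect_with_retries(connect, delays=(5, 10, 20), sleep=time.sleep):
    """Call `connect` and retry while it returns a retryable status.

    `connect` is a hypothetical callable returning an HTTP status code.
    `delays` (seconds between attempts) is an assumed schedule.
    Returns the last status seen; anything other than 200 means the
    caller should disconnect in full.
    """
    status = connect()
    for delay in delays:
        if status not in RETRYABLE_STATUS_CODES:
            break  # success, or a fatal code that must not be retried
        sleep(delay)
        status = connect()
    return status


# Simulated connection: rate limited, then unavailable, then OK.
responses = iter([420, 503, 200])
waits = []
final = connect_with_retries(lambda: next(responses), sleep=waits.append)
print(final, waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.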

Twitter Disconnect Messages: Sometimes Twitter explicitly asks us to disconnect from the API. Disconnection messages are listed here.

  • Solution - Disconnect in full from the Streaming API
  • Debugging - Disconnect process is logged in both log files (/stack/stack/twitter/logs) and the daemon out files (/stack/stack/out)
