
rekon307/ThreadsScraper


Threads Post Scraper

Overview

A Python-based scraper for Threads.net posts, comments, and engagement metrics using Playwright and Docker. The scraper authenticates with user-provided session cookies to collect post content, images, comments, and metrics (likes, reposts) from specified Threads accounts.

Features

  • Scrapes multiple Threads user profiles in a single run
  • Extracts post text content, images, and videos
  • Collects engagement metrics (like and repost counts)
  • Gathers comments on posts, with configurable limits
  • Uses randomized delays for respectful scraping
  • Stores data as structured JSON for easy analysis
  • Runs in headless or visible browser mode
  • Implements retry mechanisms for robust operation
  • Logs scraper activity comprehensively

Prerequisites

  • Docker and Docker Compose installed on your system
  • A valid and active Threads account (to export cookies from)
  • Basic familiarity with command line operations

Setup Instructions

Clone or Download the Repository

git clone https://github.com/yourusername/threads-scraper.git
cd threads-scraper

Cookie Extraction (Critical)

To authenticate with Threads, you need to export cookies from your logged-in browser session:

Chrome:

  1. Log into Threads.net with your account

  2. Open Chrome DevTools (F12 or Right-click > Inspect)

  3. Go to Application tab > Storage > Cookies

  4. Look for cookies under domains:

    • https://www.threads.net
    • https://www.instagram.com (may contain relevant authentication cookies)
  5. Important cookies to look for include:

    • sessionid
    • ds_user_id
    • csrftoken
    • ig_did
    • mid

    Note: The exact set may vary depending on your session.

  6. Export the full cookie objects (including name, value, domain, path, expires, httpOnly, secure, sameSite) to cookies.json

Firefox:

  1. Log into Threads.net
  2. Open Firefox DevTools (F12)
  3. Go to Storage tab > Cookies
  4. Follow the same steps as in Chrome to export the cookies

The cookies.json file should be structured as a JSON array of cookie objects. See cookies.json.example for the required format.

⚠️ Important: Cookies expire periodically and may need to be re-exported if the scraper stops authenticating properly.
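Before running the scraper, it can save a failed run to sanity-check the exported file. The sketch below is an illustrative helper (not part of the scraper itself); it assumes the cookie names listed above, and as noted, the exact set your session uses may differ:

```python
import json

# Cookie names the scraper is likely to need (an assumption based on the
# list above; your session's exact set may vary).
REQUIRED = {"sessionid", "ds_user_id", "csrftoken"}

def validate_cookie_file(path="cookies.json"):
    """Check that the file is a JSON array of cookie objects and that
    the commonly required authentication cookies are present."""
    with open(path) as f:
        cookies = json.load(f)
    if not isinstance(cookies, list):
        raise ValueError("cookies.json must be a JSON array of cookie objects")
    names = {c.get("name") for c in cookies}
    missing = REQUIRED - names
    if missing:
        raise ValueError(f"missing cookies: {sorted(missing)}")
    return cookies
```

This only catches structural problems (invalid JSON, missing names); expired or revoked cookies will still pass the check and fail at login time.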

Configuration (config.ini)

Edit config.ini to configure the scraper behavior:

[General]
target_usernames = user1,user2,user3  # Comma-separated list of Threads usernames to scrape
cookie_file_path = cookies.json       # Path to your exported cookies file
output_directory = ./output_data      # Directory to save scraped JSON data
log_file = ./logs/scraper.log         # Path to log file
headless_browser = true               # Set to false to see browser activities

[ScrapingDelays]
action_min = 2.5                      # Minimum delay for general actions (seconds)
action_max = 5.0                      # Maximum delay for general actions
post_min = 5.0                        # Minimum delay between scraping individual posts
post_max = 10.0                       # Maximum delay between scraping individual posts
scroll_min = 1.5                      # Minimum delay after a scroll action
scroll_max = 3.0                      # Maximum delay after a scroll action

[Retries]
max_attempts = 3                      # Maximum retry attempts for failing operations
retry_delay_seconds = 5               # Delay before retrying operations
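As a sketch of how these values might be consumed (the function names here are hypothetical, not the scraper's actual API), note that the inline `# ...` comments require `configparser` to be configured with `inline_comment_prefixes`, and the randomized delays map naturally onto `random.uniform`:

```python
import configparser
import random
import time

def load_config(path="config.ini"):
    # inline_comment_prefixes strips the trailing "# ..." comments
    # used in the config.ini shown above.
    cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cfg.read(path)
    return cfg

def random_delay(cfg, prefix="action"):
    """Sleep for a random interval between <prefix>_min and <prefix>_max
    from the [ScrapingDelays] section, and return the chosen delay."""
    lo = cfg.getfloat("ScrapingDelays", f"{prefix}_min")
    hi = cfg.getfloat("ScrapingDelays", f"{prefix}_max")
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay
```

The same `random_delay` helper covers all three delay pairs (`action`, `post`, `scroll`) by switching the prefix.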

Build Docker Image

docker-compose build scraper_app

Running the Scraper

docker-compose run --rm scraper_app

The --rm flag removes the container after completion, which is good practice to avoid accumulating stopped containers.

Output

The scraper produces JSON files in the output_directory specified in your config.ini. Files are named using the pattern username_YYYYMMDD_HHMMSS.json.

Output Structure

{
  "scraped_username": "username",
  "scrape_timestamp": "2023-06-30T15:45:20.123456+00:00",
  "posts": [
    {
      "post_url": "https://www.threads.net/@username/post/ABC123",
      "text": "The content of the post...",
      "image_urls": [
        "https://example.com/image1.jpg",
        "https://example.com/image2.jpg"
      ],
      "likes": 123,
      "reposts": 45,
      "comment_texts": [
        "This is a comment!",
        "Another comment here..."
      ]
    },
    // Additional posts...
  ]
}
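Because the output is plain JSON, downstream analysis needs no special tooling. A minimal sketch (the `summarize` helper is illustrative, not part of the project; field names match the structure shown above):

```python
import json

def summarize(path):
    """Aggregate basic engagement numbers from one scraper output file."""
    with open(path) as f:
        data = json.load(f)
    posts = data.get("posts", [])
    return {
        "username": data.get("scraped_username"),
        "post_count": len(posts),
        "total_likes": sum(p.get("likes", 0) for p in posts),
        "total_reposts": sum(p.get("reposts", 0) for p in posts),
        "total_comments": sum(len(p.get("comment_texts", [])) for p in posts),
    }
```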

Troubleshooting Common Issues

Authentication Issues

Problem: Scraper doesn't seem logged in, gets 404 errors, or only accesses a few posts

  • Cause: Incorrect or incomplete cookies.json
  • Solution:
    • Re-export cookies carefully from your browser
    • Ensure ALL necessary cookies from both .threads.net and .instagram.com are included
    • Verify JSON validity with a JSON validator
    • Check that the cookie_file_path in config.ini is correct

Data Extraction Issues

Problem: No data extracted or empty fields, errors related to 'selectors'

  • Cause: Threads website HTML structure may have changed
  • Solution:
    • Update CSS selectors in constants.py by inspecting the current Threads website
    • This requires some HTML/CSS knowledge to identify new selectors

Permission Errors

Problem: Permission denied errors for output/logs directory

  • Cause: Docker volume-mount permission issues
  • Solution:
    • Check docker-compose.yml volume mounts
    • Ensure directories exist or can be created by your user
    • Run Docker commands with sudo if needed

Timeout Issues

Problem: Script times out frequently

  • Cause: Network issues or Threads rate-limiting
  • Solution:
    • Increase timeout values in the code
    • Increase delay values in config.ini
    • Check your network connection
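The [Retries] settings above are what absorb most transient timeouts. A minimal sketch of how such settings could drive a retry loop (the `with_retries` name is hypothetical; the scraper's actual retry code may differ):

```python
import time

def with_retries(operation, max_attempts=3, retry_delay_seconds=5):
    """Re-run a flaky operation, mirroring the [Retries] config keys.
    Raises the last error if all attempts fail."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(retry_delay_seconds)
    raise last_error
```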

Empty Results

Problem: No posts found for a valid username

  • Cause: User profile might be private or doesn't exist
  • Solution:
    • Verify the username exists and is public
    • Check if you can view their profile when logged in through a browser

Disclaimer

  • This tool is for personal, educational, or research purposes only, where permitted.
  • Users are solely responsible for complying with Threads' Terms of Service and any applicable laws regarding data scraping and account usage.
  • Scraping can be intensive on websites and may carry risks for your Threads account if not used responsibly. Use polite delays.
  • The maintainers of this project are not responsible for any misuse or violations committed by users.
  • The project may require updates as Threads changes its website structure.
