
rekon307/ThreadsScraper


Threads Post Scraper

Overview

A Python-based scraper for Threads.net posts, comments, and engagement metrics using Playwright and Docker. The scraper authenticates with user-provided session cookies to collect post content, images, comments, and metrics (likes, reposts) from specified Threads accounts.

Features

  • Scrapes multiple Threads user profiles in a single run
  • Extracts post text content, images, and videos
  • Collects engagement metrics (like and repost counts)
  • Gathers comments on posts, with configurable limits
  • Uses randomized delays for respectful scraping
  • Stores data as structured JSON for easy analysis
  • Runs in headless or visible browser mode
  • Implements retry mechanisms for robust operation
  • Logs scraper activity comprehensively

Prerequisites

  • Docker and Docker Compose installed on your system
  • A valid and active Threads account (to export cookies from)
  • Basic familiarity with command line operations

Setup Instructions

Clone or Download the Repository

git clone https://github.com/yourusername/threads-scraper.git
cd threads-scraper

Cookie Extraction (Critical)

To authenticate with Threads, you need to export cookies from your logged-in browser session:

Chrome:

  1. Log into Threads.net with your account

  2. Open Chrome DevTools (F12 or Right-click > Inspect)

  3. Go to Application tab > Storage > Cookies

  4. Look for cookies under domains:

    • https://www.threads.net
    • https://www.instagram.com (may contain relevant authentication cookies)
  5. Important cookies to look for include:

    • sessionid
    • ds_user_id
    • csrftoken
    • ig_did
    • mid

    Note: The exact set may vary depending on your session.

  6. Export the full cookie objects (including name, value, domain, path, expires, httpOnly, secure, sameSite) to cookies.json

Firefox:

  1. Log into Threads.net
  2. Open Firefox DevTools (F12)
  3. Go to Storage tab > Cookies
  4. Follow the same steps as in Chrome to export the cookies

The cookies.json file should be structured as a JSON array of cookie objects. See cookies.json.example for the required format.

⚠️ Important: Cookies expire periodically and may need to be re-exported if the scraper stops authenticating properly.
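Before running the scraper, it can save a failed run to sanity-check the exported file. The sketch below is an illustrative helper (not part of the scraper itself); it assumes the cookie names listed above, and as noted, the exact set your session uses may differ:

```python
import json

# Cookie names the scraper is likely to need (an assumption based on the
# list above; your session's exact set may vary).
REQUIRED = {"sessionid", "ds_user_id", "csrftoken"}

def validate_cookie_file(path="cookies.json"):
    """Check that the file is a JSON array of cookie objects and that
    the commonly required authentication cookies are present."""
    with open(path) as f:
        cookies = json.load(f)
    if not isinstance(cookies, list):
        raise ValueError("cookies.json must be a JSON array of cookie objects")
    names = {c.get("name") for c in cookies}
    missing = REQUIRED - names
    if missing:
        raise ValueError(f"missing cookies: {sorted(missing)}")
    return cookies
```

This only catches structural problems (invalid JSON, missing names); expired or revoked cookies will still pass the check and fail at login time.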

Configuration (config.ini)

Edit config.ini to configure the scraper behavior:

[General]
target_usernames = user1,user2,user3  # Comma-separated list of Threads usernames to scrape
cookie_file_path = cookies.json       # Path to your exported cookies file
output_directory = ./output_data      # Directory to save scraped JSON data
log_file = ./logs/scraper.log         # Path to log file
headless_browser = true               # Set to false to see browser activities

[ScrapingDelays]
action_min = 2.5                      # Minimum delay for general actions (seconds)
action_max = 5.0                      # Maximum delay for general actions
post_min = 5.0                        # Minimum delay between scraping individual posts
post_max = 10.0                       # Maximum delay between scraping individual posts
scroll_min = 1.5                      # Minimum delay after a scroll action
scroll_max = 3.0                      # Maximum delay after a scroll action

[Retries]
max_attempts = 3                      # Maximum retry attempts for failing operations
retry_delay_seconds = 5               # Delay before retrying operations
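As a sketch of how these values might be consumed (the function names here are hypothetical, not the scraper's actual API), note that the inline `# ...` comments require `configparser` to be configured with `inline_comment_prefixes`, and the randomized delays map naturally onto `random.uniform`:

```python
import configparser
import random
import time

def load_config(path="config.ini"):
    # inline_comment_prefixes strips the trailing "# ..." comments
    # used in the config.ini shown above.
    cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cfg.read(path)
    return cfg

def random_delay(cfg, prefix="action"):
    """Sleep for a random interval between <prefix>_min and <prefix>_max
    from the [ScrapingDelays] section, and return the chosen delay."""
    lo = cfg.getfloat("ScrapingDelays", f"{prefix}_min")
    hi = cfg.getfloat("ScrapingDelays", f"{prefix}_max")
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay
```

The same `random_delay` helper covers all three delay pairs (`action`, `post`, `scroll`) by switching the prefix.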

Build Docker Image

docker-compose build scraper_app

Running the Scraper

docker-compose run --rm scraper_app

The --rm flag removes the container after completion, which is good practice to avoid accumulating stopped containers.

Output

The scraper produces JSON files in the output_directory specified in your config.ini. Files are named using the pattern username_YYYYMMDD_HHMMSS.json.

Output Structure

{
  "scraped_username": "username",
  "scrape_timestamp": "2023-06-30T15:45:20.123456+00:00",
  "posts": [
    {
      "post_url": "https://www.threads.net/@username/post/ABC123",
      "text": "The content of the post...",
      "image_urls": [
        "https://example.com/image1.jpg",
        "https://example.com/image2.jpg"
      ],
      "likes": 123,
      "reposts": 45,
      "comment_texts": [
        "This is a comment!",
        "Another comment here..."
      ]
    },
    // Additional posts...
  ]
}
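Because the output is plain JSON, downstream analysis needs no special tooling. A minimal sketch (the `summarize` helper is illustrative, not part of the project; field names match the structure shown above):

```python
import json

def summarize(path):
    """Aggregate basic engagement numbers from one scraper output file."""
    with open(path) as f:
        data = json.load(f)
    posts = data.get("posts", [])
    return {
        "username": data.get("scraped_username"),
        "post_count": len(posts),
        "total_likes": sum(p.get("likes", 0) for p in posts),
        "total_reposts": sum(p.get("reposts", 0) for p in posts),
        "total_comments": sum(len(p.get("comment_texts", [])) for p in posts),
    }
```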

Troubleshooting Common Issues

Authentication Issues

Problem: Scraper doesn't seem logged in, gets 404 errors, or only accesses a few posts

  • Cause: Incorrect or incomplete cookies.json
  • Solution:
    • Re-export cookies carefully from your browser
    • Ensure ALL necessary cookies from both .threads.net and .instagram.com are included
    • Verify JSON validity with a JSON validator
    • Check that the cookie_file_path in config.ini is correct

Data Extraction Issues

Problem: No data extracted or empty fields, errors related to 'selectors'

  • Cause: Threads website HTML structure may have changed
  • Solution:
    • Update CSS selectors in constants.py by inspecting the current Threads website
    • This requires some HTML/CSS knowledge to identify new selectors

Permission Errors

Problem: Permission denied errors for output/logs directory

  • Cause: Docker volume-mount permission issues
  • Solution:
    • Check docker-compose.yml volume mounts
    • Ensure directories exist or can be created by your user
    • Run Docker commands with sudo if needed

Timeout Issues

Problem: Script times out frequently

  • Cause: Network issues or Threads rate-limiting
  • Solution:
    • Increase timeout values in the code
    • Increase delay values in config.ini
    • Check your network connection
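The [Retries] settings above are what absorb most transient timeouts. A minimal sketch of how such settings could drive a retry loop (the `with_retries` name is hypothetical; the scraper's actual retry code may differ):

```python
import time

def with_retries(operation, max_attempts=3, retry_delay_seconds=5):
    """Re-run a flaky operation, mirroring the [Retries] config keys.
    Raises the last error if all attempts fail."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(retry_delay_seconds)
    raise last_error
```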

Empty Results

Problem: No posts found for a valid username

  • Cause: User profile might be private or doesn't exist
  • Solution:
    • Verify the username exists and is public
    • Check if you can view their profile when logged in through a browser

Disclaimer

  • This tool is for personal, educational, or research purposes only, where permitted.
  • Users are solely responsible for complying with Threads' Terms of Service and any applicable laws regarding data scraping and account usage.
  • Scraping can be intensive on websites and may carry risks for your Threads account if not used responsibly. Use polite delays.
  • The maintainers of this project are not responsible for any misuse or violations committed by users.
  • The project may require updates as Threads changes its website structure.
