A Python-based scraper for Threads.net posts, comments, and engagement metrics using Playwright and Docker. The scraper authenticates with user-provided session cookies to collect post content, images, comments, and metrics (likes, reposts) from specified Threads accounts.
- Scrapes multiple Threads user profiles in a single run
- Extracts post text content, images, and videos
- Collects engagement metrics (likes and reposts counts)
- Gathers comments on posts with configurable limits
- Uses randomized delays for respectful scraping
- Stores data in structured JSON format for easy analysis
- Runs in headless or visible browser mode
- Implements retry mechanisms for robust operation
- Logs scraper activities to a configurable log file
- Docker and Docker Compose installed on your system
- A valid and active Threads account (to export cookies from)
- Basic familiarity with command line operations
git clone https://github.com/yourusername/threads-scraper.git
cd threads-scraper
To authenticate with Threads, you need to export cookies from your logged-in browser session:
- Log into Threads.net with your account
- Open Chrome DevTools (F12 or Right-click > Inspect)
- Go to Application tab > Storage > Cookies
- Look for cookies under these domains: https://www.threads.net and https://www.instagram.com (may contain relevant authentication cookies)
- Important cookies to look for include sessionid, ds_user_id, csrftoken, ig_did, and mid. Note: The exact set may vary depending on your session.
- Export the full cookie objects (including name, value, domain, path, expires, httpOnly, secure, sameSite) to cookies.json
- Log into Threads.net
- Open Firefox DevTools (F12)
- Go to Storage tab > Cookies
- Follow similar steps as for Chrome to export cookies
The cookies.json file should be structured as a JSON array of cookie objects. See cookies.json.example for the required format.
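Before running the scraper, it can help to sanity-check the exported file. The following is a minimal sketch, not the project's own validation: the function name `find_cookie_problems` is hypothetical, the required keys are an assumption based on the fields listed above, and the allowed `sameSite` values are the ones Playwright accepts (some export tools emit other spellings). Treat cookies.json.example as the authoritative format.

```python
import json

# Assumed minimum fields; cookies.json.example in the repository is authoritative.
REQUIRED_KEYS = {"name", "value", "domain", "path"}
# sameSite values accepted by Playwright; other spellings would need converting.
ALLOWED_SAME_SITE = {"Strict", "Lax", "None"}


def find_cookie_problems(raw_text):
    """Return a list of readable problems; an empty list means none were found."""
    try:
        cookies = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(cookies, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for index, cookie in enumerate(cookies):
        if not isinstance(cookie, dict):
            problems.append(f"cookie {index}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - cookie.keys()
        if missing:
            problems.append(f"cookie {index}: missing keys {sorted(missing)}")
        same_site = cookie.get("sameSite")
        if same_site is not None and same_site not in ALLOWED_SAME_SITE:
            problems.append(f"cookie {index}: unexpected sameSite {same_site!r}")
    return problems


# Placeholder cookie for illustration only; the value is not a real session.
sample = json.dumps([
    {
        "name": "sessionid",
        "value": "PLACEHOLDER",
        "domain": ".threads.net",
        "path": "/",
        "httpOnly": True,
        "secure": True,
        "sameSite": "Lax",
    }
])
print(find_cookie_problems(sample))  # → []
```

To check a real export, pass the file contents instead of the sample, for example `find_cookie_problems(open("cookies.json").read())`.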
Edit config.ini to configure the scraper behavior:
[General]
target_usernames = user1,user2,user3 # Comma-separated list of Threads usernames to scrape
cookie_file_path = cookies.json # Path to your exported cookies file
output_directory = ./output_data # Directory to save scraped JSON data
log_file = ./logs/scraper.log # Path to log file
headless_browser = true # Set to false to see browser activities
[ScrapingDelays]
action_min = 2.5 # Minimum delay for general actions (seconds)
action_max = 5.0 # Maximum delay for general actions
post_min = 5.0 # Minimum delay between scraping individual posts
post_max = 10.0 # Maximum delay between scraping individual posts
scroll_min = 1.5 # Minimum delay after a scroll action
scroll_max = 3.0 # Maximum delay after a scroll action
[Retries]
max_attempts = 3 # Maximum retry attempts for failing operations
retry_delay_seconds = 5 # Delay before retrying operations
docker-compose build scraper_app
docker-compose run --rm scraper_app
The --rm flag removes the container after completion, which is good practice to avoid accumulating stopped containers.
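For readers curious how the config.ini settings shown earlier might be consumed, here is a rough sketch using Python's standard `configparser`. It is an illustration under assumptions, not the scraper's actual loading code: `load_settings` and `random_delay` are hypothetical names, and the sample text is a shortened copy of the file above. One detail worth noting is that `configparser` only strips trailing `# ...` comments when `inline_comment_prefixes` is set; otherwise the comment text becomes part of the value.

```python
import configparser
import random

# Shortened copy of the config.ini shown earlier, used here as inline sample data.
SAMPLE_CONFIG = """
[General]
target_usernames = user1,user2,user3 # Comma-separated list of Threads usernames
headless_browser = true # Set to false to see browser activities

[ScrapingDelays]
action_min = 2.5 # Minimum delay for general actions (seconds)
action_max = 5.0 # Maximum delay for general actions

[Retries]
max_attempts = 3 # Maximum retry attempts for failing operations
"""


def load_settings(config_text):
    """Parse config text into a plain dict (hypothetical helper)."""
    # Without inline_comment_prefixes, "true # Set to ..." would be read verbatim.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(config_text)
    general = parser["General"]
    delays = parser["ScrapingDelays"]
    return {
        "usernames": [
            name.strip()
            for name in general["target_usernames"].split(",")
            if name.strip()
        ],
        "headless": general.getboolean("headless_browser"),
        "action_delay": (delays.getfloat("action_min"), delays.getfloat("action_max")),
        "max_attempts": parser["Retries"].getint("max_attempts"),
    }


def random_delay(bounds):
    """Pick a delay in seconds, uniformly between the configured min and max."""
    low, high = bounds
    return random.uniform(low, high)


settings = load_settings(SAMPLE_CONFIG)
print(settings["usernames"])  # → ['user1', 'user2', 'user3']
print(settings["action_delay"])  # → (2.5, 5.0)
```

A caller would then sleep for `random_delay(settings["action_delay"])` seconds between actions, which is one simple way to realize the randomized delays described in the configuration.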
The scraper produces JSON files in the output_directory specified in your config.ini. Files are named using the pattern username_YYYYMMDD_HHMMSS.json.
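As a small illustration of working with these files, the sketch below parses the naming pattern above and totals the engagement fields of one result. The helper names `parse_output_name` and `summarize` are hypothetical, and the field names (`scraped_username`, `posts`, `likes`, `reposts`) are taken from the example structure in this section. Splitting from the right matters because usernames may themselves contain underscores.

```python
from datetime import datetime
from pathlib import Path


def parse_output_name(filename):
    """Split 'username_YYYYMMDD_HHMMSS.json' into (username, datetime)."""
    stem = Path(filename).stem
    # rsplit keeps underscores that belong to the username itself.
    username, date_part, time_part = stem.rsplit("_", 2)
    scraped_at = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return username, scraped_at


def summarize(data):
    """Total likes and reposts across posts, treating missing values as 0."""
    posts = data.get("posts", [])
    return {
        "username": data.get("scraped_username"),
        "post_count": len(posts),
        "total_likes": sum(post.get("likes") or 0 for post in posts),
        "total_reposts": sum(post.get("reposts") or 0 for post in posts),
    }


# Inline sample mirroring the documented structure; real data comes from json.load.
sample = {
    "scraped_username": "some_user",
    "posts": [
        {"likes": 123, "reposts": 45},
        {"likes": None, "reposts": 5},
    ],
}
print(parse_output_name("some_user_20230630_154520.json"))
print(summarize(sample))
```

For a real file, load it with `json.load` and pass the resulting dict to `summarize`.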
{
"scraped_username": "username",
"scrape_timestamp": "2023-06-30T15:45:20.123456+00:00",
"posts": [
{
"post_url": "https://www.threads.net/@username/post/ABC123",
"text": "The content of the post...",
"image_urls": [
"https://example.com/image1.jpg",
"https://example.com/image2.jpg"
],
"likes": 123,
"reposts": 45,
"comment_texts": [
"This is a comment!",
"Another comment here..."
]
},
// Additional posts...
]
}
Problem: Scraper doesn't seem logged in, gets 404 errors, or only accesses a few posts
- Cause: Incorrect or incomplete cookies.json
- Solution:
- Re-export cookies carefully from your browser
- Ensure ALL necessary cookies from both .threads.net and .instagram.com are included
- Verify JSON validity with a JSON validator
- Check that the cookie_file_path in config.ini is correct
Problem: No data extracted or empty fields, errors related to 'selectors'
- Cause: Threads website HTML structure may have changed
- Solution:
- Update CSS selectors in constants.py by inspecting the current Threads website
- This requires some HTML/CSS knowledge to identify new selectors
Problem: Permission denied errors for output/logs directory
- Cause: Docker volume mount permission issues
- Solution:
- Check docker-compose.yml volume mounts
- Ensure directories exist or can be created by your user
- Run Docker commands with sudo if needed
Problem: Script times out frequently
- Cause: Network issues, or Threads rate-limiting
- Solution:
- Increase timeout values in the code
- Increase delay values in config.ini
- Check your network connection
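To show how the `max_attempts` and `retry_delay_seconds` settings from config.ini relate to timeouts, here is a minimal retry sketch. It is an assumption about the general technique (fixed delay between attempts), not the scraper's actual retry code; `run_with_retries` is a hypothetical name, and the `sleep` parameter exists only so the example can run without real waiting.

```python
import time


def run_with_retries(operation, max_attempts=3, retry_delay_seconds=5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                # Wait a fixed delay before the next attempt.
                sleep(retry_delay_seconds)
    raise last_error


# Demo: an operation that times out twice, then succeeds.
calls = {"count": 0}
recorded_sleeps = []


def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = run_with_retries(flaky_operation, sleep=recorded_sleeps.append)
print(result, recorded_sleeps)  # → ok [5, 5]
```

With a pattern like this, raising `retry_delay_seconds` gives a rate-limited site more time to recover between attempts, while raising `max_attempts` only helps when failures are transient.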
Problem: No posts found for a valid username
- Cause: User profile might be private or doesn't exist
- Solution:
- Verify the username exists and is public
- Check if you can view their profile when logged in through a browser
- This tool is for personal, educational, or research purposes only, where permitted.
- Users are solely responsible for complying with Threads' Terms of Service and any applicable laws regarding data scraping and account usage.
- Scraping can be intensive on websites and may carry risks for your Threads account if not used responsibly. Use polite delays.
- The maintainers of this project are not responsible for any misuse or violations committed by users.
- The project may require updates as Threads changes its website structure.