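Scrape Your Favorite Threads Users' Posts into an E-book

A tool that scrapes a Threads user's posts and saves them as an e-book in EPUB, PDF, or TXT format.

The scraper uses Playwright rather than Selenium; for the underlying technique, see the Scrapfly article on scraping Threads. The project structure and code are modeled on weiboSpider. All code in this project was generated by Cursor.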
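
As a rough illustration of the technique, the Scrapfly approach is usually described as loading the page in a headless browser and then reading post data from JSON embedded in `<script>` tags, rather than parsing the rendered HTML. The sketch below covers only that extraction step and uses the standard library in place of parsel and nested-lookup so that it runs on its own. The `thread_items` key, the `application/json` script type, and all helper names are assumptions made for illustration, not confirmed details of this project.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collect the text of every <script type="application/json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append("".join(self._buffer))
            self._in_json_script = False


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def extract_thread_items(html):
    """Return all values found under the assumed "thread_items" key."""
    collector = JsonScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for payload in collector.payloads:
        if "thread_items" not in payload:  # cheap pre-filter before parsing
            continue
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip script tags that are not valid JSON
        found.extend(find_key(data, "thread_items"))
    return found
```

In a real run, the `html` argument would come from the browser page content after Playwright has loaded the profile; here it is just a string, which keeps the parsing logic easy to test in isolation.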

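Limitations and Planned Features

Simulated login is not yet supported. Because Threads shows guests only a limited number of posts, the tool can currently scrape only about the first 30 posts on a user's profile page.

Planned features:

Log in with your own account so that all posts can be scraped.
Scrape only posts within a given time range.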
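
The time-range feature is not implemented yet, so the following is only one possible shape for it. It assumes each scraped post is a dict carrying a Unix timestamp in seconds (UTC) under a hypothetical `published_on` field; the actual field name in this project may differ.

```python
from datetime import datetime, timezone


def filter_by_time_range(posts, start=None, end=None):
    """Keep posts whose timestamp falls within [start, end], inclusive.

    `start` and `end` are timezone-aware datetimes; None means unbounded.
    """
    kept = []
    for post in posts:
        # "published_on" (Unix seconds, UTC) is a hypothetical field name.
        published = datetime.fromtimestamp(post["published_on"], tz=timezone.utc)
        if start is not None and published < start:
            continue
        if end is not None and published > end:
            continue
        kept.append(post)
    return kept
```

Filtering after scraping is the simplest option, although it does not reduce the number of pages fetched.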

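Features

Scrape a specified user's original posts
Scrape post comments (optional)
Scrape post images (optional)
Convert the content into an e-book in EPUB, PDF, or TXT format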

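Installation

Clone the repository:

git clone https://github.com/kokonot8/threads_to_ebook.git
cd threads_to_ebook

Install the dependencies:

pip install -r requirements.txt

If Playwright's browser binaries are not already installed, they can typically be downloaded with:

playwright install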

Usage

Basic usage:

python -m threads_spider username

With all options:

python -m threads_spider username --comments --images --output ./output --format epub

Options:

username: the Threads username to scrape (required)
--comments: also scrape comments (optional)
--images: also scrape images (optional)
--output: output directory (default: "output")
--format: output format, one of epub|pdf|txt (default: epub)
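
For readers who want to see how such an interface could be declared, here is a minimal argparse sketch that mirrors the options listed above. It is not the project's actual parser; the function name `build_parser` is hypothetical, and treating `--comments` and `--images` as boolean flags is an assumption based on the usage line.

```python
import argparse


def build_parser():
    """Build a parser matching the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="threads_spider",
        description="Scrape a Threads user's posts into an e-book.",
    )
    parser.add_argument("username", help="Threads username to scrape")
    # Assumed to be on/off switches that take no value.
    parser.add_argument("--comments", action="store_true",
                        help="also scrape comments")
    parser.add_argument("--images", action="store_true",
                        help="also scrape images")
    parser.add_argument("--output", default="output",
                        help="output directory")
    parser.add_argument("--format", choices=["epub", "pdf", "txt"],
                        default="epub", help="output format")
    return parser
```

Calling `build_parser().parse_args(["someuser", "--format", "pdf"])` yields a namespace with `username`, `comments`, `images`, `output`, and `format` attributes.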

Output

The program writes a file in the chosen format to the specified output directory:

EPUB: {username}_threads.epub
PDF: {username}_threads.pdf
TXT: {username}_threads.txt
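
The naming pattern above can be expressed as a small helper. This is a sketch for illustration only; `output_path` and `SUPPORTED_FORMATS` are hypothetical names, not functions taken from the project.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("epub", "pdf", "txt")


def output_path(username, fmt="epub", output_dir="output"):
    """Return the path {output_dir}/{username}_threads.{fmt}."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Path(output_dir) / f"{username}_threads.{fmt}"
```

For example, `output_path("someuser", "pdf")` gives the path `output/someuser_threads.pdf` relative to the working directory.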

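Dependencies

requests: HTTP requests
beautifulsoup4: HTML parsing
playwright: browser automation and scraping
ebooklib: EPUB file generation
pillow: image processing
reportlab: PDF file generation
jmespath: JSON data querying
parsel: HTML/XML parsing
nested-lookup: searching nested data structures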

License

MIT License
