A comprehensive web crawler and text mining toolkit for scraping PTT (批踢踢實業坊) board data with Chinese NLP support.
- Crawls PTT board posts and comments
- Handles 18+ age verification automatically
- Exports data to JSONL format (streaming-friendly)
- Configurable rate limiting to prevent server overload
- Batch processing of multiple boards
- PostgreSQL Database - Structured storage with relational schema
- Jieba Text Segmentation - Chinese word segmentation for NLP tasks
- Automatic text preprocessing during import
- Separate tables for articles and comments
- Efficient indexing for fast queries
The segmented text (`j_content`) enables various text mining analyses:
- TF-IDF (Term Frequency-Inverse Document Frequency) - Keyword extraction
- Word Cloud - Visualization of frequent terms
- Sentiment Analysis - Using Chinese sentiment dictionaries
- Topic Modeling - LDA, NMF for discovering topics
- N-gram Analysis - Common phrase extraction
- Clone the repository:

```bash
git clone <repository-url>
cd ptt-textmining
```

- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

The crawler supports three modes of operation:
View all boards listed in `data/boards.txt`:

```bash
python src/ptt_textmining/crawler.py -l
# or
python src/ptt_textmining/crawler.py --list
```

Crawl a single board by name:
```bash
python src/ptt_textmining/crawler.py -b NBA
# or
python src/ptt_textmining/crawler.py --board Stock
```

First, add board names to `data/boards.txt` (one per line):

```
Stock
NBA
Gossiping
MobileComm
```
Then run:
```bash
python src/ptt_textmining/crawler.py -f
# or
python src/ptt_textmining/crawler.py --file
```

Note: When using file mode (`-f`), boards are processed sequentially and removed from `boards.txt` after completion.

For a full list of options, run:

```bash
python src/ptt_textmining/crawler.py -h
```

Basic Options:
- `-b, --board BOARD` - Crawl a specific board by name
- `-f, --file` - Crawl all boards listed in `data/boards.txt`
- `-l, --list` - List all available boards in `boards.txt`
- `-h, --help` - Show help message
Rate Limiting Options:
- `--article-delay SECONDS` - Delay between articles in seconds (default: 0.1)
- `--page-delay SECONDS` - Delay between pages in seconds (default: 0.5)
Examples:
```bash
# Use default rate limits
python src/ptt_textmining/crawler.py -b NBA

# Faster crawling (use with caution)
python src/ptt_textmining/crawler.py -b NBA --article-delay 0.05 --page-delay 0.2

# Slower/safer crawling
python src/ptt_textmining/crawler.py -b NBA --article-delay 0.5 --page-delay 1.0

# Crawl from file with custom delays
python src/ptt_textmining/crawler.py -f --article-delay 0.2 --page-delay 0.8
```

Output files will be saved to the `output/` directory as JSONL files.
Data is saved in JSONL (JSON Lines) format - one JSON object per line. Each line represents a single article with the following structure:
{"id": "article_id", "author": "author_name", "title": "post_title", "date": "post_date", "content": "main_content", "comments": {"1": {"status": "推/噓/→", "commenter": "user_id", "content": "comment_text", "datetime": "comment_time"}}}Example formatted for readability:
```json
{
  "id": "M.1234567890.A.123",
  "author": "username (User Name)",
  "title": "[問題] Example Title",
  "date": "Mon Jan 01 12:00:00 2024",
  "content": "Article content here...",
  "comments": {
    "1": {
      "status": "推",
      "commenter": "user123",
      "content": "Great post!",
      "datetime": "01/01 12:30"
    }
  }
}
```

Output files: `output/{board_name}.jsonl` (e.g., `output/NBA.jsonl`)
Python (pandas):

```python
import pandas as pd

df = pd.read_json('output/NBA.jsonl', lines=True)
```

Python (standard library):
```python
import json

with open('output/NBA.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        print(article['title'])
```

Command line (jq):

```bash
cat output/NBA.jsonl | jq '.title'
```

This project includes a PostgreSQL database for storing and analyzing crawled data with jieba text segmentation.
```bash
docker compose up -d
```

This will start a PostgreSQL container with the database schema automatically created.
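For reference, the `docker-compose.yml` is roughly along the lines of the sketch below. This is an assumption based on the default credentials and the `database/init.sql` file listed later in this README; the image tag, service name, and volume name are hypothetical, and the actual file in the repository is authoritative.

```yaml
# Hypothetical sketch of docker-compose.yml; image tag and names are assumptions.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: ptt_textmining
      POSTGRES_USER: ptt_user
      POSTGRES_PASSWORD: ptt_password
    ports:
      - "5432:5432"
    volumes:
      # Scripts in /docker-entrypoint-initdb.d run on first startup,
      # which is how the schema gets created automatically.
      - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```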
```bash
# Import a specific board
python src/ptt_textmining/import_to_db.py output/NBA.jsonl

# Specify board name manually
python src/ptt_textmining/import_to_db.py output/NBA.jsonl -b NBA
```

The import script will:
- Read JSONL files line by line
- Segment Chinese text using jieba (see the sketch after this list)
- Store articles and comments in separate tables
- Add a `j_content` column with jieba-segmented text
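The segmentation step can be pictured as follows. This is a minimal sketch rather than the actual `import_to_db.py` code, assuming `j_content` is simply the jieba tokens joined with spaces (as described in the Notes at the end of this README):

```python
import jieba

def segment(text: str) -> str:
    """Return space-separated jieba tokens, suitable for a j_content column."""
    # Precise-mode segmentation; drop whitespace-only tokens
    return ' '.join(tok for tok in jieba.cut(text, cut_all=False) if tok.strip())

print(segment("今天NBA比賽非常精彩"))  # e.g. "今天 NBA 比賽 非常 精彩" (exact split may vary)
```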
Articles Table:
- `id` - Primary key
- `article_id` - PTT article ID (unique)
- `board` - Board name
- `author` - Article author
- `title` - Article title
- `date` - Post date
- `content` - Original content
- `j_content` - Jieba-segmented content
- `created_at`, `updated_at` - Timestamps
Comments Table:
- `id` - Primary key
- `article_id` - Foreign key to articles
- `comment_index` - Comment number
- `status` - Push type (推/噓/→)
- `commenter` - Commenter username
- `content` - Original comment
- `j_content` - Jieba-segmented comment
- `datetime` - Comment timestamp
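The schema defined in `database/init.sql` looks roughly like the following sketch. Column types, lengths, and index names are assumptions inferred from the field lists above; consult the actual file for the authoritative definition.

```sql
-- Hypothetical approximation of database/init.sql; types and constraints are assumptions.
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    article_id  VARCHAR(64) UNIQUE NOT NULL,  -- PTT article ID, e.g. M.1234567890.A.123
    board       VARCHAR(64),
    author      TEXT,
    title       TEXT,
    date        TEXT,
    content     TEXT,
    j_content   TEXT,                         -- jieba-segmented content
    created_at  TIMESTAMP DEFAULT NOW(),
    updated_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS comments (
    id            SERIAL PRIMARY KEY,
    article_id    VARCHAR(64) REFERENCES articles(article_id),
    comment_index INTEGER,
    status        VARCHAR(8),   -- 推 / 噓 / →
    commenter     TEXT,
    content       TEXT,
    j_content     TEXT,
    datetime      TEXT
);

-- Indexes to support the board and join queries shown below
CREATE INDEX IF NOT EXISTS idx_articles_board ON articles(board);
CREATE INDEX IF NOT EXISTS idx_comments_article_id ON comments(article_id);
```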
Default credentials:
- Host: localhost
- Port: 5432
- Database: ptt_textmining
- User: ptt_user
- Password: ptt_password
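With these defaults you can connect directly from the host, for example with the `psql` client (assuming it is installed locally); enter `ptt_password` when prompted:

```bash
psql -h localhost -p 5432 -U ptt_user -d ptt_textmining
```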
Sample queries:

```sql
-- Get all articles from a board
SELECT * FROM articles WHERE board = 'NBA';

-- Get article with comments
SELECT a.title, c.commenter, c.content
FROM articles a
LEFT JOIN comments c ON a.article_id = c.article_id
WHERE a.article_id = 'M.1759363294.A.FB7';

-- Search jieba-segmented content
SELECT title, j_content
FROM articles
WHERE j_content LIKE '%球員%';
```

To stop the database:

```bash
docker compose down
```

To remove all data:
```bash
docker compose down -v
```

Once data is imported to the database, you can perform various text mining analyses using the jieba-segmented content:
TF-IDF keyword extraction:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import psycopg2
import pandas as pd

# Connect to database
conn = psycopg2.connect(
    host="localhost", database="ptt_textmining",
    user="ptt_user", password="ptt_password"
)

# Load segmented content
df = pd.read_sql("SELECT title, j_content FROM articles WHERE board='NBA'", conn)

# Calculate TF-IDF
vectorizer = TfidfVectorizer(max_features=100)
tfidf_matrix = vectorizer.fit_transform(df['j_content'].fillna(''))

# Get top keywords
feature_names = vectorizer.get_feature_names_out()
print("Top keywords:", feature_names[:20])
```
Word frequency analysis:

```python
from collections import Counter

# Get all segmented words
cursor = conn.cursor()
cursor.execute("SELECT j_content FROM articles WHERE board='NBA'")
all_text = ' '.join([row[0] for row in cursor.fetchall() if row[0]])

# Count word frequency
words = all_text.split()
word_freq = Counter(words)
print("Most common words:", word_freq.most_common(20))
```

Word cloud:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate word cloud from segmented text
wordcloud = WordCloud(
    font_path='/System/Library/Fonts/PingFang.ttc',  # Mac Chinese font
    width=800, height=400,
    background_color='white'
).generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
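Topic modeling, listed among the possible analyses above, can be sketched with scikit-learn's LDA implementation. This is a minimal illustration rather than part of the project code; it reuses the `df` DataFrame from the TF-IDF example, and the topic count is an arbitrary choice:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Term counts over the jieba-segmented articles
count_vec = CountVectorizer(max_features=1000)
doc_term = count_vec.fit_transform(df['j_content'].fillna(''))

# Fit a 5-topic LDA model (topic count chosen arbitrarily for illustration)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the top 10 words for each topic
terms = count_vec.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = terms[weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {' '.join(top_terms)}")
```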
Dependencies:

- `requests` - HTTP library for making requests
- `beautifulsoup4` - HTML parsing library
- `jieba` - Chinese text segmentation
- `psycopg2-binary` - PostgreSQL database adapter
- `scikit-learn` - TF-IDF and machine learning
- `pandas` - Data analysis
- `wordcloud` - Word cloud visualization
- `matplotlib` - Plotting and visualization
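These map to a `requirements.txt` roughly like the following; the actual file may differ, for example by pinning specific versions:

```
requests
beautifulsoup4
jieba
psycopg2-binary
scikit-learn
pandas
wordcloud
matplotlib
```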
Project structure:

```
ptt-textmining/
├── data/
│   └── boards.txt            # List of boards to crawl
├── database/
│   └── init.sql              # Database schema
├── output/                   # Crawled data (JSONL files)
├── src/
│   └── ptt_textmining/
│       ├── __init__.py
│       ├── crawler.py        # Web crawler
│       └── import_to_db.py   # Database importer with jieba
├── docker-compose.yml        # PostgreSQL container
├── requirements.txt
└── README.md
```
- Rate Limiting: Default delays are 0.1s between articles and 0.5s between pages. These can be customized using the `--article-delay` and `--page-delay` options
- Output Format: Data is saved in JSONL format (one JSON object per line) for efficient streaming and processing
- File Mode: When using `-f`, boards are processed sequentially and removed from `boards.txt` after completion
- Database: PostgreSQL with jieba text segmentation for Chinese text analysis
- Text Mining: The `j_content` field contains space-separated Chinese words ready for NLP analysis