A comprehensive web crawler and text mining toolkit for scraping PTT (批踢踢實業坊) board data with Chinese NLP support.
- Crawls PTT board posts and comments
- Handles 18+ age verification automatically
- Exports data to JSONL format (streaming-friendly)
- Configurable rate limiting to prevent server overload
- Batch processing of multiple boards
- PostgreSQL Database - Structured storage with relational schema
- Jieba Text Segmentation - Chinese word segmentation for NLP tasks
- Automatic text preprocessing during import
- Separate tables for articles and comments
- Efficient indexing for fast queries
The segmented text (`j_content`) enables various text mining analyses:
- TF-IDF (Term Frequency-Inverse Document Frequency) - Keyword extraction
- Word Cloud - Visualization of frequent terms
- Sentiment Analysis - Using Chinese sentiment dictionaries
- Topic Modeling - LDA, NMF for discovering topics
- N-gram Analysis - Common phrase extraction
- Clone the repository:

```bash
git clone <repository-url>
cd ptt-textmining
```

- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

The crawler supports three modes of operation:
View all boards listed in `data/boards.txt`:

```bash
python src/ptt_textmining/crawler.py -l
# or
python src/ptt_textmining/crawler.py --list
```

Crawl a single board by name:
```bash
python src/ptt_textmining/crawler.py -b NBA
# or
python src/ptt_textmining/crawler.py --board Stock
```

First, add board names to `data/boards.txt` (one per line):

```
Stock
NBA
Gossiping
MobileComm
```
Then run:
```bash
python src/ptt_textmining/crawler.py -f
# or
python src/ptt_textmining/crawler.py --file
```

Note: When using file mode (`-f`), boards are processed sequentially and removed from `boards.txt` after completion.

For a full list of options, run:

```bash
python src/ptt_textmining/crawler.py -h
```

Basic Options:
- `-b, --board BOARD` - Crawl a specific board by name
- `-f, --file` - Crawl all boards listed in `data/boards.txt`
- `-l, --list` - List all available boards in `boards.txt`
- `-h, --help` - Show help message
Rate Limiting Options:
- `--article-delay SECONDS` - Delay between articles in seconds (default: 0.1)
- `--page-delay SECONDS` - Delay between pages in seconds (default: 0.5)
Examples:
```bash
# Use default rate limits
python src/ptt_textmining/crawler.py -b NBA

# Faster crawling (use with caution)
python src/ptt_textmining/crawler.py -b NBA --article-delay 0.05 --page-delay 0.2

# Slower/safer crawling
python src/ptt_textmining/crawler.py -b NBA --article-delay 0.5 --page-delay 1.0

# Crawl from file with custom delays
python src/ptt_textmining/crawler.py -f --article-delay 0.2 --page-delay 0.8
```

Output files will be saved to the `output/` directory as JSONL files.
Data is saved in JSONL (JSON Lines) format - one JSON object per line. Each line represents a single article with the following structure:
{"id": "article_id", "author": "author_name", "title": "post_title", "date": "post_date", "content": "main_content", "comments": {"1": {"status": "推/噓/→", "commenter": "user_id", "content": "comment_text", "datetime": "comment_time"}}}Example formatted for readability:
```json
{
  "id": "M.1234567890.A.123",
  "author": "username (User Name)",
  "title": "[問題] Example Title",
  "date": "Mon Jan 01 12:00:00 2024",
  "content": "Article content here...",
  "comments": {
    "1": {
      "status": "推",
      "commenter": "user123",
      "content": "Great post!",
      "datetime": "01/01 12:30"
    }
  }
}
```

Output files: `output/{board_name}.jsonl` (e.g., `output/NBA.jsonl`)
Python (pandas):

```python
import pandas as pd

df = pd.read_json('output/NBA.jsonl', lines=True)
```

Python (standard library):
```python
import json

with open('output/NBA.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        print(article['title'])
```

Command line (jq):

```bash
cat output/NBA.jsonl | jq '.title'
```

This project includes a PostgreSQL database for storing and analyzing crawled data with jieba text segmentation.
```bash
docker compose up -d
```

This will start a PostgreSQL container with the database schema automatically created.
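For reference, the `docker-compose.yml` is roughly along the lines of the sketch below. This is an assumption based on the default credentials and the `database/init.sql` file listed later in this README; the image tag, service name, and volume name are hypothetical, and the actual file in the repository is authoritative.

```yaml
# Hypothetical sketch of docker-compose.yml; image tag and names are assumptions.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: ptt_textmining
      POSTGRES_USER: ptt_user
      POSTGRES_PASSWORD: ptt_password
    ports:
      - "5432:5432"
    volumes:
      # Scripts in /docker-entrypoint-initdb.d run on first startup,
      # which is how the schema gets created automatically.
      - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```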
```bash
# Import a specific board
python src/ptt_textmining/import_to_db.py output/NBA.jsonl

# Specify board name manually
python src/ptt_textmining/import_to_db.py output/NBA.jsonl -b NBA
```

The import script will:
- Read JSONL files line by line
- Segment Chinese text using jieba (see the sketch after this list)
- Store articles and comments in separate tables
- Add a `j_content` column with jieba-segmented text
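The segmentation step can be pictured as follows. This is a minimal sketch rather than the actual `import_to_db.py` code, assuming `j_content` is simply the jieba tokens joined with spaces (as described in the Notes at the end of this README):

```python
import jieba

def segment(text: str) -> str:
    """Return space-separated jieba tokens, suitable for a j_content column."""
    # Precise-mode segmentation; drop whitespace-only tokens
    return ' '.join(tok for tok in jieba.cut(text, cut_all=False) if tok.strip())

print(segment("今天NBA比賽非常精彩"))  # e.g. "今天 NBA 比賽 非常 精彩" (exact split may vary)
```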
Articles Table:
- `id` - Primary key
- `article_id` - PTT article ID (unique)
- `board` - Board name
- `author` - Article author
- `title` - Article title
- `date` - Post date
- `content` - Original content
- `j_content` - Jieba-segmented content
- `created_at`, `updated_at` - Timestamps
Comments Table:
- `id` - Primary key
- `article_id` - Foreign key to articles
- `comment_index` - Comment number
- `status` - Push type (推/噓/→)
- `commenter` - Commenter username
- `content` - Original comment
- `j_content` - Jieba-segmented comment
- `datetime` - Comment timestamp
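The schema defined in `database/init.sql` looks roughly like the following sketch. Column types, lengths, and index names are assumptions inferred from the field lists above; consult the actual file for the authoritative definition.

```sql
-- Hypothetical approximation of database/init.sql; types and constraints are assumptions.
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    article_id  VARCHAR(64) UNIQUE NOT NULL,  -- PTT article ID, e.g. M.1234567890.A.123
    board       VARCHAR(64),
    author      TEXT,
    title       TEXT,
    date        TEXT,
    content     TEXT,
    j_content   TEXT,                         -- jieba-segmented content
    created_at  TIMESTAMP DEFAULT NOW(),
    updated_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS comments (
    id            SERIAL PRIMARY KEY,
    article_id    VARCHAR(64) REFERENCES articles(article_id),
    comment_index INTEGER,
    status        VARCHAR(8),   -- 推 / 噓 / →
    commenter     TEXT,
    content       TEXT,
    j_content     TEXT,
    datetime      TEXT
);

-- Indexes to support the board and join queries shown below
CREATE INDEX IF NOT EXISTS idx_articles_board ON articles(board);
CREATE INDEX IF NOT EXISTS idx_comments_article_id ON comments(article_id);
```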
Default credentials:
- Host: localhost
- Port: 5432
- Database: ptt_textmining
- User: ptt_user
- Password: ptt_password
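With these defaults you can connect directly from the host, for example with the `psql` client (assuming it is installed locally); enter `ptt_password` when prompted:

```bash
psql -h localhost -p 5432 -U ptt_user -d ptt_textmining
```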
Sample queries:

```sql
-- Get all articles from a board
SELECT * FROM articles WHERE board = 'NBA';

-- Get article with comments
SELECT a.title, c.commenter, c.content
FROM articles a
LEFT JOIN comments c ON a.article_id = c.article_id
WHERE a.article_id = 'M.1759363294.A.FB7';

-- Search jieba-segmented content
SELECT title, j_content
FROM articles
WHERE j_content LIKE '%球員%';
```

To stop the database:

```bash
docker compose down
```

To remove all data:
```bash
docker compose down -v
```

Once data is imported to the database, you can perform various text mining analyses using the jieba-segmented content:
TF-IDF keyword extraction:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import psycopg2
import pandas as pd

# Connect to database
conn = psycopg2.connect(
    host="localhost", database="ptt_textmining",
    user="ptt_user", password="ptt_password"
)

# Load segmented content
df = pd.read_sql("SELECT title, j_content FROM articles WHERE board='NBA'", conn)

# Calculate TF-IDF
vectorizer = TfidfVectorizer(max_features=100)
tfidf_matrix = vectorizer.fit_transform(df['j_content'].fillna(''))

# Get top keywords
feature_names = vectorizer.get_feature_names_out()
print("Top keywords:", feature_names[:20])
```
Word frequency analysis:

```python
from collections import Counter

# Get all segmented words
cursor = conn.cursor()
cursor.execute("SELECT j_content FROM articles WHERE board='NBA'")
all_text = ' '.join([row[0] for row in cursor.fetchall() if row[0]])

# Count word frequency
words = all_text.split()
word_freq = Counter(words)
print("Most common words:", word_freq.most_common(20))
```

Word cloud:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate word cloud from segmented text
wordcloud = WordCloud(
    font_path='/System/Library/Fonts/PingFang.ttc',  # Mac Chinese font
    width=800, height=400,
    background_color='white'
).generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
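Topic modeling, listed among the possible analyses above, can be sketched with scikit-learn's LDA implementation. This is a minimal illustration rather than part of the project code; it reuses the `df` DataFrame from the TF-IDF example, and the topic count is an arbitrary choice:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Term counts over the jieba-segmented articles
count_vec = CountVectorizer(max_features=1000)
doc_term = count_vec.fit_transform(df['j_content'].fillna(''))

# Fit a 5-topic LDA model (topic count chosen arbitrarily for illustration)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the top 10 words for each topic
terms = count_vec.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = terms[weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {' '.join(top_terms)}")
```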
Dependencies:

- `requests` - HTTP library for making requests
- `beautifulsoup4` - HTML parsing library
- `jieba` - Chinese text segmentation
- `psycopg2-binary` - PostgreSQL database adapter
- `scikit-learn` - TF-IDF and machine learning
- `pandas` - Data analysis
- `wordcloud` - Word cloud visualization
- `matplotlib` - Plotting and visualization
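These map to a `requirements.txt` roughly like the following; the actual file may differ, for example by pinning specific versions:

```
requests
beautifulsoup4
jieba
psycopg2-binary
scikit-learn
pandas
wordcloud
matplotlib
```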
Project structure:

```
ptt-textmining/
├── data/
│   └── boards.txt            # List of boards to crawl
├── database/
│   └── init.sql              # Database schema
├── output/                   # Crawled data (JSONL files)
├── src/
│   └── ptt_textmining/
│       ├── __init__.py
│       ├── crawler.py        # Web crawler
│       └── import_to_db.py   # Database importer with jieba
├── docker-compose.yml        # PostgreSQL container
├── requirements.txt
└── README.md
```
- Rate Limiting: Default delays are 0.1s between articles and 0.5s between pages. These can be customized using the `--article-delay` and `--page-delay` options
- Output Format: Data is saved in JSONL format (one JSON object per line) for efficient streaming and processing
- File Mode: When using `-f`, boards are processed sequentially and removed from `boards.txt` after completion
- Database: PostgreSQL with jieba text segmentation for Chinese text analysis
- Text Mining: The `j_content` field contains space-separated Chinese words ready for NLP analysis