
πŸ“š Scholar Alerts Assistant


Save your inbox from Google Scholar alert overload.

Scholar Alerts Assistant automatically processes your Google Scholar alert emails, filters out duplicates, highlights priority papers based on your keywords, and sends you a cleanly formatted digest email. No more cluttered inbox: just the papers that matter to you.

✨ Features

  • High Performance: Batch Gmail API processing with 95% fewer API calls
  • Smart Filtering: Keyword-based prioritization and blacklist filtering
  • HTML Digest: Beautiful formatted emails with highlighted keywords
  • Duplicate Detection: Persistent database prevents repeat notifications
  • Fast Processing: Optimized algorithms for large email volumes
  • Safe Processing: Mark as read or delete processed emails
  • Automation Ready: Perfect for cron jobs and scheduled execution

πŸ”§ Quick Start

Prerequisites

  1. Gmail Labels: Set up an automatic filter to label your Google Scholar alert emails (Gmail Filter Guide)

  2. Gmail API Credentials:

    • Enable the Gmail API
    • Download credentials.json to the ./data/ folder

Installation

# Clone the repository
git clone https://github.com/daoxusheng/scholar-alerts-assistant.git
cd scholar-alerts-assistant

# Install dependencies
pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client

# Create data directory structure
mkdir -p data
# Place your credentials.json in the data folder

Configuration

Edit the configuration variables in main.py:

# Gmail labels to filter for scholar alerts
SCHOLAR_LABEL = [
    "UNREAD",
    "Academia",  # Replace with your actual label
]

# Email address to receive digest
DIGEST_ADDRESS = "your-email@gmail.com"  # Replace with your email

# Processing mode
DELETE_MODE = False  # True to delete, False to mark as read

First Run

python main.py

On first run, your browser will open for Gmail API authorization. This creates a token.json file for future runs.
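The authorization flow on first run can be pictured with the following sketch. The scopes and function below are assumptions based on what the README describes (read/modify alert mail, send the digest); the actual implementation lives in `utils/gmail.py` and may differ:

```python
import os

# Assumed scopes; check utils/gmail.py for the real list used by the tool.
SCOPES = [
    "https://www.googleapis.com/auth/gmail.modify",
    "https://www.googleapis.com/auth/gmail.send",
]

def get_credentials(data_path="./data/"):
    """Load cached OAuth credentials, refreshing or re-authorizing as needed."""
    # Google client libraries are only needed at authorization time.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    token_file = os.path.join(data_path, "token.json")
    creds = None
    if os.path.exists(token_file):
        creds = Credentials.from_authorized_user_file(token_file, SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())  # silent refresh; no browser needed
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                os.path.join(data_path, "credentials.json"), SCOPES
            )
            creds = flow.run_local_server(port=0)  # opens the browser once
        with open(token_file, "w") as fh:
            fh.write(creds.to_json())
    return creds
```

Subsequent runs reuse `data/token.json` and only re-open the browser if the token can no longer be refreshed.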

πŸ“ Project Structure

scholar-alerts-assistant/
β”œβ”€β”€ main.py                 # Main script and configuration
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ assistant.py        # Core processing logic
β”‚   β”œβ”€β”€ gmail.py           # Gmail API integration
β”‚   β”œβ”€β”€ formatter.py       # HTML email formatting
β”‚   └── parser.py          # Email content parsing
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ credentials.json   # Gmail API credentials (you provide)
β”‚   β”œβ”€β”€ token.json         # Auto-generated OAuth token
β”‚   β”œβ”€β”€ db.json           # Paper database (auto-generated)
β”‚   β”œβ”€β”€ keywords.json     # Priority keywords (optional)
β”‚   └── blacklist.json    # Filtered terms (optional)
└── README.md

βš™οΈ Advanced Configuration

Keyword Prioritization

Create data/keywords.json to prioritize papers containing specific terms:

[
  "machine learning",
  "neural networks",
  "deep learning",
  "artificial intelligence",
  "computer vision"
]

Papers matching these keywords will appear in the "Highly-related Papers" section of your digest.
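A sketch of what this prioritization step can look like. Case-insensitive matching on titles is an assumption here; the real logic in `utils/assistant.py` may differ, though a single compiled pattern fits the "smart caching" behavior the performance section describes:

```python
import json
import re

def load_keywords(path="./data/keywords.json"):
    """Read the priority keyword list from its JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def split_by_priority(papers, keywords):
    """Partition papers into (priority, other) by keyword hits in the title."""
    # One compiled, case-insensitive pattern instead of a loop per keyword.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    priority, other = [], []
    for paper in papers:
        (priority if pattern.search(paper["title"]) else other).append(paper)
    return priority, other
```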

Blacklist Filtering

Create data/blacklist.json to filter out unwanted papers:

[
  "retracted",
  "withdrawn",
  "spam term",
  "unwanted topic"
]

Papers containing blacklisted terms will be automatically excluded from your digest.
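Blacklist filtering can be sketched the same way, dropping any entry that contains a blocked term and reporting which terms fired (as in the example output's "Removed ... by blacklist" log line). Matching against both the title and a snippet field is an assumption:

```python
def apply_blacklist(papers, blacklist):
    """Drop papers containing any blacklisted term; report the terms that fired."""
    terms = [t.lower() for t in blacklist]
    kept, hits = [], set()
    for paper in papers:
        text = (paper["title"] + " " + paper.get("snippet", "")).lower()
        fired = {t for t in terms if t in text}
        if fired:
            hits |= fired  # excluded; remember why for the log output
        else:
            kept.append(paper)
    return kept, hits
```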

Configuration Options

| Variable | Description | Default |
|----------|-------------|---------|
| VERBOSE | Logging level (0=silent, 1=info) | 1 |
| DELETE_MODE | Delete processed emails instead of marking as read | False |
| USER_ID | Gmail user ID | "me" |
| SCHOLAR_LABEL | Gmail labels to filter | ["UNREAD", "Academia"] |
| DIGEST_ADDRESS | Destination email for digest | Required |
| DATA_PATH | Directory for data files | "./data/" |
| BATCH_SIZE | Gmail API batch processing size | 10 |

πŸš€ Performance Features

  • Configurable Batch Processing: Adjustable batch sizes (default: 10 messages per batch)
  • Rate Limiting Protection: Exponential backoff with automatic retry on API limits
  • Smart Caching: Compiled regex patterns and preloaded data
  • Optimized Filtering: Single-pass algorithms for large datasets
  • Early Termination: Skips processing when no emails found
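The retry behavior above can be pictured as the following sketch. The delay values, jitter, and retry count are illustrative assumptions, not the tool's actual settings:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the Gmail API."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential delay with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

With `BATCH_SIZE = 10` and this kind of backoff, transient 429 errors resolve themselves without losing any emails from the batch.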

πŸ€– Automation

Cron Job Setup

Run daily at 9 AM:

# Edit crontab
crontab -e

# Add this line
0 9 * * * cd /path/to/scholar-alerts-assistant && python main.py

Systemd Timer (Linux)

Create /etc/systemd/system/scholar-alerts.service:

[Unit]
Description=Scholar Alerts Processor

[Service]
Type=oneshot
User=yourusername
WorkingDirectory=/path/to/scholar-alerts-assistant
ExecStart=/usr/bin/python main.py

Create /etc/systemd/system/scholar-alerts.timer:

[Unit]
Description=Run Scholar Alerts daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable scholar-alerts.timer
sudo systemctl start scholar-alerts.timer

πŸ› Troubleshooting

Common Issues

"No module named 'google'"

pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client

"credentials.json not found"

  • Ensure credentials.json is in the ./data/ folder
  • Verify Gmail API is enabled in Google Cloud Console

"No emails found"

  • Check your Gmail label name matches SCHOLAR_LABEL
  • Verify Scholar alerts are being labeled correctly
  • Run with VERBOSE = 1 for debugging info

"Permission denied" errors

  • Delete data/token.json and re-authenticate
  • Check Gmail API scopes in Google Cloud Console

Gmail API rate limit errors (429)

  • Reduce BATCH_SIZE to 5 or lower in main.py
  • The tool automatically retries with exponential backoff
  • For persistent issues, try BATCH_SIZE = 1

Slow processing with large email volumes

  • Increase BATCH_SIZE to 20-50 if no rate limit issues
  • Default batch size (10) balances speed and reliability
  • For 100+ emails, expect 30-60 seconds processing time

Debug Mode

Enable verbose logging by setting:

VERBOSE = 1

This shows detailed processing information including:

  • Number of emails fetched
  • Duplicate detection results
  • Blacklist filtering stats
  • Keyword prioritization counts

πŸ“Š Example Output

[INFO] 45 Emails with {['UNREAD', 'Academia']} labels found.
[INFO] 38 in 45 unique entries found.
[INFO] Removed 3 entries by blacklist: {'retracted', 'spam'}
[INFO] 12 entries prioritized.
[INFO] Digest Message formatted.
[INFO] Digest Message sent.
[INFO] 45 messages marked as read.
[LOG] Updated 12 paper entries.

πŸ”’ Privacy & Security

  • All processing happens locally on your machine
  • Only Gmail API access required (read emails, send digest)
  • No data sent to external services
  • Credentials stored securely using OAuth2 standards

🀝 Contributing

Contributions welcome! Please feel free to submit pull requests or open issues for:

  • Performance improvements
  • New features
  • Bug fixes
  • Documentation enhancements

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.
