| Version | Supported |
|---|---|
| 1.0.x | ✅ |
| < 1.0 | ❌ |
PDF Content Extractor & Translator is designed with data sovereignty as a core principle:

```
┌─────────────────────────────────────────────────────────────┐
│                        YOUR MACHINE                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  PDF Files  │→ │ Application │→ │  Processed Output   │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                          ↓                                  │
│                   ┌─────────────┐                           │
│                   │  Local AI   │                           │
│                   │  (Ollama)   │                           │
│                   └─────────────┘                           │
└─────────────────────────────────────────────────────────────┘
                               ║
                               ╳  NO EXTERNAL CONNECTIONS
                               ║
┌─────────────────────────────────────────────────────────────┐
│                          INTERNET                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ Cloud APIs  │  │  Analytics  │  │ Third-party Services│  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

Key Guarantees:
- ✅ All PDF processing happens locally
- ✅ No telemetry or analytics collection
- ✅ No data transmitted to external servers
- ✅ No API keys required for any feature
- ✅ Works completely offline after initial setup
We take security seriously. If you discover a vulnerability, please follow responsible disclosure:
For sensitive security issues:
- Email: security@[project-domain].com (or maintainer email)
- Subject: `[SECURITY] Brief description`
- Include:
  - Description of the vulnerability
  - Steps to reproduce
  - Potential impact
  - Suggested fix (if any)
Response Timeline:
- Acknowledgment: Within 48 hours
- Initial assessment: Within 7 days
- Resolution target: Within 30 days (severity-dependent)
For non-sensitive issues:
- Open a GitHub Issue
- Label it with `security`
- Provide detailed reproduction steps
| Control | Implementation | Location |
|---|---|---|
| Filename Sanitization | `werkzeug.utils.secure_filename()` | app.py |
| File Type Validation | PDF magic bytes check | is_valid_file() |
| Path Traversal Prevention | Allowed directory whitelist | mcp_server.py |
| Size Limits | Configurable upload limits | Flask config |
| Numeric Input Validation | Type casting with error handling | API endpoints |
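
The file type validation row can be illustrated with a short sketch. The table names `is_valid_file()` as the location but does not show its body, so the helper below (`looks_like_pdf`, a hypothetical name) is only an assumption about how a magic-bytes check might work, not the project's implementation. It relies on the fact that a well-formed PDF begins with the `%PDF-` signature.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature at the start of a well-formed PDF


def looks_like_pdf(path: str) -> bool:
    """Hypothetical helper: True if the file starts with the PDF signature."""
    try:
        with Path(path).open("rb") as handle:
            header = handle.read(len(PDF_MAGIC))
    except OSError:
        # Missing or unreadable files are treated as invalid
        return False
    return header == PDF_MAGIC
```

Checking the content rather than the extension means a renamed non-PDF file is rejected even if its name ends in `.pdf`.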
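
```python
# Example: Path validation
from pathlib import Path

# Directories the server is allowed to read from
ALLOWED_DIRECTORIES = [
    Path.home() / "Documents",
    Path.home() / "Downloads",
    Path.home() / "Desktop",
    Path(__file__).parent.resolve(),
]


def validate_path(file_path: str) -> bool:
    """Return True if file_path resolves inside an allowed directory."""
    # resolve() collapses ".." segments and follows symlinks before the check
    resolved = Path(file_path).resolve()
    # Path.is_relative_to() requires Python 3.9+
    return any(
        resolved.is_relative_to(allowed.resolve())
        for allowed in ALLOWED_DIRECTORIES
    )
```

| Component | Binding | External Access |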
|---|---|---|
| Flask App | `localhost:5000` | ❌ None by default |
| Redis | `localhost:6379` | ❌ Local only |
| Ollama | `localhost:11434` | ❌ Local only |
| ChromaDB | Embedded | ❌ No network |
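
The bindings in the table keep every networked service on the loopback interface. A minimal sketch of how a startup check might confirm that a configured host is loopback-only follows. `is_loopback_host` is a hypothetical helper and not part of the project.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Hypothetical check: True if host refers only to the local machine."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name); not provably local
        return False
```

A host of `0.0.0.0` listens on every interface, so the check rejects it, while `127.0.0.1` and `::1` pass.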
| Data Type | Storage | Lifecycle |
|---|---|---|
| Uploaded PDFs | `uploads/` | User-managed |
| Processed files | `outputs/` | User-managed |
| AI embeddings | `chroma_db/` | Persistent |
| Logs | `logs/` | Auto-rotated |
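
The table lists logs as auto-rotated but does not say how. One common approach in Python is the standard library's `RotatingFileHandler`. The sketch below assumes that approach. The helper name, logger name, file name, size cap, and backup count are illustrative values, not the project's settings.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_rotating_logger(log_dir: str, max_bytes: int = 1_000_000,
                          backups: int = 5) -> logging.Logger:
    """Hypothetical helper: logger that rolls app.log over at max_bytes."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # When app.log would exceed max_bytes it is renamed to app.log.1,
    # older backups shift up, and anything past `backups` is deleted.
    handler = RotatingFileHandler(
        directory / "app.log", maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("pdf_app")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger
    return logger
```

With a size cap and a fixed backup count, the space used by `logs/` stays bounded without any manual cleanup.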

Recommendation: Implement regular cleanup:

```bash
# Clear files older than 7 days
find uploads/ outputs/ -type f -mtime +7 -delete
```

| Area | Status | Notes |
|---|---|---|
| Authentication | ❌ None | Designed for single-user local use |
| Authorization | ❌ None | All users have full access |
| Encryption at Rest | ❌ None | Files stored unencrypted |
| HTTPS | ❌ Not built-in | Use reverse proxy for production |
| Rate Limiting | ❌ None | Add via reverse proxy |

If exposing to a network:

- Use a Reverse Proxy:

  ```nginx
  server {
      listen 443 ssl;
      server_name pdf.example.com;

      ssl_certificate /path/to/cert.pem;
      ssl_certificate_key /path/to/key.pem;

      location / {
          proxy_pass http://localhost:5000;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
      }
  }
  ```

- Add Authentication:
  - OAuth2 Proxy
  - HTTP Basic Auth
  - Flask-Login

- Enable Rate Limiting:

  ```nginx
  # limit_req_zone belongs in the http block
  limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

  # The location block belongs inside the server block
  location /api/ {
      limit_req zone=api burst=20;
      proxy_pass http://localhost:5000;
  }
  ```

- Restrict File Uploads:

  ```python
  app.config['MAX_CONTENT_LENGTH'] = 50 * 1024 * 1024  # 50 MB
  ```

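
Of the authentication options listed above, HTTP Basic Auth is the simplest to illustrate. The following is a minimal sketch of a WSGI middleware that could wrap the Flask app when no reverse proxy handles authentication. The class name, realm, and single-user credential scheme are assumptions for illustration. Basic Auth sends credentials with every request, so it should only be used over HTTPS.

```python
import base64
import binascii
import hmac


class BasicAuthMiddleware:
    """Hypothetical WSGI wrapper that rejects requests lacking valid credentials."""

    def __init__(self, wsgi_app, username: str, password: str):
        self.wsgi_app = wsgi_app
        self.expected = f"{username}:{password}".encode("utf-8")

    def _authorized(self, environ) -> bool:
        header = environ.get("HTTP_AUTHORIZATION", "")
        scheme, _, token = header.partition(" ")
        if scheme.lower() != "basic":
            return False
        try:
            supplied = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            return False
        # Constant-time comparison avoids leaking how many bytes matched
        return hmac.compare_digest(supplied, self.expected)

    def __call__(self, environ, start_response):
        if self._authorized(environ):
            return self.wsgi_app(environ, start_response)
        start_response(
            "401 Unauthorized",
            [("WWW-Authenticate", 'Basic realm="pdf-app"'),
             ("Content-Type", "text/plain")],
        )
        return [b"Authentication required\n"]
```

With Flask this would typically be attached as `app.wsgi_app = BasicAuthMiddleware(app.wsgi_app, user, password)`, with the credentials read from the environment rather than from source code.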
| Regulation | Applicability | Status |
|---|---|---|
| GDPR | EU users | ✅ Compliant (no personal data leaves device) |
| HIPAA | Healthcare | ✅ Suitable (self-hosted, no PHI transmission) |
| CCPA | California | ✅ Compliant (no data collection) |
| SOC 2 | Enterprise |
- Data Controller: The user operating the application
- Data Processor: N/A (no third-party processing)
- Data Retention: User-controlled (no automatic deletion)
- Data Export: All files accessible in filesystem
We use the following tools to monitor dependencies:
- Dependabot: Automated vulnerability alerts
- pip-audit: Python package vulnerability scanning
- npm audit: JavaScript dependency scanning
| Severity | Response Time |
|---|---|
| Critical | Within 24 hours |
| High | Within 7 days |
| Medium | Next release cycle |
| Low | Best effort |

```bash
# Python dependencies
pip install pip-audit
pip-audit

# Check for known vulnerabilities
safety check -r requirements.txt
```

Before submitting a PR, ensure:
- No hardcoded secrets or credentials
- All user input is validated
- File paths use `secure_filename()`
- No new external network calls introduced
- Error messages don't leak sensitive information
- Logging doesn't include sensitive data
- New dependencies are reviewed for vulnerabilities
We currently do not have a formal bug bounty program. However, we deeply appreciate security researchers who help improve our security posture. Responsible disclosure will be acknowledged in our release notes and contributors list.
- Security Issues: Create private advisory
- General Issues: GitHub Issues
Last updated: 2025-12-20