| 1 | +# HTML Sanitization |
| 2 | + |
| 3 | +VibeReader implements comprehensive HTML sanitization to prevent XSS (Cross-Site Scripting) attacks from malicious feed content. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Feed content from RSS, Atom, and JSON feeds can contain HTML, which may include malicious scripts. VibeReader sanitizes all feed content at two layers: |
| 8 | + |
| 9 | +1. **Server-side sanitization** - HTMLPurifier sanitizes content before it is stored in the database |
| 10 | +2. **Client-side sanitization** - DOMPurify provides defense-in-depth when rendering content |
| 11 | + |
| 12 | +## Server-Side Sanitization (HTMLPurifier) |
| 13 | + |
| 14 | +### Implementation |
| 15 | + |
| 16 | +- **Library**: `ezyang/htmlpurifier` (v4.16+) |
| 17 | +- **Location**: `src/Utils/HtmlSanitizer.php` |
| 18 | +- **Integration**: Automatically applied in `FeedParser` when parsing feeds |
| 19 | + |
| 20 | +### What Gets Sanitized |
| 21 | + |
| 22 | +- **Feed titles** - Plain text (HTML entities escaped) |
| 23 | +- **Feed descriptions** - HTML sanitized |
| 24 | +- **Item titles** - Plain text (HTML entities escaped) |
| 25 | +- **Item content** - HTML sanitized (preserves formatting) |
| 26 | +- **Item summaries** - HTML sanitized |
| 27 | +- **Item authors** - Plain text (HTML entities escaped) |
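
The "plain text (HTML entities escaped)" treatment listed above can be illustrated with a minimal sketch. The helper name `escapeHtml` is hypothetical and this is not the project's `sanitizeText` implementation; it only shows what escaping the five HTML-significant characters looks like.

```javascript
// Hypothetical helper illustrating entity escaping for plain-text fields
// (titles, authors). Not the project's actual sanitizeText code.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(text) {
  // Replace each HTML-significant character with its entity.
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('<script>alert("XSS")</script> & more'));
// → &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt; &amp; more
```

Escaped this way, a title containing markup is displayed literally instead of being parsed as HTML.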
| 28 | + |
| 29 | +### Allowed HTML Tags |
| 30 | + |
| 31 | +The sanitizer allows common formatting tags used in feed content: |
| 32 | +- Text formatting: `p`, `br`, `strong`, `b`, `em`, `i`, `u` |
| 33 | +- Links: `a[href|title|target|rel]` |
| 34 | +- Lists: `ul`, `ol`, `li` |
| 35 | +- Code: `pre`, `code` |
| 36 | +- Images: `img[src|alt|width|height]` |
| 37 | +- Headings: `h1`, `h2`, `h3`, `h4`, `h5`, `h6` |
| 38 | +- Structure: `div`, `span[style]`, `blockquote` |
| 39 | +- Tables: `table`, `thead`, `tbody`, `tr`, `td`, `th` |
| 40 | + |
| 41 | +### Allowed Attributes |
| 42 | + |
| 43 | +- Links: `href`, `title`, `target`, `rel` |
| 44 | +- Images: `src`, `alt`, `width`, `height` |
| 45 | +- Styling: `style` (limited CSS properties) |
| 46 | +- Allowed CSS properties: `color`, `background-color`, `font-size`, `font-weight`, `font-style`, `text-align`, `text-decoration`, `margin`, `padding`, `border` |
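
To make the idea of "limited CSS properties" concrete, here is a rough sketch of property allow-listing for a `style` value. The property names come from the list above; the function `filterStyle` is hypothetical and deliberately naive (it splits on `;` and does not validate values), so it illustrates the concept rather than replacing what HTMLPurifier does.

```javascript
// Property names copied from the allow-list above; the function is hypothetical.
const ALLOWED_CSS_PROPERTIES = new Set([
  'color', 'background-color', 'font-size', 'font-weight', 'font-style',
  'text-align', 'text-decoration', 'margin', 'padding', 'border',
]);

// Keep only declarations whose property name is allow-listed.
// Naive on purpose: values are not validated, and ';' inside values is not handled.
function filterStyle(style) {
  return style
    .split(';')
    .map((decl) => decl.trim())
    .filter((decl) => {
      const colon = decl.indexOf(':');
      if (colon === -1) return false; // not a "property: value" pair
      const prop = decl.slice(0, colon).trim().toLowerCase();
      return ALLOWED_CSS_PROPERTIES.has(prop);
    })
    .join('; ');
}

console.log(filterStyle('color: red; position: fixed; font-weight: bold'));
// → color: red; font-weight: bold
```

A production sanitizer also has to validate each value, which is why the real filtering is left to the library.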
| 47 | + |
| 48 | +### Configuration |
| 49 | + |
| 50 | +Sanitization can be disabled via environment variable: |
| 51 | + |
| 52 | +```bash |
| 53 | +SANITIZATION_ENABLED=0 # Disable sanitization (not recommended) |
| 54 | +``` |
| 55 | + |
| 56 | +**Default**: Enabled (`SANITIZATION_ENABLED=1`) |
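
The "enabled unless explicitly disabled" default can be sketched as follows. This is an illustration of the logic only: the function name is hypothetical, the project's real parsing lives in its PHP configuration, and treating `0` as the only disabling value is an assumption.

```javascript
// Hypothetical reader for the flag. Assumption: only the literal value "0"
// disables sanitization; a missing or unrecognized value leaves it enabled.
function isSanitizationEnabled(env = process.env) {
  const raw = env.SANITIZATION_ENABLED;
  if (raw === undefined) return true; // default: enabled
  return String(raw).trim() !== '0';
}

console.log(isSanitizationEnabled({}));                              // → true
console.log(isSanitizationEnabled({ SANITIZATION_ENABLED: '0' }));   // → false
```

Defaulting to enabled means a typo or a missing variable cannot silently turn sanitization off.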
| 57 | + |
| 58 | +### Cache |
| 59 | + |
| 60 | +HTMLPurifier uses a cache directory at `var/htmlpurifier/` to improve performance. This directory is automatically created and is excluded from Git. |
| 61 | + |
| 62 | +## Client-Side Sanitization (DOMPurify) |
| 63 | + |
| 64 | +### Implementation |
| 65 | + |
| 66 | +- **Library**: DOMPurify v3.3.1 (via CDN) |
| 67 | +- **Location**: Loaded in `views/dashboard.php` |
| 68 | +- **Integration**: Applied in `assets/js/modules/items.js` when rendering item content |
| 69 | + |
| 70 | +### Defense in Depth |
| 71 | + |
| 72 | +Even though content is sanitized server-side, DOMPurify provides an additional layer of protection: |
| 73 | +- Protects against any content that might bypass server-side sanitization |
| 74 | +- Handles edge cases in browser rendering |
| 75 | +- Provides real-time sanitization when content is displayed |
| 76 | + |
| 77 | +### Configuration |
| 78 | + |
| 79 | +DOMPurify uses the same allowed tags and attributes as the server-side sanitizer for consistency. |
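
Because the two allow-lists are maintained in separate files, they can drift apart over time. One way to catch that is a small comparison helper like the sketch below. The function name `diffAllowLists` and the sample lists are hypothetical; in practice the inputs would be the tag lists taken from the server and client configurations.

```javascript
// Hypothetical drift check between two tag allow-lists.
function diffAllowLists(serverTags, clientTags) {
  const server = new Set(serverTags);
  const client = new Set(clientTags);
  return {
    onlyOnServer: [...server].filter((tag) => !client.has(tag)).sort(),
    onlyOnClient: [...client].filter((tag) => !server.has(tag)).sort(),
  };
}

// Illustrative excerpts, not the project's full configuration.
const drift = diffAllowLists(
  ['p', 'a', 'img', 'blockquote'],
  ['p', 'a', 'img', 'iframe'],
);
console.log(drift.onlyOnServer); // tags the client would strip
console.log(drift.onlyOnClient); // tags the client allows but the server does not
```

Both result arrays being empty indicates the lists agree.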
| 80 | + |
| 81 | +## Usage |
| 82 | + |
| 83 | +### Server-Side |
| 84 | + |
| 85 | +```php |
| 86 | +use PhpRss\Utils\HtmlSanitizer; |
| 87 | + |
| 88 | +// Sanitize HTML content (preserves formatting) |
| 89 | +$cleanHtml = HtmlSanitizer::sanitize($feedContent); |
| 90 | + |
| 91 | +// Sanitize plain text (escapes HTML entities) |
| 92 | +$cleanText = HtmlSanitizer::sanitizeText($feedTitle); |
| 93 | +``` |
| 94 | + |
| 95 | +### Client-Side |
| 96 | + |
| 97 | +```javascript |
| 98 | +// Sanitize HTML before setting innerHTML |
| 99 | +const sanitized = DOMPurify.sanitize(content, { |
| 100 | + ALLOWED_TAGS: ['p', 'br', 'strong', 'b', 'em', 'i', 'u', 'a', 'ul', 'ol', 'li', 'blockquote', 'pre', 'code', 'img', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'span', 'table', 'thead', 'tbody', 'tr', 'td', 'th'], |
| 101 | + ALLOWED_ATTR: ['href', 'title', 'target', 'src', 'alt', 'width', 'height', 'style', 'rel'], |
| 102 | + ALLOW_DATA_ATTR: false |
| 103 | +}); |
| 104 | + |
| 105 | +element.innerHTML = sanitized; |
| 106 | +``` |
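
Since DOMPurify is loaded from a CDN, the script can fail to load. A possible way to handle that case is to fall back to inert text rather than assigning unsanitized markup. The sketch below assumes a hypothetical wrapper named `renderItemContent`; it is not the code in `assets/js/modules/items.js`, and the configuration simply repeats the allow-lists shown above.

```javascript
// Same allow-lists as the example above.
const SANITIZE_CONFIG = {
  ALLOWED_TAGS: ['p', 'br', 'strong', 'b', 'em', 'i', 'u', 'a', 'ul', 'ol', 'li', 'blockquote', 'pre', 'code', 'img', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'span', 'table', 'thead', 'tbody', 'tr', 'td', 'th'],
  ALLOWED_ATTR: ['href', 'title', 'target', 'src', 'alt', 'width', 'height', 'style', 'rel'],
  ALLOW_DATA_ATTR: false,
};

// Hypothetical wrapper: returns true if the content was rendered as sanitized HTML.
function renderItemContent(element, content, purifier = globalThis.DOMPurify) {
  if (purifier && typeof purifier.sanitize === 'function') {
    element.innerHTML = purifier.sanitize(content, SANITIZE_CONFIG);
    return true;
  }
  // Fail closed: without the sanitizer, show the markup as plain text.
  element.textContent = content;
  return false;
}
```

With this shape, a missing library degrades formatting instead of weakening the client-side layer.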
| 107 | + |
| 108 | +## Security Benefits |
| 109 | + |
| 110 | +1. **Prevents Stored XSS** - Malicious scripts in feed content are removed before storage |
| 111 | +2. **Safe Delivery** - Content is already sanitized by the time it is sent to the browser |
| 112 | +3. **Defense in Depth** - Multiple layers of sanitization (server + client) |
| 113 | +4. **Preserves Formatting** - Legitimate HTML formatting is maintained |
| 114 | +5. **Configurable** - Can be disabled if needed (though not recommended) |
| 115 | + |
| 116 | +## Performance |
| 117 | + |
| 118 | +- **HTMLPurifier**: Caches its definitions in `var/htmlpurifier/` (excluded from Git) to improve performance on repeated sanitization |
| 119 | +- **DOMPurify**: Lightweight client-side library with minimal performance impact |
| 121 | + |
| 122 | +## Troubleshooting |
| 123 | + |
| 124 | +### Content Appears Stripped |
| 125 | + |
| 126 | +If legitimate content is being removed: |
| 127 | +1. Check HTMLPurifier logs for warnings |
| 128 | +2. Verify the content uses allowed tags/attributes |
| 129 | +3. Review `src/Utils/HtmlSanitizer.php` configuration |
| 130 | + |
| 131 | +### Sanitization Not Working |
| 132 | + |
| 133 | +1. Verify `SANITIZATION_ENABLED=1` in environment |
| 134 | +2. Check that HTMLPurifier is installed: `composer show ezyang/htmlpurifier` |
| 135 | +3. Verify DOMPurify is loaded (check browser console) |
| 136 | +4. Check that `var/htmlpurifier/` directory is writable |
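
For step 3, a quick check that can be pasted into the browser console is sketched below. `DOMPurify.version` and `DOMPurify.isSupported` are properties exposed by the library; the function name `describeDomPurify` is hypothetical.

```javascript
// Hypothetical console helper: reports whether DOMPurify is available.
function describeDomPurify(scope = globalThis) {
  const purifier = scope.DOMPurify;
  if (!purifier) return 'DOMPurify is not loaded';
  if (!purifier.isSupported) return 'DOMPurify is loaded but not supported in this environment';
  return `DOMPurify ${purifier.version} is ready`;
}

console.log(describeDomPurify());
```

If the result is "not loaded", check the network tab for a blocked or failed CDN request.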
| 137 | + |
| 138 | +### Disabling Sanitization |
| 139 | + |
| 140 | +**Not Recommended** - Only disable for debugging: |
| 141 | + |
| 142 | +```bash |
| 143 | +SANITIZATION_ENABLED=0 |
| 144 | +``` |
| 145 | + |
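| 146 | +This bypasses server-side sanitization only; client-side DOMPurify still sanitizes content at render time. Because sanitization runs before storage, items fetched while it is disabled may be stored unsanitized. |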
| 147 | + |
| 148 | +## Files Modified |
| 149 | + |
| 150 | +- `src/Utils/HtmlSanitizer.php` (new) - HTML sanitization utility |
| 151 | +- `src/FeedParser.php` - Integrated sanitization into all parsing methods |
| 152 | +- `src/Config.php` - Added sanitization configuration |
| 153 | +- `assets/js/modules/items.js` - Added DOMPurify client-side sanitization |
| 154 | +- `views/dashboard.php` - Added DOMPurify CDN script |
| 155 | +- `composer.json` - Added HTMLPurifier dependency |
| 156 | +- `ENV_CONFIGURATION.md` - Added sanitization configuration documentation |
| 157 | +- `.gitignore` - Added HTMLPurifier cache directory |
| 158 | + |
| 159 | +## Testing |
| 160 | + |
| 161 | +To test sanitization: |
| 162 | + |
| 163 | +1. **Test with malicious content**: |
| 164 | + ```php |
| 165 | + $malicious = '<script>alert("XSS")</script><p>Safe content</p>'; |
| 166 | + $sanitized = HtmlSanitizer::sanitize($malicious); |
| 167 | + // Result: '<p>Safe content</p>' (script removed) |
| 168 | + ``` |
| 169 | + |
| 170 | +2. **Test with legitimate HTML**: |
| 171 | + ```php |
| 172 | + $legitimate = '<p>This is <strong>bold</strong> text with a <a href="https://example.com">link</a>.</p>'; |
| 173 | + $sanitized = HtmlSanitizer::sanitize($legitimate); |
| 174 | + // Result: Same content (preserved) |
| 175 | + ``` |
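
The client-side layer can be smoke-tested in a similar way. The sketch below is a hypothetical harness, not part of the project: it takes any sanitize function, runs a few common XSS payloads through it, and returns the payloads whose output still matches an obviously unsafe pattern. The regex checks are a coarse smoke test, not proof that the output is safe.

```javascript
// Hypothetical payloads covering script tags, event handlers, and javascript: URLs.
const PAYLOADS = [
  '<script>alert("XSS")</script><p>Safe content</p>',
  '<img src="x" onerror="alert(1)">',
  '<a href="javascript:alert(1)">link</a>',
];

// Coarse patterns that should never appear in sanitized output.
const UNSAFE_PATTERNS = [/<script\b/i, /\son\w+\s*=/i, /javascript:/i];

// Returns the payloads whose sanitized output still looks unsafe.
function findUnsafeOutputs(sanitize, payloads = PAYLOADS) {
  return payloads.filter((payload) => {
    const output = sanitize(payload);
    return UNSAFE_PATTERNS.some((pattern) => pattern.test(output));
  });
}
```

In the browser console this could be called as `findUnsafeOutputs((html) => DOMPurify.sanitize(html))`; an empty array means none of the sample payloads survived.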
| 176 | + |
| 177 | +## References |
| 178 | + |
| 179 | +- [HTMLPurifier Documentation](https://htmlpurifier.org/) |
| 180 | +- [DOMPurify Documentation](https://github.com/cure53/DOMPurify) |
| 181 | +- [OWASP XSS Prevention](https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html) |