Falcon is a Spring Boot-based web scraping service that fetches, parses, and cleans HTML content from multiple URLs in parallel. It exposes an API for extracting structured data from web pages, with asynchronous request handling and built-in validation and cleaning capabilities.
- Parallel Web Scraping: Fetch multiple web pages concurrently using a configurable thread pool, so a slow response or timeout on one URL does not block the others.
- HTML Parsing: Extract specific elements from each page (e.g., links, images, headings) using CSS selectors specified in the request JSON.
- Data Cleaning: Clean parsed text with options like lowercasing, removing numbers, special characters, and trimming whitespace.
- Validation: Ensures only valid HTTP/HTTPS URLs and safe parsing parameters are processed.
- Extensible: Easily add new parsing or cleaning strategies.
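The fan-out pattern behind the parallel scraping feature can be sketched in plain Java (an illustrative stub, not Falcon's actual implementation; the class name and the stubbed `fetch` method are hypothetical, standing in for a real HTTP GET):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Sketch: submit one fetch task per URL to a fixed-size pool and collect
// the results. The fetch itself is stubbed; the real service performs an
// HTTP request per URL.
public final class ParallelFetchSketch {

    static String fetch(String url) {
        // Stub standing in for a real HTTP GET of `url`.
        return "<html>" + url + "</html>";
    }

    public static List<String> fetchAll(List<String> urls, int poolSize)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = urls.stream()
                    .map(u -> (Callable<String>) () -> fetch(u))
                    .collect(Collectors.toList());
            List<String> pages = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                try {
                    pages.add(f.get());
                } catch (ExecutionException e) {
                    // A failed URL does not abort the whole batch.
                }
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }
}
```

`invokeAll` blocks until every task finishes, so results come back in the same order as the input URLs while the fetches themselves run concurrently.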
- Java 21
- Spring Boot
- Spring Web
- Spring Async
- Jsoup (HTML parsing)
- Lombok
src/
  main/
    java/
      com/proxy/falcon/
        Proxy/     # Proxy service, controller, configs, DTOs
        Parser/    # HTML parsing service
        cleaner/   # Data cleaning service
        Exception/ # Custom exceptions and global exception handler
POST /proxy/api/scrap
Send a JSON object matching the ScrapingRequest DTO:
{
"urls": ["https://example.com", "https://another.com"],
"parsParams": {
"article.post": "h1.title",
"ul.products li.product": "a.name"
},
"cleanParams": ["lowercase", "remove_numbers", "strip_whitespace"]
}

- urls: Array of HTTP/HTTPS URLs to scrape.
- parsParams: A map of CSS selectors for parsing HTML, where the key is the container/parent selector and the value is the inner selector whose text will be extracted. Example: { "article.post": "h1.title" } extracts the text of h1.title inside each article.post.
- cleanParams: Array of cleaning operations (lowercase, remove_numbers, remove_special_chars, strip_whitespace).
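The container/inner selector semantics of parsParams can be sketched with Jsoup (a minimal illustration of the behavior described above, not Falcon's actual Parser code; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of the parsParams semantics: for each container selector (key),
// select all matching elements and extract the text of the first element
// matching the inner selector (value) within each container.
public final class SelectorSketch {

    public static List<String> extract(String html, Map<String, String> parsParams) {
        Document doc = Jsoup.parse(html);
        List<String> results = new ArrayList<>();
        parsParams.forEach((container, inner) -> {
            for (Element parent : doc.select(container)) {
                Element target = parent.selectFirst(inner);
                if (target != null) {
                    results.add(target.text());
                }
            }
        });
        return results;
    }
}
```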
Custom headers can be sent as standard HTTP headers on the request, or as a second JSON object in the request body if your client supports it.
Returns a JSON object matching the ScrapingResults DTO:
{
"results": [
"example link text",
"another link text",
"image alt text"
]
}

results: Array of cleaned, parsed strings extracted from the provided URLs.
- Returns 400 Bad Request for invalid URLs, parsing parameters, or cleaning parameters.
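The URL check that produces these 400 responses can be sketched in plain Java (an illustration of the validation rule, not Falcon's actual code; the class and method names are hypothetical):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch of the URL validation rule described above: accept only
// well-formed absolute http/https URLs that have a host.
public final class UrlValidator {

    public static boolean isValidScrapeUrl(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // malformed URLs are rejected
        }
    }
}
```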
- Validation: Incoming URLs and parsing parameters are validated for safety and correctness.
- Parallel Scraping: URLs are fetched in parallel using a thread pool.
- Parsing: HTML content is parsed using Jsoup and the provided CSS selectors.
- Cleaning: Parsed strings are cleaned according to the specified cleaning parameters.
- Response: The cleaned, parsed results are returned as a JSON array.
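The cleaning step can be approximated as a chain of string transforms applied in the order the client lists them (an illustrative sketch; Falcon's actual cleaner service may implement these operations differently):

```java
import java.util.List;

// Sketch of the four documented cleaning operations, applied in order.
public final class CleanerSketch {

    public static String clean(String text, List<String> cleanParams) {
        String result = text;
        for (String op : cleanParams) {
            result = switch (op) {
                case "lowercase"            -> result.toLowerCase();
                case "remove_numbers"       -> result.replaceAll("\\d", "");
                case "remove_special_chars" -> result.replaceAll("[^a-zA-Z0-9\\s]", "");
                case "strip_whitespace"     -> result.trim().replaceAll("\\s+", " ");
                default -> result; // unknown ops are rejected by validation upstream
            };
        }
        return result;
    }
}
```

Because the operations are applied in request order, listing strip_whitespace last collapses any gaps left behind by the removal steps.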
curl -X POST http://localhost:8080/proxy/api/scrap \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"parsParams": { "div.card": "h2.title" },
"cleanParams": ["lowercase", "strip_whitespace"]
}'

Thread pool settings and other configuration can be adjusted in ProxyConfigs.java or via application.properties in your environment.
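If the service relies on Spring Boot's auto-configured @Async task executor, the pool could be tuned with the standard task-execution properties shown below; whether Falcon honors these keys depends on whether ProxyConfigs.java defines its own executor bean, so treat the values as an example, not the project's shipped configuration:

```properties
# Standard Spring Boot async task-executor settings (only apply when no
# custom executor bean overrides the auto-configured one):
spring.task.execution.pool.core-size=8
spring.task.execution.pool.max-size=16
spring.task.execution.pool.queue-capacity=100
spring.task.execution.thread-name-prefix=falcon-scraper-
```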
For questions or contributions, please open an issue or pull request on GitHub or you can contact me on X @AFA_24a.