Falcon Web Scraping

Falcon is a Spring Boot-based web scraping service that fetches, parses, and cleans HTML content from multiple URLs in parallel. It handles requests asynchronously and provides an API for extracting structured data from web pages, with built-in validation and cleaning capabilities.

Features

Parallel Web Scraping: Fetch multiple web pages concurrently using a configurable thread pool, so that slow responses or timeouts on one page do not block the others.
HTML Parsing: Extract specific elements (e.g., links, images, headings) from the HTML using CSS selectors supplied in the request JSON.
Data Cleaning: Clean parsed text with options such as lowercasing, removing numbers, removing special characters, and trimming whitespace.
Validation: Ensures only valid HTTP/HTTPS URLs and safe parsing parameters are processed.
Extensible: Easily add new parsing or cleaning strategies.

Technologies Used

Java 21
Spring Boot
Spring Web
Spring Async
Jsoup (HTML parsing)
Lombok

Project Structure

src/
  main/
    java/
      com/proxy/falcon/
        Proxy/           # Proxy service, controller, configs, DTOs
        Parser/          # HTML parsing service
        cleaner/         # Data cleaning service
        Exception/       # Custom exceptions and global exception handler

API Endpoints

1. Scraping Endpoint

POST /proxy/api/scrap

Request Body

Send a JSON object matching the ScrapingRequest DTO:

{
  "urls": ["https://example.com", "https://another.com"],
  "parsParams": {
      "article.post": "h1.title",
      "ul.products li.product": "a.name"
  },
  "cleanParams": ["lowercase", "remove_numbers", "strip_whitespace"]
}

urls: Array of HTTP/HTTPS URLs to scrape.
parsParams: A map of CSS selectors for parsing HTML, where each key is the container (parent) selector and each value is the inner selector whose text will be extracted. Example: { "article.post": "h1.title" } extracts the text of h1.title inside each article.post.
cleanParams: Array of cleaning operations (lowercase, remove_numbers, remove_special_chars, strip_whitespace).
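
To make the cleanParams values more concrete, here is a minimal sketch of how such a cleaning pipeline could look. It is not Falcon's actual cleaner: the class name CleanerSketch, the exact meaning given to each operation (for example, that strip_whitespace only trims leading and trailing whitespace), and the choice to apply operations in the order listed are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not Falcon's actual cleaner implementation.
class CleanerSketch {

    // Assumed meaning of each cleanParams value.
    private static final Map<String, UnaryOperator<String>> OPERATIONS = Map.of(
            "lowercase", s -> s.toLowerCase(Locale.ROOT),
            "remove_numbers", s -> s.replaceAll("\\d", ""),
            // Keeps letters, digits, and whitespace; drops everything else.
            "remove_special_chars", s -> s.replaceAll("[^\\p{L}\\p{N}\\s]", ""),
            "strip_whitespace", String::strip);

    // Applies the requested operations in the order they are listed.
    static String clean(String text, List<String> cleanParams) {
        String result = text;
        for (String param : cleanParams) {
            UnaryOperator<String> operation = OPERATIONS.get(param);
            if (operation == null) {
                // Stand-in for the service rejecting an invalid cleaning parameter.
                throw new IllegalArgumentException("Unknown cleaning parameter: " + param);
            }
            result = operation.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clean("  Top 10 Deals!  ",
                List.of("lowercase", "remove_numbers", "remove_special_chars", "strip_whitespace")));
    }
}
```

Under these assumptions the order of operations matters: trimming first and removing characters afterwards can leave new leading or trailing whitespace behind.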

Optional Headers

You can send custom headers as a second JSON object in the request body (if supported by your client), or as HTTP headers.

Response

Returns a JSON object matching the ScrapingResults DTO:

{
  "results": [
    "example link text",
    "another link text",
    "image alt text"
  ]
}

results: Array of cleaned, parsed strings extracted from the provided URLs.

Error Handling

Returns 400 Bad Request for invalid URLs, parsing parameters, or cleaning parameters.
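
As an illustration of the kind of check that could sit behind a 400 response for invalid URLs, the sketch below accepts only absolute http or https URLs with a host. It uses java.net.URI from the JDK; the class and method names are hypothetical, and Falcon's real validation rules may be stricter or different.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; not Falcon's actual validator.
class UrlValidationSketch {

    // Returns true only for syntactically valid http/https URLs that have a host.
    static boolean isHttpUrl(String candidate) {
        if (candidate == null || candidate.isBlank()) {
            return false;
        }
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            boolean httpScheme = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Malformed input, such as a string containing spaces.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("https://example.com"));
        System.out.println(isHttpUrl("ftp://example.com"));
    }
}
```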

How It Works

Validation: Incoming URLs and parsing parameters are validated for safety and correctness.
Parallel Scraping: URLs are fetched in parallel using a thread pool.
Parsing: HTML content is parsed using Jsoup and the provided CSS selectors.
Cleaning: Parsed strings are cleaned according to the specified cleaning parameters.
Response: The cleaned, parsed results are returned in the results array of the JSON response.
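
The parallel scraping step can be sketched with plain JDK concurrency. This is an assumption-laden illustration, not Falcon's code: the service itself is built on Spring Async, whereas the sketch uses an ExecutorService with CompletableFuture, and the names ParallelFetchSketch, fetchAll, and the fetcher parameter are hypothetical. The fetcher is passed in as a function so the example runs without network access.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; not Falcon's actual scraping service.
class ParallelFetchSketch {

    // Fetches every URL on a fixed-size pool and returns the pages in input order.
    // A URL that fails or exceeds the timeout yields an empty string instead of
    // failing the whole batch.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher,
                                 int poolSize, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit all tasks first so they run concurrently.
            List<CompletableFuture<String>> futures = urls.stream()
                    .map(url -> CompletableFuture
                            .supplyAsync(() -> fetcher.apply(url), pool)
                            .orTimeout(timeoutSeconds, TimeUnit.SECONDS)
                            .exceptionally(error -> ""))
                    .collect(Collectors.toList());
            // Only then wait for the results.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            // Interrupts any task that is still running after a timeout.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for a real HTTP call.
        List<String> pages = fetchAll(List.of("https://example.com", "https://another.com"),
                url -> "<html>stub for " + url + "</html>", 4, 10);
        pages.forEach(System.out::println);
    }
}
```

Collecting the futures into a list before joining is what keeps the fetches concurrent; joining inside the first stream would wait for each page before starting the next.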

Example Usage

cURL

curl -X POST http://localhost:8080/proxy/api/scrap \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "parsParams": { "div.card": "h2.title" },
    "cleanParams": ["lowercase", "strip_whitespace"]
  }'
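
Java

For callers on the JVM, the same request can be sketched with the JDK's built-in java.net.http client. The class name ScrapRequestSketch and its methods are hypothetical, and the base URL http://localhost:8080 is taken from the cURL example above; adjust it to wherever the service is running.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch mirroring the cURL example.
class ScrapRequestSketch {

    // Same JSON payload as the cURL example.
    static final String BODY = """
            {
              "urls": ["https://example.com"],
              "parsParams": { "div.card": "h2.title" },
              "cleanParams": ["lowercase", "strip_whitespace"]
            }
            """;

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/proxy/api/scrap"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    // Requires a running Falcon instance at baseUrl.
    static String send(HttpClient client, String baseUrl) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling ScrapRequestSketch.send(HttpClient.newHttpClient(), "http://localhost:8080") would then return the response JSON as a string.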

Configuration

Thread pool settings and other configuration can be adjusted in ProxyConfigs.java or via application.properties in your environment.

Contact

For questions or contributions, please open an issue or pull request on GitHub, or contact me on X at @AFA_24a.
