A web scraping workflow that uses LangGraph for orchestration, FireCrawl for intelligent web scraping, and LangChain with OpenAI for quality control and content validation.
This project implements an agentic scraping pipeline that:
- Maps website URLs using FireCrawl's site mapping to discover article URLs
- Crawls and scrapes content from discovered URLs using FireCrawl
- Validates content quality using LLM-based scoring (0-10 scale)
- Filters content based on minimum word count and quality scores
- Saves results to structured JSON/JSONL files with metadata
The workflow is orchestrated using LangGraph, which provides a stateful, graph-based approach to managing the scraping pipeline.
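Based on the configuration example later in this README, the shared state is a `ScrapeState` TypedDict (defined in `src/objects.py`). A minimal sketch of what that definition might look like, using only the fields shown in this README (the real definition may carry additional fields):

```python
from typing import List, TypedDict

class ScrapeState(TypedDict, total=False):
    """State carried between LangGraph nodes (sketch, not the real definition)."""
    base_url: str       # starting URL for crawling
    queue: List[str]    # URLs waiting to be scraped
    urls_path: str      # where discovered URLs are written (JSON)
    out_path: str       # where scraped content is written (JSONL)
    max_posts: int      # cap on pages to scrape
    min_words: int      # minimum word count to keep a document
```

Because it is a TypedDict, the state stays a plain dict at runtime, which is what LangGraph passes between nodes.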
- 🗺️ Site Mapping: Automatically discovers article URLs from websites using FireCrawl
- 🔍 Intelligent Scraping: Uses FireCrawl to crawl and extract content from discovered URLs
- 🤖 LLM Quality Control: Validates and scores content quality using OpenAI
- 📊 Structured Output: Saves URLs to JSON and content to JSONL with metadata
- 🔄 Stateful Workflow: LangGraph manages the entire pipeline state
- 📝 Comprehensive Logging: Detailed logs for workflow execution and LLM interactions
- Python 3.11 or higher
- FireCrawl API key
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd agentic_project
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file in the project root:

   ```
   FIRECRAWL_API_KEY=your_firecrawl_api_key_here
   OPENAI_API_KEY=your_openai_api_key_here
   ```
Note: Get your FireCrawl API key from firecrawl.dev and your OpenAI API key from platform.openai.com.
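Before running the pipeline, you can sanity-check that both keys are visible to the process. A small helper sketch (`missing_keys` is illustrative, not part of the project):

```python
import os

REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=None):
    """Return the required API keys that are absent or empty."""
    if env is None:
        env = os.environ
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```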
```
agentic_project/
├── run.py                  # Main entry point and LangGraph workflow
├── scrape_firecrawl.py     # FireCrawl mapping and crawling nodes
├── src/
│   ├── objects.py          # ScrapeState TypedDict definition
│   └── utils.py            # Utility functions (URL cleaning, etc.)
├── data/                   # Output directory
│   ├── urls/               # Saved URL lists (JSON format)
│   └── text/               # Scraped content (JSONL format)
├── logs/                   # Log files
│   ├── langgraph.log       # LangGraph workflow logs
│   └── scrape.log          # General scraping logs
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables (not in git)
```
1. Edit the base URL in `run.py` (line ~126):

   ```python
   base = "https://www.growth-memo.com"  # change if needed
   ```

2. Configure scraping parameters in the state initialization:

   ```python
   state: ScrapeState = {
       "base_url": base,
       "queue": [base],
       "urls_path": f"data/urls/urls_{timestamp}_{site}.json",
       "out_path": f"data/text/scraped_{timestamp}_{site}.jsonl",
       "max_posts": 3,    # Number of pages to scrape
       "min_words": 400,  # Minimum word count to save
       # ... other settings
   }
   ```

3. Run the scraper:

   ```bash
   python run.py
   ```
The pipeline generates two types of output files:
- URLs file (`data/urls/urls_<timestamp>_<domain>.json`): list of discovered article URLs
- Content file (`data/text/scraped_<timestamp>_<domain>.jsonl`): scraped content with metadata
Each line in the JSONL file contains:
```json
{
  "url": "https://example.com/page",
  "base_url": "https://example.com",
  "base_url_hash": "abc12345",
  "url_hash": "def67890",
  "word_count": 1234,
  "text": "Full article text...",
  "content_score": 8,
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "source": "https://example.com/page",
    ...
  }
}
```

The LangGraph workflow consists of three main nodes:
1. `map_urls` (Site Mapping):
   - Maps the website using FireCrawl to discover all URLs
   - Filters for article URLs (containing `/p/` in the path)
   - Limits to `max_posts` URLs
   - Saves URLs to `data/urls/urls_<timestamp>_<domain>.json`

2. `crawl_urls` (Content Scraping):
   - Reads URLs from the saved JSON file
   - Uses FireCrawl to scrape content from each URL
   - Extracts content in markdown format
   - Stores Documents in state

3. `validate_and_save` (Quality Control):
   - Processes each Document
   - Filters by minimum word count
   - Validates content quality using LLM (0-10 score)
   - Saves valid content to `data/text/scraped_<timestamp>_<domain>.jsonl`
```
map_urls → crawl_urls → validate_and_save → END
```
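This linear flow can be mimicked in plain Python without the LangGraph dependency. The node bodies below are stand-ins (no real FireCrawl or LLM calls, and the `discovered` key is invented for the sketch); only the pattern of each node returning a partial state update, which LangGraph merges into the shared state, matches the actual pipeline:

```python
def map_urls(state):
    # stand-in: the real node asks FireCrawl to map the site
    urls = [u for u in state["discovered"] if "/p/" in u][: state["max_posts"]]
    return {"queue": urls}

def crawl_urls(state):
    # stand-in: the real node scrapes each URL via FireCrawl
    return {"docs": [{"url": u, "text": f"content of {u}"} for u in state["queue"]]}

def validate_and_save(state):
    # stand-in: the real node also asks the LLM for a 0-10 quality score
    kept = [d for d in state["docs"] if len(d["text"].split()) >= state["min_words"]]
    return {"saved": kept}

def run_pipeline(state):
    # run the three nodes in order, merging each partial update into the state
    for node in (map_urls, crawl_urls, validate_and_save):
        state.update(node(state))
    return state
```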
| Variable | Description | Required |
|---|---|---|
| `FIRECRAWL_API_KEY` | Your FireCrawl API key | Yes |
| `OPENAI_API_KEY` | Your OpenAI API key | Yes |
Edit these in `run.py`:

- `max_posts`: Maximum number of pages to scrape (default: 3)
- `min_words`: Minimum word count to save a document (default: 400)
- `base_url`: Starting URL for crawling
The LLM model can be changed in `run.py` (line ~24):

```python
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
```

- `firecrawl`: Web scraping and crawling
- `langchain[openai]`: LLM integration and document handling
- `langchain-community`: Community integrations (FireCrawlLoader)
- `langgraph`: Workflow orchestration
- `python-dotenv`: Environment variable management
The project generates two log files:
- `logs/langgraph.log`: Detailed LangGraph workflow execution, including:
  - Node start/end events
  - LLM prompts and full responses
  - Routing decisions
  - State transitions
- `logs/scrape.log`: General scraping operations and errors
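Both files look like ordinary Python `logging` output; a generic sketch of how such per-file loggers can be set up (the `make_logger` helper is illustrative, not the project's actual code):

```python
import logging

def make_logger(name: str, path: str) -> logging.Logger:
    """Create a named logger that writes timestamped lines to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Separate named loggers keep workflow events and general scraping errors in their own files without interfering with each other.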
Content is scored on a 0-10 scale:
- 8-10: Real article content, substantial text, informative
- 5-7: Some content but may be partial or mixed with navigation
- 0-4: Navigation, ads, paywall, or very little content
Documents with scores below the threshold or word counts below `min_words` are filtered out.
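Downstream consumers can reapply the same thresholds when reading the JSONL output. A small reader sketch using only the fields documented above (`load_quality_docs` is illustrative, not part of the project):

```python
import json

def load_quality_docs(path, min_score=8, min_words=400):
    """Read the scraper's JSONL output, keeping only docs above both thresholds."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["content_score"] >= min_score and rec["word_count"] >= min_words:
                docs.append(rec)
    return docs
```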