Agentic Web Scraping with LangGraph

A web scraping workflow that uses LangGraph for orchestration, FireCrawl for intelligent web scraping, and LangChain with OpenAI for quality control and content validation.

Overview

This project implements an agentic scraping pipeline that:

  1. Maps website URLs using FireCrawl's site mapping to discover article URLs
  2. Crawls and scrapes content from discovered URLs using FireCrawl
  3. Validates content quality using LLM-based scoring (0-10 scale)
  4. Filters content based on minimum word count and quality scores
  5. Saves results to structured JSON/JSONL files with metadata

The workflow is orchestrated using LangGraph, which provides a stateful, graph-based approach to managing the scraping pipeline.
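The steps above can be illustrated with a small, stdlib-only sketch of the stateful, node-by-node idea. In the real project these are LangGraph nodes backed by FireCrawl and an LLM; here the node bodies are stubbed purely for illustration, and all field names in the state dict are assumptions:

```python
# Illustrative sketch of the stateful pipeline; the real nodes call
# FireCrawl and an LLM. Names mirror the workflow, logic is stubbed.

def map_urls(state: dict) -> dict:
    # Discover article URLs (stub: pretend the site map is already known).
    discovered = [u for u in state["site_map"] if "/p/" in u]
    state["urls"] = discovered[: state["max_posts"]]
    return state

def crawl_urls(state: dict) -> dict:
    # Scrape each URL (stub: look content up in a local dict).
    state["docs"] = [{"url": u, "text": state["pages"][u]} for u in state["urls"]]
    return state

def validate_and_save(state: dict) -> dict:
    # Keep documents that meet the minimum word count.
    state["saved"] = [d for d in state["docs"]
                      if len(d["text"].split()) >= state["min_words"]]
    return state

def run_pipeline(state: dict) -> dict:
    # LangGraph would manage this edge order: map → crawl → validate → END.
    for node in (map_urls, crawl_urls, validate_and_save):
        state = node(state)
    return state
```

Each node receives the full state and returns it updated, which is the same contract LangGraph nodes follow.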

Features

  • 🗺️ Site Mapping: Automatically discovers article URLs from websites using FireCrawl
  • 🔍 Intelligent Scraping: Uses FireCrawl to crawl and extract content from discovered URLs
  • 🤖 LLM Quality Control: Validates and scores content quality using OpenAI
  • 📊 Structured Output: Saves URLs to JSON and content to JSONL with metadata
  • 🔄 Stateful Workflow: LangGraph manages the entire pipeline state
  • 📝 Comprehensive Logging: Detailed logs for workflow execution and LLM interactions

Installation

Prerequisites

  • Python 3.11 or higher
  • FireCrawl API key
  • OpenAI API key

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd agentic_project
  2. Create a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Create a .env file in the project root:

    FIRECRAWL_API_KEY=your_firecrawl_api_key_here
    OPENAI_API_KEY=your_openai_api_key_here

    Note: get your FireCrawl API key from firecrawl.dev and your OpenAI API key from platform.openai.com.
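After creating the .env file, it can help to fail fast when a key is missing. The project loads the file with python-dotenv; a minimal stdlib-only guard might look like this (the key names come from this README, the helper itself is hypothetical):

```python
import os

REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=os.environ):
    # Return the required API keys that are absent or empty.
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Typical startup guard:
# if missing_keys():
#     raise SystemExit(f"Missing keys in .env: {missing_keys()}")
```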

Project Structure

agentic_project/
├── run.py                 # Main entry point and LangGraph workflow
├── scrape_firecrawl.py    # FireCrawl mapping and crawling nodes
├── src/
│   ├── objects.py         # ScrapeState TypedDict definition
│   └── utils.py           # Utility functions (URL cleaning, etc.)
├── data/                  # Output directory
│   ├── urls/              # Saved URL lists (JSON format)
│   └── text/              # Scraped content (JSONL format)
├── logs/                  # Log files
│   ├── langgraph.log      # LangGraph workflow logs
│   └── scrape.log         # General scraping logs
├── requirements.txt       # Python dependencies
└── .env                   # Environment variables (not in git)

Usage

Basic Usage

  1. Edit the base URL in run.py (line ~126):

    base = "https://www.growth-memo.com"  # change if needed
  2. Configure scraping parameters in the state initialization:

    state: ScrapeState = {
        "base_url": base,
        "queue": [base],
        "urls_path": f"data/urls/urls_{timestamp}_{site}.json",
        "out_path": f"data/text/scraped_{timestamp}_{site}.jsonl",
        "max_posts": 3,        # Number of pages to scrape
        "min_words": 400,      # Minimum word count to save
        # ... other settings
    }
  3. Run the scraper:

    python run.py
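The state initialized above is a ScrapeState TypedDict. A sketch of what that definition could look like, with fields inferred from this README (the authoritative definition lives in src/objects.py and may include more fields):

```python
from typing import TypedDict

class ScrapeState(TypedDict, total=False):
    # Field names inferred from this README; the authoritative
    # definition lives in src/objects.py and may differ.
    base_url: str      # starting URL for crawling
    queue: list        # URLs waiting to be processed
    urls_path: str     # where discovered URLs are written (JSON)
    out_path: str      # where scraped content is written (JSONL)
    max_posts: int     # number of pages to scrape
    min_words: int     # minimum word count to save a document
```

At runtime a TypedDict is just a plain dict, so the state can be built with ordinary dict literals as shown in the usage example.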

Output

The pipeline generates two types of output files:

  1. URLs file: data/urls/urls_<timestamp>_<domain>.json - List of discovered article URLs
  2. Content file: data/text/scraped_<timestamp>_<domain>.jsonl - Scraped content with metadata

Each line in the JSONL file contains:

{
  "url": "https://example.com/page",
  "base_url": "https://example.com",
  "base_url_hash": "abc12345",
  "url_hash": "def67890",
  "word_count": 1234,
  "text": "Full article text...",
  "content_score": 8,
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "source": "https://example.com/page",
    ...
  }
}
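The JSONL output can be read back with the stdlib alone. For example, a small helper (hypothetical, not part of the project) that loads records at or above a given score, using the field names from the record shown above:

```python
import json

def load_records(lines, min_score=0):
    # Parse JSONL lines, skipping blanks, keeping records whose
    # content_score meets the threshold.
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec.get("content_score", 0) >= min_score:
            records.append(rec)
    return records

# Usage (substitute a real output file path):
# with open("data/text/scraped_<timestamp>_<domain>.jsonl") as f:
#     articles = load_records(f, min_score=8)
```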

Workflow

The LangGraph workflow consists of three main nodes:

  1. map_urls (Site Mapping):

    • Maps the website using FireCrawl to discover all URLs
    • Filters for article URLs (containing /p/ in path)
    • Limits to max_posts URLs
    • Saves URLs to data/urls/urls_<timestamp>_<domain>.json
  2. crawl_urls (Content Scraping):

    • Reads URLs from the saved JSON file
    • Uses FireCrawl to scrape content from each URL
    • Extracts content in markdown format
    • Stores Documents in state
  3. validate_and_save (Quality Control):

    • Processes each Document
    • Filters by minimum word count
    • Validates content quality using LLM (0-10 score)
    • Saves valid content to data/text/scraped_<timestamp>_<domain>.jsonl

map_urls → crawl_urls → validate_and_save → END
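The URL-selection rule in map_urls (keep URLs whose path contains /p/, preserve discovery order, cap at max_posts) can be sketched as a pure function (the function name is illustrative, not from the codebase):

```python
from urllib.parse import urlparse

def select_article_urls(urls, max_posts):
    # Keep URLs whose *path* contains the /p/ article marker,
    # preserving order, capped at max_posts.
    articles = [u for u in urls if "/p/" in urlparse(u).path]
    return articles[:max_posts]
```

Parsing the URL first ensures a `/p/` in a query string does not count as an article path.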

Configuration

Environment Variables (.env)

Variable            Description               Required
FIRECRAWL_API_KEY   Your FireCrawl API key    Yes
OPENAI_API_KEY      Your OpenAI API key       Yes

Scraping Parameters

Edit these in run.py:

  • max_posts: Maximum number of pages to scrape (default: 3)
  • min_words: Minimum word count to save a document (default: 400)
  • base_url: Starting URL for crawling

LLM Configuration

The LLM model can be changed in run.py (line ~24):

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

Dependencies

  • firecrawl: Web scraping and crawling
  • langchain[openai]: LLM integration and document handling
  • langchain-community: Community integrations (FireCrawlLoader)
  • langgraph: Workflow orchestration
  • python-dotenv: Environment variable management

Logging

The project generates two log files:

  • logs/langgraph.log: Detailed LangGraph workflow execution, including:

    • Node start/end events
    • LLM prompts and full responses
    • Routing decisions
    • State transitions
  • logs/scrape.log: General scraping operations and errors

Quality Control

Content is scored on a 0-10 scale:

  • 8-10: Real article content, substantial text, informative
  • 5-7: Some content but may be partial or mixed with navigation
  • 0-4: Navigation, ads, paywall, or very little content

Documents with scores below the threshold or word counts below min_words are filtered out.
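Combining both thresholds, the keep/drop decision reduces to a single predicate. In this sketch, min_words=400 matches the README's default, while min_score=5 is an assumed example value, since the README does not state the project's actual score threshold:

```python
def keep_document(word_count, content_score, min_words=400, min_score=5):
    # A document survives only if it clears both thresholds.
    # min_words=400 matches this README's default; min_score=5 is an
    # assumed example, not necessarily the project's setting.
    return word_count >= min_words and content_score >= min_score
```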
