A web scraping workflow that uses LangGraph for orchestration, FireCrawl for intelligent web scraping, and LangChain with OpenAI for quality control and content validation.
This project implements an agentic scraping pipeline that:
- Maps website URLs using FireCrawl's site mapping to discover article URLs
- Crawls and scrapes content from discovered URLs using FireCrawl
- Validates content quality using LLM-based scoring (0-10 scale)
- Filters content based on minimum word count and quality scores
- Saves results to structured JSON/JSONL files with metadata
The workflow is orchestrated using LangGraph, which provides a stateful, graph-based approach to managing the scraping pipeline.
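Based on the configuration example later in this README, the shared state is a `ScrapeState` TypedDict (defined in `src/objects.py`). A minimal sketch of what that definition might look like, using only the fields shown in this README (the real definition may carry additional fields):

```python
from typing import List, TypedDict

class ScrapeState(TypedDict, total=False):
    """State carried between LangGraph nodes (sketch, not the real definition)."""
    base_url: str       # starting URL for crawling
    queue: List[str]    # URLs waiting to be scraped
    urls_path: str      # where discovered URLs are written (JSON)
    out_path: str       # where scraped content is written (JSONL)
    max_posts: int      # cap on pages to scrape
    min_words: int      # minimum word count to keep a document
```

Because it is a TypedDict, the state stays a plain dict at runtime, which is what LangGraph passes between nodes.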
- 🗺️ Site Mapping: Automatically discovers article URLs from websites using FireCrawl
- 🔍 Intelligent Scraping: Uses FireCrawl to crawl and extract content from discovered URLs
- 🤖 LLM Quality Control: Validates and scores content quality using OpenAI
- 📊 Structured Output: Saves URLs to JSON and content to JSONL with metadata
- 🔄 Stateful Workflow: LangGraph manages the entire pipeline state
- 📝 Comprehensive Logging: Detailed logs for workflow execution and LLM interactions
- Python 3.11 or higher
- FireCrawl API key
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd agentic_project
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file in the project root:

   ```
   FIRECRAWL_API_KEY=your_firecrawl_api_key_here
   OPENAI_API_KEY=your_openai_api_key_here
   ```
Note: Get your FireCrawl API key from firecrawl.dev and your OpenAI API key from platform.openai.com.
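Before running the pipeline, you can sanity-check that both keys are visible to the process. A small helper sketch (`missing_keys` is illustrative, not part of the project):

```python
import os

REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=None):
    """Return the required API keys that are absent or empty."""
    if env is None:
        env = os.environ
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```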
```
agentic_project/
├── run.py                  # Main entry point and LangGraph workflow
├── scrape_firecrawl.py     # FireCrawl mapping and crawling nodes
├── src/
│   ├── objects.py          # ScrapeState TypedDict definition
│   └── utils.py            # Utility functions (URL cleaning, etc.)
├── data/                   # Output directory
│   ├── urls/               # Saved URL lists (JSON format)
│   └── text/               # Scraped content (JSONL format)
├── logs/                   # Log files
│   ├── langgraph.log       # LangGraph workflow logs
│   └── scrape.log          # General scraping logs
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables (not in git)
```
1. Edit the base URL in `run.py` (line ~126):

   ```python
   base = "https://www.growth-memo.com"  # change if needed
   ```

2. Configure scraping parameters in the state initialization:

   ```python
   state: ScrapeState = {
       "base_url": base,
       "queue": [base],
       "urls_path": f"data/urls/urls_{timestamp}_{site}.json",
       "out_path": f"data/text/scraped_{timestamp}_{site}.jsonl",
       "max_posts": 3,    # Number of pages to scrape
       "min_words": 400,  # Minimum word count to save
       # ... other settings
   }
   ```

3. Run the scraper:

   ```bash
   python run.py
   ```
The pipeline generates two types of output files:
- URLs file (`data/urls/urls_<timestamp>_<domain>.json`): list of discovered article URLs
- Content file (`data/text/scraped_<timestamp>_<domain>.jsonl`): scraped content with metadata
Each line in the JSONL file contains:
```json
{
  "url": "https://example.com/page",
  "base_url": "https://example.com",
  "base_url_hash": "abc12345",
  "url_hash": "def67890",
  "word_count": 1234,
  "text": "Full article text...",
  "content_score": 8,
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "source": "https://example.com/page",
    ...
  }
}
```

The LangGraph workflow consists of three main nodes:
1. `map_urls` (Site Mapping):
   - Maps the website using FireCrawl to discover all URLs
   - Filters for article URLs (containing `/p/` in the path)
   - Limits to `max_posts` URLs
   - Saves URLs to `data/urls/urls_<timestamp>_<domain>.json`

2. `crawl_urls` (Content Scraping):
   - Reads URLs from the saved JSON file
   - Uses FireCrawl to scrape content from each URL
   - Extracts content in markdown format
   - Stores Documents in state

3. `validate_and_save` (Quality Control):
   - Processes each Document
   - Filters by minimum word count
   - Validates content quality using LLM (0-10 score)
   - Saves valid content to `data/text/scraped_<timestamp>_<domain>.jsonl`
```
map_urls → crawl_urls → validate_and_save → END
```
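This linear flow can be mimicked in plain Python without the LangGraph dependency. The node bodies below are stand-ins (no real FireCrawl or LLM calls, and the `discovered` key is invented for the sketch); only the pattern of each node returning a partial state update, which LangGraph merges into the shared state, matches the actual pipeline:

```python
def map_urls(state):
    # stand-in: the real node asks FireCrawl to map the site
    urls = [u for u in state["discovered"] if "/p/" in u][: state["max_posts"]]
    return {"queue": urls}

def crawl_urls(state):
    # stand-in: the real node scrapes each URL via FireCrawl
    return {"docs": [{"url": u, "text": f"content of {u}"} for u in state["queue"]]}

def validate_and_save(state):
    # stand-in: the real node also asks the LLM for a 0-10 quality score
    kept = [d for d in state["docs"] if len(d["text"].split()) >= state["min_words"]]
    return {"saved": kept}

def run_pipeline(state):
    # run the three nodes in order, merging each partial update into the state
    for node in (map_urls, crawl_urls, validate_and_save):
        state.update(node(state))
    return state
```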
| Variable | Description | Required |
|---|---|---|
| `FIRECRAWL_API_KEY` | Your FireCrawl API key | Yes |
| `OPENAI_API_KEY` | Your OpenAI API key | Yes |
Edit these in `run.py`:

- `max_posts`: Maximum number of pages to scrape (default: 3)
- `min_words`: Minimum word count to save a document (default: 400)
- `base_url`: Starting URL for crawling
The LLM model can be changed in `run.py` (line ~24):

```python
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
```

- `firecrawl`: Web scraping and crawling
- `langchain[openai]`: LLM integration and document handling
- `langchain-community`: Community integrations (FireCrawlLoader)
- `langgraph`: Workflow orchestration
- `python-dotenv`: Environment variable management
The project generates two log files:
- `logs/langgraph.log`: Detailed LangGraph workflow execution, including:
  - Node start/end events
  - LLM prompts and full responses
  - Routing decisions
  - State transitions
- `logs/scrape.log`: General scraping operations and errors
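Both files look like ordinary Python `logging` output; a generic sketch of how such per-file loggers can be set up (the `make_logger` helper is illustrative, not the project's actual code):

```python
import logging

def make_logger(name: str, path: str) -> logging.Logger:
    """Create a named logger that writes timestamped lines to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Separate named loggers keep workflow events and general scraping errors in their own files without interfering with each other.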
Content is scored on a 0-10 scale:
- 8-10: Real article content, substantial text, informative
- 5-7: Some content but may be partial or mixed with navigation
- 0-4: Navigation, ads, paywall, or very little content
Documents with scores below the threshold or word counts below `min_words` are filtered out.
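Downstream consumers can reapply the same thresholds when reading the JSONL output. A small reader sketch using only the fields documented above (`load_quality_docs` is illustrative, not part of the project):

```python
import json

def load_quality_docs(path, min_score=8, min_words=400):
    """Read the scraper's JSONL output, keeping only docs above both thresholds."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["content_score"] >= min_score and rec["word_count"] >= min_words:
                docs.append(rec)
    return docs
```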