Stop Coding Scrapers, Start Getting Data: from Hours to Seconds

An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples, with no manual XPath/CSS selector writing required.
The SWDE dataset covers 8 verticals, 80 websites, and 124,291 pages.
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
Install from PyPI:

```bash
# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup
```

Or install from source:

```bash
# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup
```

Web2JSON provides five simple APIs. Perfect for databases, APIs, and real-time processing!
Extract structured data from HTML in one step (schema + parser + data).
Auto Mode - Let AI automatically discover and extract fields:
```python
from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # iteration_rounds=3        # default: 3
    # enable_schema_edit=True   # uncomment to manually edit the schema
)

result = extract_data(config)
# print(result.final_schema)    # Dict: extracted schema
# print(result.parser_code)     # str: generated parser code
# print(result.parsed_data[0])  # first item of parsed_data (List[Dict])
```

Predefined Mode - Extract only specific fields:
```python
config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    }
)

result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data
```

Generate a JSON schema describing the data structure in HTML.
```python
from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # iteration_rounds=3
    # enable_schema_edit=True  # uncomment to manually edit the schema
)

result = extract_schema(config)
# print(result.final_schema)          # Dict: final schema
# print(result.intermediate_schemas)  # List[Dict]: iteration history
```

Generate parser code from a schema (a Dict, or the output of the previous step).
```python
from web2json import Web2JsonConfig, infer_code

# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string"
}

config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema
)

result = infer_code(config)
# print(result.parser_code)  # str: BeautifulSoup parser code
# print(result.schema)       # Dict: schema used
```

Use parser code to extract data from HTML files.
```python
from web2json import Web2JsonConfig, extract_data_with_code

# Parser code from previous step or loaded from file
parser_code = """
def parse_html(html_content):
    # ... parser implementation
"""

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code=parser_code
)

result = extract_data_with_code(config)
# print(f"Success: {result.success_count}, Failed: {result.failed_count}")
# for item in result.parsed_data:
#     print(f"File: {item['filename']}")
#     print(f"Data: {item['data']}")
```

Group HTML files by layout similarity (for mixed-layout datasets).
```python
from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/"
)

result = classify_html_dir(config)
# print(f"Found {result.cluster_count} layout types")
# print(f"Noise files: {len(result.noise_files)}")
# for cluster_name, files in result.clusters.items():
#     print(f"{cluster_name}: {len(files)} files")
#     for file in files[:3]:
#         print(f"  - {file}")
```

Web2JsonConfig Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | Required | Project name (for identification) |
| `html_path` | `str` | Required | HTML directory or file path |
| `iteration_rounds` | `int` | `3` | Number of samples for learning |
| `schema` | `Dict` | `None` | Predefined schema (`None` = auto mode) |
| `enable_schema_edit` | `bool` | `False` | Enable manual schema editing |
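Putting the parameter table together, a single config that sets every option might look like the sketch below. The values are illustrative, and it assumes each parameter is accepted as a keyword argument with the name shown in the table.

```python
from web2json import Web2JsonConfig

# Illustrative config exercising every documented parameter;
# paths and the schema are placeholder values.
config = Web2JsonConfig(
    name="full_example",           # required: project name
    html_path="html_samples/",     # required: HTML directory or file
    iteration_rounds=3,            # number of samples for learning (default 3)
    schema={"title": "string"},    # predefined schema; None = auto mode
    enable_schema_edit=False,      # manual schema editing off
)
```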
Standalone API Parameters:
| API | Parameters | Returns |
|---|---|---|
| `extract_data` | `config: Web2JsonConfig` | `ExtractDataResult` |
| `extract_schema` | `config: Web2JsonConfig` | `ExtractSchemaResult` |
| `infer_code` | `config: Web2JsonConfig` | `InferCodeResult` |
| `extract_data_with_code` | `config: Web2JsonConfig` | `ParseResult` |
| `classify_html_dir` | `config: Web2JsonConfig` | `ClusterResult` |
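The `parser_code` passed to `extract_data_with_code` appears above only as a stub. Assuming its contract is a `parse_html(html_content)` function returning a dict of fields, a minimal hand-written parser using only the standard library might look like:

```python
from html.parser import HTMLParser

# Tiny helper that records the text of the first <h1> it sees.
class _TitleGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()

# Same shape the generated parser code is assumed to have:
# parse_html(html_content) -> dict of extracted fields.
# The "title" field and <h1>-based extraction are illustrative only.
def parse_html(html_content):
    grabber = _TitleGrabber()
    grabber.feed(html_content)
    return {"title": grabber.title}

print(parse_html("<html><body><h1>Hello</h1></body></html>"))
# {'title': 'Hello'}
```

The parsers web2json actually generates use BeautifulSoup; this sketch only illustrates the expected function shape.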
All result objects provide:
- Direct access to data via object attributes
- A `.to_dict()` method for serialization
- A `.get_summary()` method for quick stats
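As a sketch of how that shared interface can be used (with a hypothetical stand-in class, since the real result classes are returned by the web2json APIs), serializing a result to JSON looks the same for all five calls:

```python
import json

# Hypothetical stand-in mirroring the result-object interface
# (attribute access, .to_dict(), .get_summary()); the real classes
# come from web2json.
class DemoResult:
    def __init__(self, parsed_data):
        self.parsed_data = parsed_data

    def to_dict(self):
        return {"parsed_data": self.parsed_data}

    def get_summary(self):
        return {"items": len(self.parsed_data)}

result = DemoResult([{"title": "Hello"}])

# Serialize the full result, or log just the quick stats
print(json.dumps(result.to_dict()))   # {"parsed_data": [{"title": "Hello"}]}
print(result.get_summary())           # {'items': 1}
```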
```python
# Need data immediately? → extract_data
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)

# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema
)
code_result = infer_code(config)

# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code
)
data_result = extract_data_with_code(config)

# Have parser code, need to parse more files? → extract_data_with_code
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)
```

Apache-2.0 License
Made with ❤️ by the web2json-agent team

⭐ Star us on GitHub | 🐛 Report Issues | 📖 Documentation