87 changes: 63 additions & 24 deletions README.md
@@ -1,6 +1,6 @@
# Google Search Tool
# Search Tool

A Playwright-based Node.js tool that bypasses search engine anti-scraping mechanisms to execute Google searches and extract results. It can be used directly as a command-line tool or as a Model Context Protocol (MCP) server to provide real-time search capabilities to AI assistants like Claude.
A Playwright-based Node.js tool that bypasses search engine anti-scraping mechanisms to execute Google and Bing searches and extract results. It can be used directly as a command-line tool or as a Model Context Protocol (MCP) server to provide real-time search capabilities to AI assistants like Claude.

[![Star History Chart](https://api.star-history.com/svg?repos=web-agent-master/google-search&type=Date)](https://star-history.com/#web-agent-master/google-search&Date)

@@ -9,6 +9,8 @@ A Playwright-based Node.js tool that bypasses search engine anti-scraping mechan
## Key Features

- **Local SERP API Alternative**: No need to rely on paid search engine results API services, all searches are executed locally
- **Multiple Search Engines Support**: Currently supports Google and Bing search engines
- **URL Content Crawler**: Extract content from any web page with customizable selectors and metadata extraction
- **Advanced Anti-Bot Detection Bypass Techniques**:
- Intelligent browser fingerprint management that simulates real user behavior
- Automatic saving and restoration of browser state to reduce verification frequency
@@ -24,6 +26,7 @@ A Playwright-based Node.js tool that bypasses search engine anti-scraping mechan
- Command-line parameter support for search keywords
- MCP server support for AI assistant integration
- Returns search results with title, link, and snippet
- URL crawler with customizable content extraction and metadata support
- JSON format output
- Support for both headless and headed modes (for debugging)
- Detailed logging output
@@ -69,35 +72,52 @@ This tool has been specially adapted for Windows environments:

## Usage

### Command Line Tool
### Command Line

```bash
# Direct command line usage
google-search "search keywords"

# Using command line options
google-search --limit 5 --timeout 60000 --no-headless "search keywords"
# Google search
npx google-search "your search query"
# Or with options
npx google-search --limit 5 "your search query"

# Bing search
npx bing-search "your search query"
# Or with options
npx bing-search --limit 5 "your search query"

# URL crawler
npx url-crawler "https://example.com"
# Or with options
npx url-crawler -s "article.main-content" -w "div.loaded-content" -t 30000 "https://example.com"
```

# Or using npx
npx google-search-cli "search keywords"
You can also use the subcommands:

# Run in development mode
pnpm dev "search keywords"
```bash
# Google search
npx google-search google "your search query"

# Run in debug mode (showing browser interface)
pnpm debug "search keywords"
# Bing search
npx google-search bing "your search query"
```

#### Command Line Options
### Options

- `-l, --limit <number>`: Result count limit (default: 10)
- `-t, --timeout <number>`: Timeout in milliseconds (default: 60000)
- `--no-headless`: Show browser interface (for debugging)
- `--remote-debugging-port <number>`: Enable remote debugging port (default: 9222)
- `--state-file <path>`: Browser state file path (default: ./browser-state.json)
#### Search Options
- `--limit <number>`: Limit the number of results (default: 10)
- `--timeout <number>`: Set timeout in milliseconds (default: 30000)
- `--state-file <path>`: Specify browser state file path (default: ./browser-state.json)
- `--no-save-state`: Don't save browser state
- `--locale <locale>`: Specify search result language (default: zh-CN)

#### URL Crawler Options
- `-s, --selector <selector>`: CSS selector to extract specific content
- `-w, --wait-for <selector>`: Wait for specified element to appear before extracting content
- `-t, --timeout <ms>`: Timeout in milliseconds (default: 30000)
- `--no-metadata`: Don't extract metadata
- `--no-headless`: Run browser in headed mode
- `--no-save-state`: Don't save browser state
- `-V, --version`: Display version number
- `-h, --help`: Display help information
- `--state-file <path>`: Specify browser state file path (default: ~/.url-crawler-browser-state.json)

#### Output Example

@@ -125,15 +145,34 @@ pnpm debug "search keywords"
}
```

#### URL Crawler Output Example

```json
{
"url": "https://example.com",
"title": "Example Domain",
"content": "Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\nMore information...",
"metadata": {
"viewport": "width=device-width, initial-scale=1"
},
"timestamp": "2025-03-06T07:44:05.698Z"
}
```
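Since the crawler emits a single JSON object, its output is straightforward to consume programmatically. The sketch below parses the sample output shown above in Node.js; the field names (`url`, `title`, `content`, `metadata`, `timestamp`) are taken from that sample and are an assumption about the tool's stable output shape.

```javascript
// Parse the crawler's JSON output and pick out the useful fields.
// The raw string mirrors the sample output shown above.
const raw = `{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "Example Domain\\n\\nThis domain is for use in illustrative examples in documents.",
  "metadata": { "viewport": "width=device-width, initial-scale=1" },
  "timestamp": "2025-03-06T07:44:05.698Z"
}`;

const page = JSON.parse(raw);
console.log(page.title);                  // → "Example Domain"
console.log(page.content.split("\n")[0]); // first line of the extracted text
console.log(Object.keys(page.metadata));  // which metadata keys were extracted
```

In a shell pipeline the same result could be obtained by piping the CLI's stdout into such a script, assuming the tool writes the JSON document to stdout.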

### MCP Server

This project provides Model Context Protocol (MCP) server functionality, allowing AI assistants like Claude to directly use Google search capabilities. MCP is an open protocol that enables AI assistants to safely access external tools and data.

```bash
# Build the project
pnpm build
# Start the MCP server
npx google-search-mcp
```

The MCP server provides three tools:
- `google-search`: For Google search
- `bing-search`: For Bing search
- `url-crawler`: For crawling and extracting content from URLs
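Under MCP, each of these tools is invoked with a JSON-RPC `tools/call` request sent over the server's stdio transport. The sketch below builds such a request for the `google-search` tool; the envelope shape comes from the MCP specification, while the argument names (`query`, `limit`) are illustrative assumptions, not confirmed parameter names of this server.

```javascript
// Build the JSON-RPC request an MCP client would send to call a tool.
// "tools/call" is the standard MCP method; the argument names below
// are assumptions for illustration only.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "google-search",
    arguments: { query: "playwright typescript", limit: 5 },
  },
};

// Clients send this, newline-delimited, to the server's stdin.
const wire = JSON.stringify(request);
console.log(wire);
```

In practice Claude Desktop constructs these requests itself; the sketch only shows what travels over the wire.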

#### Integration with Claude Desktop

1. Edit the Claude Desktop configuration file:
106 changes: 72 additions & 34 deletions README.zh-CN.md
@@ -1,19 +1,23 @@
# Google 搜索工具
# 搜索工具

这是一个基于 Playwright 的 Node.js 工具,能够绕过搜索引擎的反爬虫机制,执行 Google 搜索并提取结果。它可作为命令行工具直接使用,或通过 Model Context Protocol (MCP) 服务器为 Claude 等 AI 助手提供实时搜索能力。
一个基于 Playwright 的 Node.js 工具,能够绕过搜索引擎的反爬虫机制,执行 Google 和 Bing 搜索并提取结果。它可以直接作为命令行工具使用,也可以作为 Model Context Protocol (MCP) 服务器为 Claude 等 AI 助手提供实时搜索能力。

[![Star History Chart](https://api.star-history.com/svg?repos=web-agent-master/google-search&type=Date)](https://star-history.com/#web-agent-master/google-search&Date)

## 核心亮点
[English Documentation](README.md)

- **本地化 SERP API 替代方案**:无需依赖付费的搜索引擎结果 API 服务,完全在本地执行搜索操作
## 主要特点

- **本地 SERP API 替代方案**:无需依赖付费的搜索引擎结果 API 服务,所有搜索在本地执行
- **多搜索引擎支持**:目前支持 Google 和 Bing 搜索引擎
- **URL 内容爬取器**:可提取任何网页内容,支持自定义选择器和元数据提取
- **先进的反机器人检测绕过技术**:
- 智能浏览器指纹管理,模拟真实用户行为
- 自动保存和恢复浏览器状态,减少验证频率
- 无头/有头模式智能切换,遇到验证时自动转为有头模式让用户完成验证
- 多种设备和区域设置随机化,降低被检测风险
- **MCP 服务器集成**:为 Claude 等 AI 助手提供实时搜索能力,无需额外 API 密钥
- **完全开源免费**:所有代码开源,无使用限制,可自由定制和扩展
- 智能无头/有头模式切换,在需要验证时自动切换到有头模式
- 设备和区域设置的随机化,降低检测风险
- **MCP 服务器集成**:为 Claude 等 AI 助手提供实时搜索能力,无需额外的 API 密钥
- **完全开源和免费**:所有代码开源,无使用限制,可自由定制和扩展

## 技术特性

@@ -22,6 +26,7 @@
- 支持命令行参数输入搜索关键词
- 支持作为 MCP 服务器,为 Claude 等 AI 助手提供搜索能力
- 返回搜索结果的标题、链接和摘要
- URL 爬取器支持自定义内容提取和元数据提取
- 以 JSON 格式输出结果
- 支持无头模式和有头模式(调试用)
- 提供详细的日志输出
@@ -67,36 +72,52 @@ pnpm link

## 使用方法

### 命令行工具
### 命令行

```bash
# 直接使用命令行
google-search "搜索关键词"

# 使用命令行选项
google-search --limit 5 --timeout 60000 --no-headless "搜索关键词"

# Google 搜索
npx google-search "你的搜索查询"
# 或者带选项
npx google-search --limit 5 "你的搜索查询"

# Bing 搜索
npx bing-search "你的搜索查询"
# 或者带选项
npx bing-search --limit 5 "你的搜索查询"

# URL 爬取器
npx url-crawler "https://example.com"
# 或者带选项
npx url-crawler -s "article.main-content" -w "div.loaded-content" -t 30000 "https://example.com"
```

# 或者使用 npx
npx google-search-cli "搜索关键词"
你也可以使用子命令:

# 开发模式运行
pnpm dev "搜索关键词"
```bash
# Google 搜索
npx google-search google "你的搜索查询"

# 调试模式运行(显示浏览器界面)
pnpm debug "搜索关键词"
# Bing 搜索
npx google-search bing "你的搜索查询"
```

#### 命令行选项
### 选项

#### 搜索选项
- `--limit <number>`:限制结果数量(默认:10)
- `--timeout <number>`:设置超时时间(毫秒)(默认:30000)
- `--state-file <path>`:指定浏览器状态文件路径(默认:./browser-state.json)
- `--no-save-state`:不保存浏览器状态
- `--locale <locale>`:指定搜索结果语言(默认:zh-CN)

- `-l, --limit <number>`: 结果数量限制(默认:10)
- `-t, --timeout <number>`: 超时时间(毫秒,默认:60000)
- `--no-headless`: 显示浏览器界面(调试用)
- `--remote-debugging-port <number>`: 启用远程调试端口(默认:9222)
- `--state-file <path>`: 浏览器状态文件路径(默认:./browser-state.json)
- `--no-save-state`: 不保存浏览器状态
- `-V, --version`: 显示版本号
- `-h, --help`: 显示帮助信息
#### URL 爬取器选项
- `-s, --selector <selector>`:CSS 选择器,用于提取特定内容
- `-w, --wait-for <selector>`:等待指定元素出现后再提取内容
- `-t, --timeout <ms>`:超时时间(毫秒)(默认:30000)
- `--no-metadata`:不提取元数据
- `--no-headless`:使用有头模式运行浏览器
- `--no-save-state`:不保存浏览器状态
- `--state-file <path>`:指定浏览器状态文件路径(默认:~/.url-crawler-browser-state.json)

#### 输出示例

@@ -124,15 +145,32 @@ pnpm debug "搜索关键词"
}
```

### MCP 服务器
#### URL 爬取器输出示例

本项目提供 Model Context Protocol (MCP) 服务器功能,让 Claude 等 AI 助手直接使用 Google 搜索能力。MCP 是一个开放协议,使 AI 助手能安全访问外部工具和数据。
```json
{
"url": "https://example.com",
"title": "Example Domain",
"content": "Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\nMore information...",
"metadata": {
"viewport": "width=device-width, initial-scale=1"
},
"timestamp": "2025-03-06T07:44:05.698Z"
}
```

### MCP 服务器

```bash
# 构建项目
pnpm build
# 启动 MCP 服务器
npx google-search-mcp
```

MCP 服务器提供三个工具:
- `google-search`:用于 Google 搜索
- `bing-search`:用于 Bing 搜索
- `url-crawler`:用于爬取和提取 URL 内容

#### 与 Claude Desktop 集成

1. 编辑 Claude Desktop 配置文件
3 changes: 3 additions & 0 deletions bin/bing-search
@@ -0,0 +1,3 @@
#!/usr/bin/env node

import '../dist/src/bing-index.js';
3 changes: 3 additions & 0 deletions bin/bing-search-mcp
@@ -0,0 +1,3 @@
#!/usr/bin/env node

import '../dist/src/mcp-server.js';
3 changes: 3 additions & 0 deletions bin/bing-search-mcp.cmd
@@ -0,0 +1,3 @@
@IF EXIST "%~dp0\node.exe" (
"%~dp0\node.exe" "%~dp0\bing-search-mcp" %*
)
3 changes: 3 additions & 0 deletions bin/bing-search.cmd
@@ -0,0 +1,3 @@
@IF EXIST "%~dp0\node.exe" (
"%~dp0\node.exe" "%~dp0\bing-search" %*
)
3 changes: 3 additions & 0 deletions bin/search-mcp
@@ -0,0 +1,3 @@
#!/usr/bin/env node

import '../dist/src/mcp-server.js';
3 changes: 3 additions & 0 deletions bin/search-mcp.cmd
@@ -0,0 +1,3 @@
@IF EXIST "%~dp0\node.exe" (
"%~dp0\node.exe" "%~dp0\search-mcp" %*
)
21 changes: 21 additions & 0 deletions bin/url-crawler
@@ -0,0 +1,21 @@
#!/usr/bin/env node

// Check whether we are running in a development environment
import { fileURLToPath } from 'url';
import { dirname, resolve } from 'path';
import { existsSync } from 'fs';
import { createRequire } from 'module';

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
const isDevMode = process.env.NODE_ENV === 'development' || !existsSync(resolve(__dirname, '../dist'));

if (isDevMode) {
// Development mode: run the TypeScript source with ts-node
const require = createRequire(import.meta.url);
require('ts-node').register();
await import('../src/url-crawler-test.js');
} else {
// Production mode: run the compiled JavaScript
await import('../dist/src/url-crawler-test.js');
}
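The entry-point selection in this launcher boils down to a pure decision: run the TypeScript source when `NODE_ENV` is `development` or no `dist` build exists, otherwise run the compiled output. A minimal sketch of that rule, with the environment probing factored out so the logic can be exercised directly:

```javascript
// Decide which entry point to load, mirroring bin/url-crawler's logic.
// nodeEnv and distExists stand in for process.env.NODE_ENV and the
// existsSync() check on ../dist in the real script.
function pickEntry({ nodeEnv, distExists }) {
  const isDevMode = nodeEnv === "development" || !distExists;
  return isDevMode
    ? "../src/url-crawler-test.js"       // dev: ts-node handles the source
    : "../dist/src/url-crawler-test.js"; // prod: compiled output
}

console.log(pickEntry({ nodeEnv: "production", distExists: true }));
// → "../dist/src/url-crawler-test.js"
```

Factoring the check into a function like this also makes the fallback behavior easy to unit-test without touching the filesystem.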
21 changes: 16 additions & 5 deletions package.json
@@ -1,13 +1,17 @@
{
"name": "google-search-cli",
"name": "search-cli",
"version": "1.0.0",
"description": "基于 Playwright 的 Google 搜索 CLI 工具",
"description": "基于 Playwright 的 Google 和 Bing 搜索 CLI 工具",
"type": "module",
"main": "dist/index.js",
"types": "dist/index.d.ts",
"bin": {
"google-search": "./bin/google-search",
"google-search-mcp": "./bin/google-search-mcp"
"google-search-mcp": "./bin/google-search-mcp",
"bing-search": "./bin/bing-search",
"bing-search-mcp": "./bin/bing-search-mcp",
"search-mcp": "./bin/search-mcp",
"url-crawler": "./bin/url-crawler"
},
"scripts": {
"build": "tsc",
@@ -21,7 +25,13 @@
"link": "npm link",
"clean": "node -e \"const fs = require('fs'); const path = require('path'); if (fs.existsSync('dist')) fs.rmSync('dist', { recursive: true, force: true });\"",
"mcp": "ts-node src/mcp-server.ts",
"mcp:build": "npm run build && node dist/src/mcp-server.js"
"mcp:build": "npm run build && node dist/src/mcp-server.js",
"bing": "ts-node src/bing-index.ts",
"bing:build": "npm run build && node dist/src/bing-index.js",
"bing:test": "ts-node src/bing-index.ts \"playwright typescript\"",
"bing:debug": "ts-node src/bing-index.ts --no-headless \"playwright typescript\"",
"url-crawler": "ts-node src/url-crawler-test.ts",
"url-crawler:build": "npm run build && node dist/src/url-crawler-test.js"
},
"repository": {
"type": "git",
@@ -57,5 +67,6 @@
},
"engines": {
"node": ">=16.0.0"
}
},
"packageManager": "yarn@1.22.22+sha1.ac34549e6aa8e7ead463a7407e1c7390f61a6610"
}