diff --git a/skills/firstdata/SKILL.md b/skills/firstdata/SKILL.md
index 498f608..4a667a9 100644
--- a/skills/firstdata/SKILL.md
+++ b/skills/firstdata/SKILL.md
@@ -86,6 +86,94 @@ Or add manually to your MCP config:
 
 Once connected, browse the tool list provided by the firstdata MCP and select the appropriate tool based on your needs.
 
+## MCP Tools Reference
+
+The FirstData MCP server provides 5 tools. Below is a reference with usage guidelines, limitations, and examples.
+
+### Common Limitations (all tools)
+
+- **Authentication required**: All tools require a valid API key (JWT token) via `Authorization: Bearer ` header.
+- **Daily call quota**: API usage is subject to a per-token daily call quota. Quota varies by API key tier (trial accounts: 30 calls/day). MCP tool calls do not return remaining quota information. To check quota, use the Token verification API (`POST /api/token/verify`) which returns `remaining_daily` in the response — this is a separate HTTP call, not available through MCP tool invocation.
+- **Network dependency**: All tools make HTTP calls to the FirstData server (`firstdata.deepminer.com.cn`). Network latency and server availability affect response times.
+
+### Tool: `search_source`
+
+**Purpose**: Unified data source search tool supporting keyword search, structured filtering, pagination, and multiple output modes.
+
+**Limitations**:
+- Maximum **200** results per query (`limit` parameter range: 1–200, default: 20).
+- ⚠️ **Each keyword is matched as an independent substring — pass each search term as a separate array element.** For example, use `["中国", "GDP"]` (173 results) instead of `["中国 GDP"]` (0 results). This is by design to preserve multi-word terms like `"New Zealand"` or `"World Bank"`.
+- Keyword matching is **substring-based**, not semantic search. Keywords are matched against source metadata fields (name, description, tags, content).
+- The `domain` parameter uses **substring matching**, not exact enum matching (e.g., `"finance"` matches `"public-finance"`, `"finance"`, `"financial-markets"`).
+- No boolean operators (AND/OR/NOT). Multiple keywords in the array are combined with **OR logic** (results matching any keyword are returned, deduplicated).
+- Response time: typically **~1 second**.
+
+### Tool: `get_source`
+
+**Purpose**: Retrieve full details for specific data sources by their IDs.
+
+**Limitations**:
+- Invalid `source_id` values do NOT cause an error response (`isError: false`). Instead, the result array includes `{"id": "xxx", "error": "Not found"}` for each invalid ID alongside valid results. Callers must check individual items for `error` fields rather than relying solely on `isError`.
+- No schema-level limit on the number of `source_ids` per request, but performance with large batches (50+) is unverified. As a practical guideline (not a hard limit), consider batching in groups of ~20.
+- The `fields` parameter filters returned fields; when omitted, all fields are returned.
+
+### Tool: `ask_agent`
+
+**Purpose**: LLM-powered intelligent search agent for complex, cross-domain, or ambiguous queries that require multi-step reasoning.
+
+**Limitations**:
+- Query length: 2–1,000 characters.
+- Maximum results: 1–20 (default: 5).
+- **Non-idempotent**: Same query may return different results across calls (LLM reasoning varies).
+- **Response time: typically 2–8 seconds** (involves LLM inference). May take longer (10–30+ seconds) when the agent triggers `web_search` for external information.
+- Internally uses LangChain ReAct agent with `jq` for local data queries plus optional `web_search`. The web search step is not user-controllable.
+- **Use `search_source` instead** for simple keyword matching or structured filtering — it is faster, deterministic, and cheaper.
+
+### Tool: `get_access_guide`
+
+**Purpose**: Generate detailed access instructions for a specific data source using RAG (Retrieval-Augmented Generation).
+
+**Limitations**:
+- **Not all data sources have instruction libraries.** If a source has no pre-built instructions, results will be empty or irrelevant.
+- Invalid `source_id` returns `{"error": "数据源 xxx 不存在"}`.
+- `top_k` range: 1–5 (default: 3).
+- **Response time is highly variable: 3–20 seconds**, depending on RAG retrieval complexity and server load.
+- Retrieval quality depends heavily on the specificity of the `operation` parameter. Vague descriptions yield lower-quality matches. Use specific action verbs and entity names (e.g., "查询2024年M2货币供应量数据" rather than "查数据").
+
+### Tool: `report_feedback`
+
+**Purpose**: Submit user feedback to the development team when FirstData has a confirmed issue.
+
+**Limitations**:
+- `feedback_message` length: 10–2,000 characters.
+- **Non-idempotent**: Duplicate calls create duplicate feedback entries. Do not retry on success.
+- Only use when a genuine issue is confirmed (missing source, incorrect data, broken functionality). Do not use as a general comment channel.
+
+**Examples**:
+
+```
+# Example 1: Broken link
+feedback_message="链接失效:数据源 china-pbc 的 data_url 返回 404,无法访问数据页面。检索关键词:中国货币供应量"
+
+# Example 2: Outdated content
+feedback_message="数据内容过时:数据源 worldbank-open-data 的 update_frequency 标注为 quarterly,但实际已超过 6 个月未更新"
+```
+
+## Description Quality Guidelines
+
+When adding or modifying MCP tool descriptions, follow these principles (based on [MCP tool description quality research](https://arxiv.org/abs/2602.14878)):
+
+**Core principle: "Write it right before writing it all"** — Functionality accuracy (+11.6% impact) matters ~8× more than Conciseness (+1.5%).
+
+**6-dimension checklist** (check all before submitting):
+
+- [ ] **Purpose**: Is the tool's function clearly stated in the first sentence?
+- [ ] **Guidelines**: Are usage scenarios and when-to-use / when-not-to-use rules included?
+- [ ] **Examples**: Are typical input/output examples provided?
+- [ ] **Limitations**: Are constraints, edge cases, and known limitations documented?
+- [ ] **Parameters**: Are all parameters described with types, ranges, and defaults?
+- [ ] **Return Format**: Is the response structure documented?
+
 ## Community
 
 FirstData is an open-source project — join us in building the authoritative data source knowledge base for agents:
diff --git a/skills/firstdata/mcp-tool-descriptions-draft.md b/skills/firstdata/mcp-tool-descriptions-draft.md
new file mode 100644
index 0000000..4c2a8f7
--- /dev/null
+++ b/skills/firstdata/mcp-tool-descriptions-draft.md
@@ -0,0 +1,91 @@
+# MCP Tool Descriptions — Server-Side Draft
+
+> **Purpose**: This file contains the text to be added to each tool's description in the MCP server Python code.
+> After PR review approval, copy these Limitations sections into the server-side tool description strings.
+> Content is condensed from `SKILL.md` for server-side use. Semantics must match; formatting may differ slightly for plain-text context.
+
+---
+
+## search_source — Add to description
+
+```
+**Limitations:**
+- Maximum 200 results per query (limit range: 1–200, default: 20)
+- Each keyword is matched as an independent substring — pass each search term as a separate array element. Use ["中国", "GDP"] instead of ["中国 GDP"]. This preserves multi-word terms like "New Zealand" or "World Bank"
+- Keyword matching is substring-based, not semantic search
+- domain parameter uses substring matching, not exact enum matching
+- No boolean operators (AND/OR/NOT). Multiple keywords use OR logic (results matching any keyword are returned, deduplicated)
+- Response time: typically ~1 second
+- Subject to daily API call quota per token. MCP tool calls do not return remaining quota; use Token verification API (POST /api/token/verify, returns remaining_daily) to check
+```
+
+## get_source — Add to description
+
+```
+**Limitations:**
+- Invalid source_id does NOT set isError=true. Returns {"id": "xxx", "error": "Not found"} in the result array. Callers must check individual items for error fields
+- No schema-level limit on source_ids count, but large batch performance is unverified. Practical guideline (not a hard limit): batch in groups of ~20
+- Subject to daily API call quota per token
+```
+
+## ask_agent — Add to description
+
+```
+**Limitations:**
+- Query length: 2–1,000 characters
+- Maximum results: 1–20 (default: 5)
+- Non-idempotent: same query may return different results (LLM reasoning varies)
+- Response time: typically 2–8 seconds; may reach 10–30+ seconds when web_search is triggered
+- Subject to daily API call quota per token
+```
+
+## get_access_guide — Add to description
+
+```
+**Limitations:**
+- Not all data sources have instruction libraries. Sources without pre-built instructions return empty or irrelevant results
+- Invalid source_id returns {"error": "数据源 xxx 不存在"}
+- top_k range: 1–5 (default: 3)
+- Response time is highly variable: 3–20 seconds depending on RAG retrieval complexity and server load
+- Retrieval quality depends on specificity of the operation parameter. Use specific action verbs and entity names
+- Subject to daily API call quota per token
+```
+
+## report_feedback — Add to description + Example
+
+```
+**Limitations:**
+- feedback_message length: 10–2,000 characters
+- Non-idempotent: duplicate calls create duplicate feedback entries. Do not retry on success
+- Subject to daily API call quota per token
+
+**Examples:**
+- Broken-link feedback: feedback_message="链接失效:数据源 china-pbc 的 data_url 返回 404,无法访问数据页面。检索关键词:中国货币供应量"
+- Outdated-data feedback: feedback_message="数据内容过时:数据源 worldbank-open-data 的 update_frequency 标注为 quarterly,但实际已超过 6 个月未更新"
+```
+
+---
+
+## Verification Evidence
+
+Each limitation is backed by one of these sources:
+
+| Limitation | Source |
+|---|---|
+| search_source limit: 1–200 | inputSchema `maximum: 200, minimum: 1` |
+| Keywords not auto-tokenized | Tested: `["中国 GDP"]` → 0 results; `["中国", "GDP"]` → 173 results |
+| Multiple keywords use OR logic | Tested: `["GDP"]`→100, `["health"]`→78, `["GDP","health"]`→138 (>max, confirmed OR) |
+| Substring matching | Tested: `["中国GDP"]` → 1 result (exact substring); `["GDP"]` → 100 results |
+| domain substring matching | inputSchema description: "领域关键词,子串匹配" |
+| get_source silent error | Tested: invalid ID returns `{"id":"xxx","error":"Not found"}` with `isError: false` |
+| get_source mixed valid/invalid | Tested: valid IDs return data, invalid return error objects, no request interruption |
+| ask_agent query length | inputSchema `minLength: 2, maxLength: 1000` |
+| ask_agent max_results | inputSchema `minimum: 1, maximum: 20, default: 5` |
+| ask_agent non-idempotent | annotations `idempotentHint: false` |
+| ask_agent response time | Tested 3 runs: 7.4s, 2.9s, 1.8s |
+| get_access_guide invalid source | Tested: returns `{"error": "数据源 xxx 不存在"}` |
+| get_access_guide top_k | inputSchema `minimum: 1, maximum: 5, default: 3` |
+| get_access_guide response time | Tested 3 runs: 3.0s, 17.6s, 19.1s |
+| report_feedback message length | inputSchema `minLength: 10, maxLength: 2000` |
+| Daily call quota exists | TokenVerifyResponse schema: `quota_allowed`, `remaining_daily` fields |
+| Trial quota: 30/day | Tested via `/api/trial/session-info`: `total_calls: 30` |
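
The `get_source` limitations documented in this change (per-item `error` objects with `isError: false`, and the suggested batch size of about 20) could be handled on the caller side roughly as follows. This is a minimal sketch, not part of the FirstData codebase: it assumes the tool result has already been parsed into a list of dicts, and the helper names `batch_ids` and `partition_results` as well as the `name` field in the sample data are hypothetical.

```python
from typing import Any, Iterator


def batch_ids(source_ids: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield source IDs in groups of `size` (about 20 is the guideline above, not a hard limit)."""
    for start in range(0, len(source_ids), size):
        yield source_ids[start:start + size]


def partition_results(items: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[str]]:
    """Split a parsed get_source result array into found sources and missing IDs.

    Invalid IDs do not set isError, so each item is checked for an "error" key.
    """
    found = [item for item in items if "error" not in item]
    missing = [item["id"] for item in items if "error" in item]
    return found, missing


# Sample data: the error object matches the documented shape; "name" is a hypothetical field.
sample = [
    {"id": "china-pbc", "name": "hypothetical name field"},
    {"id": "xxx", "error": "Not found"},
]
found, missing = partition_results(sample)
print([item["id"] for item in found], missing)
```

Keeping the batching and the error check as pure functions leaves the actual MCP call out of the sketch, so the same helpers apply whichever client library issues the request.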