MLT-OSS · firstdata-dev · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026
diff --git a/skills/firstdata/SKILL.md b/skills/firstdata/SKILL.md
@@ -86,6 +86,94 @@ Or add manually to your MCP config:
 
 Once connected, browse the tool list provided by the firstdata MCP and select the appropriate tool based on your needs.
 
+## MCP Tools Reference
+
+The FirstData MCP server provides 5 tools. Below is a reference with usage guidelines, limitations, and examples.
+
+### Common Limitations (all tools)
+
+- **Authentication required**: All tools require a valid API key (JWT token) via `Authorization: Bearer <token>` header.
+- **Daily call quota**: API usage is subject to a per-token daily call quota. Quota varies by API key tier (trial accounts: 30 calls/day). MCP tool calls do not return remaining quota information. To check quota, use the Token verification API (`POST /api/token/verify`) which returns `remaining_daily` in the response — this is a separate HTTP call, not available through MCP tool invocation.
+- **Network dependency**: All tools make HTTP calls to the FirstData server (`firstdata.deepminer.com.cn`). Network latency and server availability affect response times.
+
+### Tool: `search_source`
+
+**Purpose**: Unified data source search tool supporting keyword search, structured filtering, pagination, and multiple output modes.
+
+**Limitations**:
+- Maximum **200** results per query (`limit` parameter range: 1–200, default: 20).
+- ⚠️ **Each keyword is matched as an independent substring — pass each search term as a separate array element.** For example, use `["中国", "GDP"]` (173 results) instead of `["中国 GDP"]` (0 results). This is by design to preserve multi-word terms like `"New Zealand"` or `"World Bank"`.
+- Keyword matching is **substring-based**, not semantic search. Keywords are matched against source metadata fields (name, description, tags, content).
+- The `domain` parameter uses **substring matching**, not exact enum matching (e.g., `"finance"` matches `"public-finance"`, `"finance"`, `"financial-markets"`).
+- No boolean operators (AND/OR/NOT). Multiple keywords in the array are combined with **OR logic** (results matching any keyword are returned, deduplicated).
+- Response time: typically **~1 second**.
+
+### Tool: `get_source`
+
+**Purpose**: Retrieve full details for specific data sources by their IDs.
+
+**Limitations**:
+- Invalid `source_id` values do NOT cause an error response (`isError: false`). Instead, the result array includes `{"id": "xxx", "error": "Not found"}` for each invalid ID alongside valid results. Callers must check individual items for `error` fields rather than relying solely on `isError`.
+- No schema-level limit on the number of `source_ids` per request, but performance with large batches (50+) is unverified. As a practical guideline (not a hard limit), consider batching in groups of ~20.
+- The `fields` parameter filters returned fields; when omitted, all fields are returned.
+
+### Tool: `ask_agent`
+
+**Purpose**: LLM-powered intelligent search agent for complex, cross-domain, or ambiguous queries that require multi-step reasoning.
+
+**Limitations**:
+- Query length: 2–1,000 characters.
+- Maximum results: 1–20 (default: 5).
+- **Non-idempotent**: Same query may return different results across calls (LLM reasoning varies).
+- **Response time: typically 2–8 seconds** (involves LLM inference). May take longer (10–30+ seconds) when the agent triggers `web_search` for external information.
+- Internally uses LangChain ReAct agent with `jq` for local data queries plus optional `web_search`. The web search step is not user-controllable.
+- **Use `search_source` instead** for simple keyword matching or structured filtering — it is faster, deterministic, and cheaper.
+
+### Tool: `get_access_guide`
+
+**Purpose**: Generate detailed access instructions for a specific data source using RAG (Retrieval-Augmented Generation).
+
+**Limitations**:
+- **Not all data sources have instruction libraries.** If a source has no pre-built instructions, results will be empty or irrelevant.
+- Invalid `source_id` returns `{"error": "数据源 xxx 不存在"}`.
+- `top_k` range: 1–5 (default: 3).
+- **Response time is highly variable: 3–20 seconds**, depending on RAG retrieval complexity and server load.
+- Retrieval quality depends heavily on the specificity of the `operation` parameter. Vague descriptions yield lower-quality matches. Use specific action verbs and entity names (e.g., "查询2024年M2货币供应量数据" rather than "查数据").
+
+### Tool: `report_feedback`
+
+**Purpose**: Submit user feedback to the development team when FirstData has a confirmed issue.
+
+**Limitations**:
+- `feedback_message` length: 10–2,000 characters.
+- **Non-idempotent**: Duplicate calls create duplicate feedback entries. Do not retry on success.
+- Only use when a genuine issue is confirmed (missing source, incorrect data, broken functionality). Do not use as a general comment channel.
+
+**Examples**:
+
+```
+# Example 1: Broken link
+feedback_message="链接失效：数据源 china-pbc 的 data_url 返回 404，无法访问数据页面。检索关键词：中国货币供应量"
+
+# Example 2: Outdated content
+feedback_message="数据内容过时：数据源 worldbank-open-data 的 update_frequency 标注为 quarterly，但实际已超过 6 个月未更新"
+```
+
+## Description Quality Guidelines
+
+When adding or modifying MCP tool descriptions, follow these principles (based on [MCP tool description quality research](https://arxiv.org/abs/2602.14878)):
+
+**Core principle: "Write it right before writing it all"** — Functionality accuracy (+11.6% impact) matters ~8× more than Conciseness (+1.5%).
+
+**6-dimension checklist** (check all before submitting):
+
+- [ ] **Purpose**: Is the tool's function clearly stated in the first sentence?
+- [ ] **Guidelines**: Are usage scenarios and when-to-use / when-not-to-use rules included?
+- [ ] **Examples**: Are typical input/output examples provided?
+- [ ] **Limitations**: Are constraints, edge cases, and known limitations documented?
+- [ ] **Parameters**: Are all parameters described with types, ranges, and defaults?
+- [ ] **Return Format**: Is the response structure documented?
+
 ## Community
 
 FirstData is an open-source project — join us in building the authoritative data source knowledge base for agents:

diff --git a/skills/firstdata/mcp-tool-descriptions-draft.md b/skills/firstdata/mcp-tool-descriptions-draft.md
@@ -0,0 +1,91 @@
+# MCP Tool Descriptions — Server-Side Draft
+
+> **Purpose**: This file contains the text to be added to each tool's description in the MCP server Python code.
+> After PR review approval, copy these Limitations sections into the server-side tool description strings.
+> Content is condensed from `SKILL.md` for server-side use. Semantics must match; formatting may differ slightly for plain-text context.
+
+---
+
+## search_source — Add to description
+
+```
+**Limitations:**
+- Maximum 200 results per query (limit range: 1–200, default: 20)
+- Each keyword is matched as an independent substring — pass each search term as a separate array element. Use ["中国", "GDP"] instead of ["中国 GDP"]. This preserves multi-word terms like "New Zealand" or "World Bank"
+- Keyword matching is substring-based, not semantic search
+- domain parameter uses substring matching, not exact enum matching
+- No boolean operators (AND/OR/NOT). Multiple keywords use OR logic (results matching any keyword are returned, deduplicated)
+- Response time: typically ~1 second
+- Subject to daily API call quota per token. MCP tool calls do not return remaining quota; use Token verification API (POST /api/token/verify, returns remaining_daily) to check
+```
+
+## get_source — Add to description
+
+```
+**Limitations:**
+- Invalid source_id does NOT set isError=true. Returns {"id": "xxx", "error": "Not found"} in the result array. Callers must check individual items for error fields
+- No schema-level limit on source_ids count, but large batch performance is unverified. Practical guideline (not a hard limit): batch in groups of ~20
+- Subject to daily API call quota per token
+```
+
+## ask_agent — Add to description
+
+```
+**Limitations:**
+- Query length: 2–1,000 characters
+- Maximum results: 1–20 (default: 5)
+- Non-idempotent: same query may return different results (LLM reasoning varies)
+- Response time: typically 2–8 seconds; may reach 10–30+ seconds when web_search is triggered
+- Subject to daily API call quota per token
+```
+
+## get_access_guide — Add to description
+
+```
+**Limitations:**
+- Not all data sources have instruction libraries. Sources without pre-built instructions return empty or irrelevant results
+- Invalid source_id returns {"error": "数据源 xxx 不存在"}
+- top_k range: 1–5 (default: 3)
+- Response time is highly variable: 3–20 seconds depending on RAG retrieval complexity and server load
+- Retrieval quality depends on specificity of the operation parameter. Use specific action verbs and entity names
+- Subject to daily API call quota per token
+```
+
+## report_feedback — Add to description + Example
+
+```
+**Limitations:**
+- feedback_message length: 10–2,000 characters
+- Non-idempotent: duplicate calls create duplicate feedback entries. Do not retry on success
+- Subject to daily API call quota per token
+
+**示例:**
+- 链接失效反馈: feedback_message="链接失效：数据源 china-pbc 的 data_url 返回 404，无法访问数据页面。检索关键词：中国货币供应量"
+- 数据过时反馈: feedback_message="数据内容过时：数据源 worldbank-open-data 的 update_frequency 标注为 quarterly，但实际已超过 6 个月未更新"
+```
+
+---
+
+## Verification Evidence
+
+Each limitation is backed by one of these sources:
+
+| Limitation | Source |
+|---|---|
+| search_source limit: 1–200 | inputSchema `maximum: 200, minimum: 1` |
+| Keywords not auto-tokenized | Tested: `["中国 GDP"]` → 0 results; `["中国", "GDP"]` → 173 results |
+| Multiple keywords use OR logic | Tested: `["GDP"]`→100, `["health"]`→78, `["GDP","health"]`→138 (>max, confirmed OR) |
+| Substring matching | Tested: `["中国GDP"]` → 1 result (exact substring); `["GDP"]` → 100 results |
+| domain substring matching | inputSchema description: "领域关键词，子串匹配" |
+| get_source silent error | Tested: invalid ID returns `{"id":"xxx","error":"Not found"}` with `isError: false` |
+| get_source mixed valid/invalid | Tested: valid IDs return data, invalid return error objects, no request interruption |
+| ask_agent query length | inputSchema `minLength: 2, maxLength: 1000` |
+| ask_agent max_results | inputSchema `minimum: 1, maximum: 20, default: 5` |
+| ask_agent non-idempotent | annotations `idempotentHint: false` |
+| ask_agent response time | Tested 3 runs: 7.4s, 2.9s, 1.8s |
+| get_access_guide invalid source | Tested: returns `{"error": "数据源 xxx 不存在"}` |
+| get_access_guide top_k | inputSchema `minimum: 1, maximum: 5, default: 3` |
+| get_access_guide response time | Tested 3 runs: 3.0s, 17.6s, 19.1s |
+| report_feedback message length | inputSchema `minLength: 10, maxLength: 2000` |
+| Daily call quota exists | TokenVerifyResponse schema: `quota_allowed`, `remaining_daily` fields |
+| Trial quota: 30/day | Tested via `/api/trial/session-info`: `total_calls: 30` |