88 changes: 88 additions & 0 deletions skills/firstdata/SKILL.md
@@ -86,6 +86,94 @@ Or add manually to your MCP config:

Once connected, browse the tool list provided by the firstdata MCP and select the appropriate tool based on your needs.

## MCP Tools Reference

The FirstData MCP server provides 5 tools. Below is a reference with usage guidelines, limitations, and examples.

### Common Limitations (all tools)

- **Authentication required**: All tools require a valid API key (a JWT) sent in the `Authorization: Bearer <token>` header.
- **Daily call quota**: Each token has a daily call quota that depends on the API key tier (trial accounts: 30 calls/day). MCP tool responses do not include the remaining quota. To check it, call the token verification API (`POST /api/token/verify`), whose response includes `remaining_daily`. This is a separate HTTP call and is not available through MCP tool invocation.
- **Network dependency**: All tools make HTTP calls to the FirstData server (`firstdata.deepminer.com.cn`). Network latency and server availability affect response times.
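
A minimal sketch of the out-of-band quota check described above. Only the host, the `/api/token/verify` path, and the `remaining_daily` field come from this reference; the `https` scheme, the empty JSON body, sending the token in the `Authorization` header for this endpoint, and all helper names are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Optional

# Host taken from this reference; the https scheme is an assumption.
BASE_URL = "https://firstdata.deepminer.com.cn"


def build_verify_request(token: str) -> urllib.request.Request:
    """Build the POST /api/token/verify request (token placement is assumed)."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/token/verify",
        data=b"{}",  # hypothetical empty JSON body
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def remaining_daily(response_body: str) -> Optional[int]:
    """Return remaining_daily from a JSON response body, or None if absent."""
    payload = json.loads(response_body)
    value = payload.get("remaining_daily")
    return value if isinstance(value, int) else None


def fetch_remaining_daily(token: str, timeout: float = 10.0) -> Optional[int]:
    """Perform the HTTP call; this is the only function that needs network access."""
    with urllib.request.urlopen(build_verify_request(token), timeout=timeout) as resp:
        return remaining_daily(resp.read().decode("utf-8"))
```

Checking the quota this way before a long batch of tool calls avoids discovering an exhausted quota partway through.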

### Tool: `search_source`

**Purpose**: Unified data source search tool supporting keyword search, structured filtering, pagination, and multiple output modes.

**Limitations**:
- Maximum **200** results per query (`limit` parameter range: 1–200, default: 20).
- ⚠️ **Each keyword is matched as an independent substring, so pass each search term as a separate array element.** For example, use `["中国", "GDP"]` (173 results; 中国 means "China") instead of `["中国 GDP"]` (0 results). This is by design: it preserves multi-word terms like `"New Zealand"` or `"World Bank"`.
- Keyword matching is **substring-based**, not semantic search. Keywords are matched against source metadata fields (name, description, tags, content).
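- The `domain` parameter uses **substring matching**, not exact enum matching (e.g., `"finance"` matches `"public-finance"` and `"finance"`). Under literal substring matching, `"finance"` does not match `"financial-markets"`, which contains `"financ"` but not `"finance"`; use a shorter stem such as `"financ"` to cover all three.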
- No boolean operators (AND/OR/NOT). Multiple keywords in the array are combined with **OR logic** (results matching any keyword are returned, deduplicated).
- Response time: typically **~1 second**.
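
The matching rules above can be modelled locally as a rough sketch. This is not the server's implementation: the function names and sample texts are hypothetical, case-sensitive matching is an assumption, and clamping `limit` on the client side is a design choice rather than documented server behaviour.

```python
from __future__ import annotations


def keyword_hits(text: str, keywords: list[str]) -> bool:
    """OR logic: True if any keyword occurs as a literal substring of text."""
    return any(keyword in text for keyword in keywords)


def domain_hits(domain_value: str, domain_filter: str) -> bool:
    """Substring match on the domain, not an exact enum comparison."""
    return domain_filter in domain_value


def clamp_limit(limit: int) -> int:
    """Keep limit inside the documented 1 to 200 range before sending it."""
    return max(1, min(200, limit))


# Hypothetical metadata text for one source.
sample = "中国国家统计局 GDP 数据"

# Two separate elements: either one matching is enough.
print(keyword_hits(sample, ["中国", "GDP"]))  # True

# One element containing a space is searched for as a single substring.
print(keyword_hits(sample, ["中国 GDP"]))  # False
```

The second call fails because the exact string `"中国 GDP"` never appears in the sample text, which mirrors the 0-result case described above.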

### Tool: `get_source`

**Purpose**: Retrieve full details for specific data sources by their IDs.

**Limitations**:
- Invalid `source_id` values do NOT cause an error response (`isError: false`). Instead, the result array includes `{"id": "xxx", "error": "Not found"}` for each invalid ID alongside valid results. Callers must check individual items for `error` fields rather than relying solely on `isError`.
- No schema-level limit on the number of `source_ids` per request, but performance with large batches (50+) is unverified. As a practical guideline (not a hard limit), consider batching in groups of ~20.
- The `fields` parameter filters returned fields; when omitted, all fields are returned.
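
Because invalid IDs do not set `isError`, callers need to inspect each item. The sketch below shows one way to split a result array and to batch IDs in groups of 20; the helper names and the `name` field in the usage example are hypothetical, and the result is assumed to be an already-parsed list of dictionaries.

```python
from __future__ import annotations

from typing import Any, Iterator


def batched_ids(source_ids: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield source_ids in chunks of at most `size` (a guideline, not a hard limit)."""
    for start in range(0, len(source_ids), size):
        yield source_ids[start:start + size]


def split_results(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Separate found sources from IDs that came back with an `error` field."""
    found: list[dict[str, Any]] = []
    missing: list[str] = []
    for item in items:
        if "error" in item:
            missing.append(item.get("id", "<unknown>"))
        else:
            found.append(item)
    return found, missing


found, missing = split_results(
    [{"id": "china-pbc", "name": "PBC"}, {"id": "xxx", "error": "Not found"}]
)
print(missing)  # ['xxx']
```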

### Tool: `ask_agent`

**Purpose**: LLM-powered intelligent search agent for complex, cross-domain, or ambiguous queries that require multi-step reasoning.

**Limitations**:
- Query length: 2–1,000 characters.
- `max_results` range: 1–20 (default: 5).
- **Non-idempotent**: Same query may return different results across calls (LLM reasoning varies).
- **Response time: typically 2–8 seconds** (involves LLM inference). May take longer (10–30+ seconds) when the agent triggers `web_search` for external information.
- Internally uses LangChain ReAct agent with `jq` for local data queries plus optional `web_search`. The web search step is not user-controllable.
- **Use `search_source` instead** for simple keyword matching or structured filtering — it is faster, deterministic, and cheaper.
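
Since every `ask_agent` call is slow and counts against the daily quota, it can be worth validating arguments before sending them. A minimal sketch, assuming the parameter is named `query` (the name `max_results` appears in the input schema) and that the server counts length in characters the same way Python's `len` does:

```python
def validate_ask_agent_args(query: str, max_results: int = 5) -> dict:
    """Check the documented ranges and return the argument dict to send."""
    if not 2 <= len(query) <= 1000:
        raise ValueError("query must be 2 to 1,000 characters long")
    if not 1 <= max_results <= 20:
        raise ValueError("max_results must be between 1 and 20")
    return {"query": query, "max_results": max_results}


args = validate_ask_agent_args("Which sources cover cross-border capital flows?")
print(args["max_results"])  # 5
```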

### Tool: `get_access_guide`

**Purpose**: Generate detailed access instructions for a specific data source using RAG (Retrieval-Augmented Generation).

**Limitations**:
- **Not all data sources have instruction libraries.** If a source has no pre-built instructions, results will be empty or irrelevant.
- Invalid `source_id` returns `{"error": "数据源 xxx 不存在"}` (the message means "data source xxx does not exist").
- `top_k` range: 1–5 (default: 3).
- **Response time is highly variable: 3–20 seconds**, depending on RAG retrieval complexity and server load.
- Retrieval quality depends heavily on the specificity of the `operation` parameter. Vague descriptions yield lower-quality matches. Use specific action verbs and entity names (e.g., "query M2 money supply data for 2024" rather than "look up data").
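
One way to surface the invalid-source case is to turn the `error` object into an exception. This is a sketch under assumptions: the tool result is assumed to arrive as a JSON string, the `SourceNotFound` class and `parse_guide_result` helper are hypothetical, and the shape of a successful response is not documented here, so it is returned unchanged.

```python
import json
from typing import Any


class SourceNotFound(LookupError):
    """Raised when the result carries an `error` field."""


def parse_guide_result(raw: str) -> Any:
    """Parse a get_access_guide result and raise if it reports an error."""
    payload = json.loads(raw)
    if isinstance(payload, dict) and "error" in payload:
        raise SourceNotFound(str(payload["error"]))
    return payload
```

An empty or irrelevant result for a source without an instruction library is a separate case that this check does not detect.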

### Tool: `report_feedback`

**Purpose**: Submit user feedback to the development team when FirstData has a confirmed issue.

**Limitations**:
- `feedback_message` length: 10–2,000 characters.
- **Non-idempotent**: Duplicate calls create duplicate feedback entries. Do not retry on success.
- Only use when a genuine issue is confirmed (missing source, incorrect data, broken functionality). Do not use as a general comment channel.

**Examples**:

```
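# Example 1: Broken link
feedback_message="Broken link: the data_url of data source china-pbc returns 404, so the data page cannot be accessed. Search keywords: China money supply"

# Example 2: Outdated content
feedback_message="Outdated content: data source worldbank-open-data lists update_frequency as quarterly, but it has not been updated for more than 6 months"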
```
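
Because duplicate calls create duplicate entries, a caller may want a small guard around submission. The sketch below is one possible approach, not part of FirstData: `FeedbackGuard` is a hypothetical class, and `send` stands in for whatever function actually invokes `report_feedback`.

```python
import hashlib
from typing import Callable, Set


class FeedbackGuard:
    """Validate message length and skip messages that were already sent."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._sent: Set[str] = set()

    def submit(self, message: str) -> bool:
        """Return True if the message was sent, False if it was a duplicate."""
        if not 10 <= len(message) <= 2000:
            raise ValueError("feedback_message must be 10 to 2,000 characters")
        digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
        if digest in self._sent:
            return False
        self._send(message)
        # Recorded only after send() returns, so a failed send can be retried.
        self._sent.add(digest)
        return True
```

The guard only remembers messages within one process; deduplication across sessions would need persistent storage.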

## Description Quality Guidelines

When adding or modifying MCP tool descriptions, follow these principles (based on [MCP tool description quality research](https://arxiv.org/abs/2602.14878)):

**Core principle: "Write it right before writing it all"** — Functionality accuracy (+11.6% impact) matters ~8× more than Conciseness (+1.5%).

**6-dimension checklist** (check all before submitting):

- [ ] **Purpose**: Is the tool's function clearly stated in the first sentence?
- [ ] **Guidelines**: Are usage scenarios and when-to-use / when-not-to-use rules included?
- [ ] **Examples**: Are typical input/output examples provided?
- [ ] **Limitations**: Are constraints, edge cases, and known limitations documented?
- [ ] **Parameters**: Are all parameters described with types, ranges, and defaults?
- [ ] **Return Format**: Is the response structure documented?

## Community

FirstData is an open-source project — join us in building the authoritative data source knowledge base for agents:
91 changes: 91 additions & 0 deletions skills/firstdata/mcp-tool-descriptions-draft.md
@@ -0,0 +1,91 @@
# MCP Tool Descriptions — Server-Side Draft

> **Purpose**: This file contains the text to be added to each tool's description in the MCP server Python code.
> After PR review approval, copy these Limitations sections into the server-side tool description strings.
> Content is condensed from `SKILL.md` for server-side use. Semantics must match; formatting may differ slightly for plain-text context.

---

## search_source — Add to description

```
**Limitations:**
- Maximum 200 results per query (limit range: 1–200, default: 20)
- Each keyword is matched as an independent substring — pass each search term as a separate array element. Use ["中国", "GDP"] instead of ["中国 GDP"]. This preserves multi-word terms like "New Zealand" or "World Bank"
- Keyword matching is substring-based, not semantic search
- domain parameter uses substring matching, not exact enum matching
- No boolean operators (AND/OR/NOT). Multiple keywords use OR logic (results matching any keyword are returned, deduplicated)
- Response time: typically ~1 second
- Subject to daily API call quota per token. MCP tool calls do not return remaining quota; use Token verification API (POST /api/token/verify, returns remaining_daily) to check
```

## get_source — Add to description

```
**Limitations:**
- Invalid source_id does NOT set isError=true. Returns {"id": "xxx", "error": "Not found"} in the result array. Callers must check individual items for error fields
- No schema-level limit on source_ids count, but large batch performance is unverified. Practical guideline (not a hard limit): batch in groups of ~20
- Subject to daily API call quota per token
```

## ask_agent — Add to description

```
**Limitations:**
- Query length: 2–1,000 characters
- Maximum results: 1–20 (default: 5)
- Non-idempotent: same query may return different results (LLM reasoning varies)
- Response time: typically 2–8 seconds; may reach 10–30+ seconds when web_search is triggered
- Subject to daily API call quota per token
```

## get_access_guide — Add to description

```
**Limitations:**
- Not all data sources have instruction libraries. Sources without pre-built instructions return empty or irrelevant results
- Invalid source_id returns {"error": "数据源 xxx 不存在"}
- top_k range: 1–5 (default: 3)
- Response time is highly variable: 3–20 seconds depending on RAG retrieval complexity and server load
- Retrieval quality depends on specificity of the operation parameter. Use specific action verbs and entity names
- Subject to daily API call quota per token
```

## report_feedback — Add to description + Example

```
**Limitations:**
- feedback_message length: 10–2,000 characters
- Non-idempotent: duplicate calls create duplicate feedback entries. Do not retry on success
- Subject to daily API call quota per token

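**Examples:**
- Broken-link feedback: feedback_message="Broken link: the data_url of data source china-pbc returns 404, so the data page cannot be accessed. Search keywords: China money supply"
- Outdated-data feedback: feedback_message="Outdated content: data source worldbank-open-data lists update_frequency as quarterly, but it has not been updated for more than 6 months"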
```

---

## Verification Evidence

Each limitation is backed by one of these sources:

| Limitation | Source |
|---|---|
| search_source limit: 1–200 | inputSchema `maximum: 200, minimum: 1` |
| Keywords not auto-tokenized | Tested: `["中国 GDP"]` → 0 results; `["中国", "GDP"]` → 173 results |
| Multiple keywords use OR logic | Tested: `["GDP"]` → 100, `["health"]` → 78, `["GDP","health"]` → 138 (more than either keyword alone, which rules out AND and confirms OR) |
| Substring matching | Tested: `["中国GDP"]` → 1 result (exact substring); `["GDP"]` → 100 results |
| domain substring matching | inputSchema description: "领域关键词,子串匹配" ("domain keyword, substring match") |
| get_source silent error | Tested: invalid ID returns `{"id":"xxx","error":"Not found"}` with `isError: false` |
| get_source mixed valid/invalid | Tested: valid IDs return data, invalid return error objects, no request interruption |
| ask_agent query length | inputSchema `minLength: 2, maxLength: 1000` |
| ask_agent max_results | inputSchema `minimum: 1, maximum: 20, default: 5` |
| ask_agent non-idempotent | annotations `idempotentHint: false` |
| ask_agent response time | Tested 3 runs: 7.4s, 2.9s, 1.8s |
| get_access_guide invalid source | Tested: returns `{"error": "数据源 xxx 不存在"}` |
| get_access_guide top_k | inputSchema `minimum: 1, maximum: 5, default: 3` |
| get_access_guide response time | Tested 3 runs: 3.0s, 17.6s, 19.1s |
| report_feedback message length | inputSchema `minLength: 10, maxLength: 2000` |
| Daily call quota exists | TokenVerifyResponse schema: `quota_allowed`, `remaining_daily` fields |
| Trial quota: 30/day | Tested via `/api/trial/session-info`: `total_calls: 30` |
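
The OR-logic row can be checked with inclusion-exclusion. The sketch below assumes the reported counts are total match counts (not truncated by `limit`) and that the combined result is an exact deduplicated union; the function name is hypothetical.

```python
def overlap_under_or(count_a: int, count_b: int, count_union: int) -> int:
    """Number of items matching both keywords if the combined result is a union."""
    if not max(count_a, count_b) <= count_union <= count_a + count_b:
        raise ValueError("counts are not consistent with OR (union) semantics")
    return count_a + count_b - count_union


# ["GDP"] -> 100, ["health"] -> 78, ["GDP", "health"] -> 138
print(overlap_under_or(100, 78, 138))  # 40
```

Under AND logic the combined count could not exceed the smaller single-keyword count (78), so a combined count of 138 is only consistent with OR, with 40 sources matching both keywords.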