39 changes: 31 additions & 8 deletions AGENTS.md
@@ -4,22 +4,45 @@ This file provides guidance to AI coding agents (Claude Code, Cursor, Copilot, e

## Repository Overview

A collection of skills for AI agents working with ClickHouse databases. Skills are packaged instructions and guidelines that extend agent capabilities for database design, query optimization, and operational best practices.
A collection of skills for AI agents working with ClickHouse databases and chdb (in-process ClickHouse for Python). Skills are packaged instructions and guidelines that extend agent capabilities for database design, query optimization, operational best practices, and in-process data analytics.

## Repository Structure

```
agent-skills/
├── skills/
│ └── clickhouse-best-practices/ # ClickHouse optimization guidelines
│ ├── SKILL.md # Skill definition (overview)
│ ├── AGENTS.md # Full compiled guide (generated)
│ ├── clickhouse-best-practices/ # ClickHouse optimization guidelines
│ │ ├── SKILL.md # Skill definition (overview)
│ │ ├── AGENTS.md # Full compiled guide (generated)
│ │ ├── metadata.json # Version, organization, abstract
│ │ ├── README.md # Maintainer guide
│ │ └── rules/ # Individual rule files
│ │ ├── _sections.md # Section metadata
│ │ ├── _template.md # Template for new rules
│ │ └── *.md # Rule files (e.g., query-use-prewhere.md)
│ ├── chdb-datastore/ # chdb pandas-compatible DataStore API
│ │ ├── SKILL.md # Skill definition and quick-start
│ │ ├── metadata.json # Version, organization, abstract
│ │ ├── README.md # Maintainer guide
│ │ ├── references/ # API reference docs
│ │ │ ├── api-reference.md # DataStore method signatures
│ │ │ └── connectors.md # All data source connection methods
│ │ ├── examples/
│ │ │ └── examples.md # Runnable examples
│ │ └── scripts/
│ │ └── verify_install.py # Environment verification
│ └── chdb-sql/ # chdb SQL API
│ ├── SKILL.md # Skill definition and quick-start
│ ├── metadata.json # Version, organization, abstract
│ ├── README.md # Maintainer guide
│ └── rules/ # Individual rule files
│ ├── _sections.md # Section metadata
│ ├── _template.md # Template for new rules
│ └── *.md # Rule files (e.g., query-use-prewhere.md)
│ ├── references/ # SQL reference docs
│ │ ├── api-reference.md # query/Session/connect signatures
│ │ ├── table-functions.md # ClickHouse table functions
│ │ └── sql-functions.md # Commonly used SQL functions
│ ├── examples/
│ │ └── examples.md # Runnable examples
│ └── scripts/
│ └── verify_install.py # Environment verification
├── packages/
│ └── clickhouse-best-practices-build/ # Build tooling
│ ├── package.json # Bun scripts
39 changes: 33 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
# ClickHouse Agent Skills

The official Agent Skills for [ClickHouse](https://clickhouse.com/). These skills help LLMs and agents to adopt best practices when working with ClickHouse.
The official Agent Skills for [ClickHouse](https://clickhouse.com/). These skills help LLMs and agents to adopt best practices when working with ClickHouse and [chdb](https://clickhouse.com/docs/chdb) (in-process ClickHouse for Python).

You can use these skills with open-source ClickHouse and managed ClickHouse Cloud. [Try ClickHouse Cloud with $300 in free credits](https://clickhouse.com/cloud?utm_medium=github&utm_source=github&utm_ref=agent-skills).

@@ -14,9 +14,9 @@ The CLI auto-detects installed agents and prompts you to select where to install

## What is this?

Agent Skills are packaged instructions that extend AI coding agents (Claude Code, Cursor, Copilot, etc.) with domain-specific expertise. This repository provides skills for ClickHouse databases, covering schema design, query optimization, and data ingestion patterns.
Agent Skills are packaged instructions that extend AI coding agents (Claude Code, Cursor, Copilot, etc.) with domain-specific expertise. This repository provides skills for ClickHouse databases and chdb — covering schema design, query optimization, data ingestion patterns, and in-process analytics with Python.

When an agent loads these skills, it gains knowledge of ClickHouse best practices and can apply them while helping you design tables, write queries, or troubleshoot performance issues.
When an agent loads these skills, it gains knowledge of ClickHouse best practices and chdb APIs, and can apply them while helping you design tables, write queries, analyze data, or troubleshoot performance issues.

Skills follow the open specification at [agentskills.io](https://agentskills.io).

@@ -46,9 +46,25 @@ Skills follow the open specification at [agentskills.io](https://agentskills.io)

**For agents:** The skill activates automatically when you work with ClickHouse—creating tables, writing queries, or designing data pipelines.

### chdb DataStore

**Pandas-compatible API** for chdb — drop-in pandas replacement backed by ClickHouse. Write `import chdb.datastore as pd` and use the same pandas API, 10-100x faster. Supports 16+ data sources (MySQL, PostgreSQL, S3, MongoDB, Iceberg, Delta Lake, etc.) with cross-source joins.

**Location:** [`skills/chdb-datastore/`](./skills/chdb-datastore/)

**For agents:** The skill activates when you analyze data with pandas-style syntax, speed up slow pandas code, query remote databases as DataFrames, or join data across different sources.

### chdb SQL

**In-process ClickHouse SQL** for Python — run SQL queries on local files, remote databases, and cloud storage without a server. Covers `chdb.query()`, Session, DB-API 2.0, parametrized queries, UDFs, streaming, and all ClickHouse table functions.

**Location:** [`skills/chdb-sql/`](./skills/chdb-sql/)

**For agents:** The skill activates when you write SQL queries against files, use ClickHouse table functions, build stateful analytical pipelines, or use advanced ClickHouse SQL features.

## Quick Start

After installation, your AI agent will reference these best practices when:
After installation, your AI agent will reference these skills when:

- Creating new tables with `CREATE TABLE`
- Choosing `ORDER BY` / `PRIMARY KEY` columns
@@ -57,11 +73,22 @@ After installation, your AI agent will reference these skills when:
- Writing or tuning JOINs
- Designing data ingestion pipelines
- Handling updates or deletes
- Analyzing data with pandas-style DataStore API
- Querying files or databases with chdb SQL
- Joining data across different sources (MySQL + S3 + local files)

Example prompt:
Example prompts:
> "Create a table for storing user events with fields for user_id, event_type, properties (JSON), and timestamp"

The agent will apply relevant rules like proper column ordering in the primary key, appropriate data types, and partitioning strategy.
The agent will apply relevant ClickHouse best practices rules.

> "Load this Parquet file and group by country, show top 10 by revenue"

The agent will use chdb DataStore or SQL to query the file directly.

> "Join my MySQL customers table with this local orders.parquet file"

The agent will use chdb's cross-source join capabilities.

## Supported Agents

39 changes: 39 additions & 0 deletions skills/chdb-datastore/README.md
@@ -0,0 +1,39 @@
# chdb DataStore

Agent skill for using chdb's pandas-compatible DataStore API — a drop-in pandas replacement backed by ClickHouse.

## Installation

```bash
npx skills add clickhouse/agent-skills
```

## What's Included

| File | Purpose |
|------|---------|
| `SKILL.md` | Skill definition and quick-start guide |
| `references/api-reference.md` | Full DataStore method signatures |
| `references/connectors.md` | All 16+ data source connection methods |
| `examples/examples.md` | 11 runnable examples with expected output |
| `scripts/verify_install.py` | Environment verification script |

## Trigger Phrases

This skill activates when you:
- "Analyze this file with pandas"
- "Speed up my pandas code"
- "Query this MySQL/PostgreSQL/S3 table as a DataFrame"
- "Join data from different sources"
- "Use DataStore to..."
- "Import datastore as pd"

## Related

- **chdb-sql** — For raw ClickHouse SQL queries, use the `chdb-sql` skill instead
- **clickhouse-best-practices** — For ClickHouse schema/query optimization

## Documentation

- [chdb docs](https://clickhouse.com/docs/chdb)
- [chdb GitHub](https://github.com/chdb-io/chdb)
146 changes: 146 additions & 0 deletions skills/chdb-datastore/SKILL.md
@@ -0,0 +1,146 @@
---
name: chdb-datastore
description: >-
Drop-in pandas replacement with ClickHouse performance. Use
`import chdb.datastore as pd` (or `from datastore import DataStore`)
and write standard pandas code — same API, 10-100x faster on large
datasets. Supports 16+ data sources (MySQL, PostgreSQL, S3, MongoDB,
ClickHouse, Iceberg, Delta Lake, etc.) and 10+ file formats (Parquet,
CSV, JSON, Arrow, ORC, etc.) with cross-source joins. Use this skill
when the user wants to analyze data with pandas-style syntax, speed
up slow pandas code, query remote databases or cloud storage as
DataFrames, or join data across different sources — even if they
don't explicitly mention chdb or DataStore. Do NOT use for raw SQL
queries, ClickHouse server administration, or non-Python languages.
license: Apache-2.0
compatibility: Requires Python 3.9+, macOS or Linux. pip install chdb.
metadata:
author: chdb-io
version: "4.1"
homepage: https://clickhouse.com/docs/chdb
---

# chdb DataStore — It's Just Faster Pandas

## The Key Insight

```python
# Change this:
import pandas as pd
# To this:
import chdb.datastore as pd
# Everything else stays the same.
```

DataStore is a **lazy, ClickHouse-backed pandas replacement**. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., `print()`, `len()`, iteration).

```bash
pip install chdb
```

## Decision Tree: Pick the Right Approach

```
1. "I have a file/database and want to analyze it with pandas"
→ DataStore.from_file() / from_mysql() / from_s3() etc.
→ See references/connectors.md

2. "I need to join data from different sources"
→ Create DataStores from each source, use .join()
→ See examples/examples.md #3-5

3. "My pandas code is too slow"
→ import chdb.datastore as pd — change one line, keep the rest

4. "I need raw SQL queries"
→ Use the chdb-sql skill instead
```

## Connect to Any Data Source — One Pattern

```python
from datastore import DataStore

# Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")

# Database
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")

# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)

# URI shorthand — auto-detects source type
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
```

All 16+ sources and URI schemes → [connectors.md](references/connectors.md)

## After Connecting — Full Pandas API

```python
result = ds[ds["age"] > 25] # filter
result = ds[["name", "city"]] # select columns
result = ds.sort_values("revenue", ascending=False) # sort
result = ds.groupby("dept")["salary"].mean() # groupby
result = ds.assign(margin=lambda x: x["profit"] / x["revenue"]) # computed column
ds["name"].str.upper() # string accessor
ds["date"].dt.year # datetime accessor
result = ds1.join(ds2, on="id") # join
result = ds.head(10) # preview
print(ds.to_sql()) # see generated SQL
```

209 DataFrame methods supported. Full API → [api-reference.md](references/api-reference.md)

## Cross-Source Join — The Killer Feature

```python
from datastore import DataStore

customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")

result = (orders
.join(customers, left_on="customer_id", right_on="id")
.groupby("country")
.agg({"amount": "sum", "rating": "mean"})
.sort_values("amount", ascending=False))
print(result)
```

More join examples → [examples.md](examples/examples.md)

## Writing Data

```python
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="summary.parquet", format="Parquet")

target.insert_into("category", "total", "count").select_from(
source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
```

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `ImportError: No module named 'chdb'` | `pip install chdb` |
| `ImportError: cannot import 'DataStore'` | Use `from datastore import DataStore` or `from chdb.datastore import DataStore` |
| Database connection timeout | Include port in host: `host="db:3306"` not `host="db"` |
| Join returns empty result | Check key types match (both int or both string); use `.to_sql()` to inspect |
| Unexpected results | Call `ds.to_sql()` to see the generated SQL and debug |
| Environment check | Run `python scripts/verify_install.py` (from skill directory) |

## References

- [API Reference](references/api-reference.md) — Full DataStore method signatures
- [Connectors](references/connectors.md) — All 16+ data source connection methods
- [Examples](examples/examples.md) — 10+ runnable examples with expected output
- [Verify Install](scripts/verify_install.py) — Environment verification script
- [Official Docs](https://clickhouse.com/docs/chdb)

> Note: This skill teaches how to *use* chdb DataStore.
> For raw SQL queries, use the `chdb-sql` skill.
> For contributing to chdb source code, see CLAUDE.md in the project root.