Hi @pjhampton ,
Following the feedback on PR #9, I wanted to transition the discussion to the architectural "why" behind those skills. My goal is to bridge the gap where general-purpose LLMs fail to utilize ClickHouse’s specific strengths, particularly in Time-Series Analytics.
The Problem: The "Generic DDL" Trap
Standard LLMs default to standard SQL patterns. For high-volume time-series data, this leads to:
Query-Time Latency: Agents suggest simple SELECT queries over raw data rather than shifting compute to insert time.
Storage Inefficiency: Missing opportunities for column-level codecs that can reduce disk footprint by over 80%.
Proposed Architectural Skills (Non-Executable Rules):
These can be implemented as purely descriptive rules or prompts, requiring no code execution in your AI cloud:
- data-modeling-materialized-views
Guides the agent to recognize heavy aggregation workloads (e.g., IoT sensor heartbeats) and enforce the Target Table + MV pattern.
Example Heuristic: If the workload requires COUNT/SUM over time intervals, the agent must recommend SummingMergeTree or AggregatingMergeTree.
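To make the pattern concrete, here is a minimal sketch of the Target Table + MV shape the rule would steer the agent toward. The table and column names (`sensor_readings`, `sensor_id`, etc.) are hypothetical placeholders, not part of the proposed skill itself:

```sql
-- Hypothetical raw table of IoT sensor heartbeats
CREATE TABLE sensor_readings
(
    sensor_id UInt32,
    ts        DateTime,
    value     Float64
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts);

-- Pre-aggregated target table: compute shifts to insert time
CREATE TABLE sensor_readings_1m
(
    sensor_id UInt32,
    minute    DateTime,
    cnt       UInt64,
    sum_value Float64
)
ENGINE = SummingMergeTree
ORDER BY (sensor_id, minute);

-- Materialized view feeding the target table on every insert
CREATE MATERIALIZED VIEW sensor_readings_1m_mv
TO sensor_readings_1m
AS SELECT
    sensor_id,
    toStartOfMinute(ts) AS minute,
    count()             AS cnt,
    sum(value)          AS sum_value
FROM sensor_readings
GROUP BY sensor_id, minute;
```

Queries for per-minute counts and sums then hit the small pre-aggregated table instead of scanning raw rows.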
- schema-compression-codecs
Codifies specific heuristics for high-volume columns that go beyond default LZ4.
  - Timestamp Columns: Enforce DoubleDelta for monotonic sequences.
  - Metric/Sensor Data: Enforce Gorilla for floating-point values.
  - Low-Cardinality Strings: Enforce LowCardinality(String) with ZSTD.
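As an illustration of what the rule would emit, a sketch of the column-level codec syntax (table and column names are hypothetical examples, not part of the skill):

```sql
CREATE TABLE sensor_metrics
(
    -- Monotonic timestamps: delta-of-delta encoding
    ts        DateTime CODEC(DoubleDelta, LZ4),
    -- Floating-point sensor values: Gorilla XOR encoding
    reading   Float64 CODEC(Gorilla),
    -- Few distinct values: dictionary encoding plus ZSTD
    region    LowCardinality(String) CODEC(ZSTD(1)),
    sensor_id UInt32
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts);
```

Note that codecs can be chained (e.g. DoubleDelta followed by LZ4), and LowCardinality is a column type rather than a codec, so it composes with ZSTD rather than replacing it.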
Why this matters:
In our testing with IoT-scale time-series data, applying these specific codecs via the agent reduced storage requirements by ~75% compared to generic AI-generated DDL.
I’d love to discuss how we can codify these as "High-Performance Rules" within the new contributing guide you are developing.