Merged
7 changes: 7 additions & 0 deletions .github/actions/spelling/allow.txt
Original file line number Diff line number Diff line change
@@ -720,4 +720,11 @@ topmetrics
tsdb
xzvf
bson
fileformat
hangzhou
HCFS
namenode
paimon
upserts


9 changes: 9 additions & 0 deletions docs/connectors/supported-data-sources.md
@@ -385,6 +385,15 @@ The beta version of the data sources is in public preview and has passed the bas
<td>➖</td>
<td>0.11.0</td>
</tr>
<tr>
<td>Paimon</td>
<td>➖</td>
<td>➖</td>
<td>➖</td>
<td>✅</td>
<td>➖</td>
<td>0.6 and above</td>
</tr>
<tr>
<td>SelectDB</td>
<td>➖</td>
106 changes: 106 additions & 0 deletions docs/connectors/warehouses-and-lake/paimon.md
@@ -0,0 +1,106 @@
# Paimon

import Content1 from '../../reuse-content/_enterprise-and-community-features.md';

<Content1 />

Apache Paimon is a lake format that lets you build a real-time Lakehouse with Flink and Spark. TapData can stream data into Paimon tables for an always-up-to-date data lake.

```mdx-code-block
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
```

## Supported versions

Paimon 0.6 and later (0.8.2+ recommended)

## Supported operations

DML only: INSERT, UPDATE, DELETE

## Supported data types

All Paimon 0.6+ types. To preserve precision, follow the [official docs](https://paimon.apache.org/docs/master/concepts/spec/fileformat/) when mapping columns—for example, use INT32 for DATE in Parquet files.

:::tip
Add a [Type Modification Processor](../../data-transformation/process-node.md#type-modification) to the job if you need to cast columns to a different Paimon type.
:::
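To make the Parquet example above concrete, the sketch below illustrates a few logical-to-physical type mappings taken from the Parquet format specification. The lookup table and helper are hypothetical illustrations, not part of TapData or Paimon:

```python
# Illustrative mapping of a few Paimon/Parquet logical types to Parquet
# physical types, per the Parquet format spec. This is a teaching aid,
# not TapData's actual implementation.
PAIMON_TO_PARQUET = {
    "DATE": "INT32",         # days since the Unix epoch stored as INT32
    "TIMESTAMP(3)": "INT64", # millisecond timestamps stored as INT64
    "STRING": "BYTE_ARRAY",
    "BOOLEAN": "BOOLEAN",
}

def parquet_physical_type(paimon_type: str) -> str:
    """Return the Parquet physical type for a Paimon column type."""
    try:
        return PAIMON_TO_PARQUET[paimon_type]
    except KeyError:
        raise ValueError(f"no mapping defined for {paimon_type!r}")
```

A column declared `DATE` in Paimon, for example, resolves to `INT32` in the Parquet file, which is why precision-aware mapping matters when casting columns.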

## Considerations

- To avoid write conflicts and reduce compaction pressure, disable multi-threaded writes on the target node, set the batch size to 1,000 rows, and set the timeout to 1,000 ms.
- Always define a primary key for efficient upserts and deletes; for large tables, use partitioning to speed up queries and writes.
- Paimon supports primary keys only (no secondary indexes) and does not allow runtime schema evolution.

## Connect to Paimon

1. Log in to TapData platform.
2. In the left navigation bar, click **Connections**.
3. On the right side of the page, click **Create**.
4. In the pop-up dialog, search for and select **Paimon**.
5. Fill in the connection details as shown below.

![Connect to Paimon](../../images/connect_paimon.png)

**Basic Settings**
- **Name**: Enter a meaningful and unique name.
- **Type**: Paimon can be used only as a target database.
- **Warehouse Path**: Enter the root path for Paimon data based on the storage type.
- S3: `s3://bucket/path`
- HDFS: `hdfs://namenode:port/path`
- OSS: `oss://bucket/path`
- Local FS: `/local/path/to/warehouse`
- **Storage Type**: TapData supports S3, HDFS, OSS, and Local FS; each storage type has its own connection settings.
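As an aside, the warehouse-path formats listed above follow a simple prefix convention per storage type. The helper below is a hypothetical validator for illustration only, not part of the TapData product:

```python
# Hypothetical check that a warehouse path matches the prefix expected
# for its storage type, mirroring the formats documented above.
def validate_warehouse_path(storage_type: str, path: str) -> bool:
    prefixes = {
        "S3": "s3://",
        "HDFS": "hdfs://",
        "OSS": "oss://",
        "Local FS": "/",   # local paths are absolute filesystem paths
    }
    prefix = prefixes.get(storage_type)
    if prefix is None:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    return path.startswith(prefix)
```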

```mdx-code-block
<Tabs className="unique-tabs">
<TabItem value="S3" default>
```
Use this option for any S3-compatible object store—AWS S3, MinIO, or private-cloud solutions. Supply the endpoint, keys, and region (if required) so TapData can write Paimon data directly to your bucket.
- **S3 Endpoint**: full URL including protocol and port, e.g. `http://192.168.1.57:9000/`
- **S3 Access Key**: the Access-Key ID that has read/write permission on the bucket/path
- **S3 Secret Key**: the corresponding Secret-Access-Key
- **S3 Region**: the region where the bucket was created, e.g. `us-east-1`

</TabItem>

<TabItem value="HDFS">
Choose this when your warehouse sits on Hadoop HDFS or any HCFS-compatible cluster. TapData writes through the standard HDFS client, so give it the NameNode host/port and the OS user it should impersonate.

- **HDFS Host**: NameNode hostname or IP, e.g. `192.168.1.57`
- **HDFS Port**: NameNode RPC port, e.g. `9000` or `8020`
- **HDFS User**: OS user that TapData will impersonate when writing, e.g. `hadoop`

</TabItem>

<TabItem value="OSS">
Pick this for Alibaba Cloud OSS or any other OSS-compatible provider. Enter the public or VPC endpoint and the access key pair; TapData will create Paimon files inside the bucket you specify.

- **OSS Endpoint**: VPC or public endpoint, e.g. `https://oss-cn-hangzhou.aliyuncs.com` (do **not** include the bucket name)
- **OSS Access Key**: Access-Key ID that has read/write permission on the bucket/path
- **OSS Secret Key**: the corresponding Access-Key Secret

</TabItem>

<TabItem value="Local">

**Local filesystem**:
Select this option if you want to store the Paimon warehouse on a local disk or an NFS mount that is visible to the TapData server. Make sure the directory is writable by the TapData OS user and that enough free space is available for both data and compaction temporary files.

</TabItem>
</Tabs>

- **Database Name**: one connection maps to one database (default is `default`). Create extra connections for additional databases.

**Advanced Settings**
- **Agent Settings**: Defaults to **Platform automatic allocation**; you can also manually specify an agent.
- **Model Load Time**: If the data source contains fewer than 10,000 models, their schemas are refreshed every hour; if it exceeds 10,000, the refresh runs daily at the time you specify.

6. Click **Test** at the bottom; after it passes, click **Save**.

:::tip

If the test fails, follow the on-screen hints to fix the issue.

:::
4 changes: 3 additions & 1 deletion docs/data-replication/create-task.md
@@ -93,7 +93,9 @@ As an example of creating a data replication task, the article demonstrates the
* **Incremental Multi-thread Writing**: The number of concurrent threads for writing incremental data.
* **Batch Write Item Quantity**: The number of items written per batch during full synchronization.
* **Max Wait Time per Batch Write**: Set the maximum waiting time per batch write, evaluated based on the target database’s performance and network latency, in milliseconds.
* <span id="advanced-settings">**Advanced Settings**</span>

* **<span id="advanced-settings">Advanced Settings</span>**

* **Data Writing Mode**: Select according to business needs.
* **Process by Event Type**: If you choose this, you also need to select data writing strategies for insert, update, and delete events.
* **Statistical Append Write**: Only processes insert events, discarding update and delete events.
Binary file added docs/images/connect_paimon.png
15 changes: 14 additions & 1 deletion docs/introduction/terms.md
@@ -76,4 +76,17 @@ A lightweight runtime component that executes pipelines. It connects to data sou

## TCM (TapData Control Manager)

The centralized management plane for pipeline orchestration, configuration, monitoring, and deployment. Users interact with TCM to create, modify, and observe pipelines.
The centralized management plane for pipeline orchestration, configuration, monitoring, and deployment. Users interact with TCM to create, modify, and observe pipelines.


## QPS

Queries Per Second. The average number of change events the sync task processes every second. It shows how fast data is replicated from the source to the target.
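The arithmetic behind the metric is simply events divided by elapsed time, as in this minimal sketch (illustrative only):

```python
# QPS as defined above: average change events processed per second
# over a sampling window.
def qps(event_count: int, window_seconds: float) -> float:
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    return event_count / window_seconds
```

For example, 3,000 change events processed over a 60-second window yields a QPS of 50.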

## Incremental Validation

While the task is running, TapData randomly compares rows in the target with the source to make sure they match. The check keeps going as long as the sync is active. See [Incremental Data Check](../data-replication/incremental-check.md).

## API Server

TapData’s built-in publishing layer. Pick any table and expose it as a [RESTful API endpoint](../publish-apis/README.md). Teams use it to share clean, governed data with mobile apps, third-party systems, or any client that speaks HTTP.
58 changes: 57 additions & 1 deletion docs/platform-ops/operation.md
@@ -377,4 +377,60 @@ a data replication task is used for scenarios that only synchronize incremental
* [Data Services](../publish-apis/README.md)
* Deleting or taking an API offline will render it unavailable.
* [System Management](../system-admin/other-settings/system-settings.md)
* When [managing a cluster](../system-admin/manage-cluster.md), only perform close or restart operations on related services when they are experiencing anomalies.
* When [managing a cluster](../system-admin/manage-cluster.md), only perform close or restart operations on related services when they are experiencing anomalies.

## How to run a TapData health check

Use this checklist to confirm TapData is running normally.

1. Log in to TapData.

2. In the left menu choose **System Management > Cluster Management** and verify [component status](../system-admin/manage-cluster.md):
- TapData Manager, Engine, and API Server are all **Running**.
- CPU and memory usage are below 70%.

3. Open **Data Replication** or **Data Transformation** and scan the task list:
- Every task should show **Running**.
- Click a task name and check [metrics](../data-replication/monitor-task.md): lag is acceptable and QPS > 0.

If a task is unhealthy:
- **Read the error log** at the bottom of the monitor page and follow the hints. See [troubleshooting](../platform-ops/troubleshooting/README.md).
- **Test the connection**: open **Connections**, click **Test** on the related source/target and fix any auth or network issues.
- **Check incremental lag**: if QPS stays elevated for more than 30 minutes, the source may be in a batch window; consider scaling the task. If the target receives no changes, verify the CDC prerequisites (e.g., MySQL binlog format set to ROW). Primary-key conflicts in the log usually indicate a configuration change.

Still stuck? [Contact support](../appendix/support.md).


## How to handle TapData alerts

TapData sends alerts by [email](../case-practices/best-practice/alert-via-qqmail.md). Use the subject line to pick the right playbook below.

**Task-state alerts**

| Alert | What it means | What to do |
| --- | --- | --- |
| **Task error** | Task stopped; replication is down. | Open the task → Logs, fix the issue, restart. Escalate if stuck. |
| **Full load finished** | Bulk copy is done. | Info only. Run a data-validate task if you need a checksum. |
| **Incremental started** | Task is now streaming changes. | Info only. |
| **Task stopped** | Someone clicked Stop. | Restart if it was accidental. |

**Replication-lag alert**

Lag exceeds the threshold you set. Open the task monitor and look for:

- **Slow source reads** – “Read time” is high → ask the DBA to check load or network.
- **Slow target writes / high QPS** – raise “Incremental read size” (≤1,000) and “Batch write size” (≤10,000); keep Agent memory below 70%.
- **False lag** – QPS is 0 but lag still climbs → enable [heartbeat table](../case-practices/best-practice/heart-beat-task.md) on the source.
- **Slow engine** – “Process time” keeps rising → optimise JS code or open a ticket.
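The triage order above can be sketched as a small decision function. The field names and the read-time threshold are assumptions chosen for illustration; they are not actual TapData metrics or defaults:

```python
# Hedged sketch of the lag-triage checklist above. Inputs and the
# 1000 ms read-time threshold are illustrative assumptions.
def diagnose_lag(read_time_ms: float, process_time_trend: str,
                 qps: float, lag_seconds: float) -> str:
    if qps == 0 and lag_seconds > 0:
        # QPS is 0 but lag still climbs: likely false lag
        return "false lag: enable a heartbeat table on the source"
    if read_time_ms > 1000:
        return "slow source reads: ask the DBA to check load or network"
    if process_time_trend == "rising":
        return "slow engine: optimise JS code or open a ticket"
    return "slow target writes: tune read/write batch sizes"
```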

**Validation & performance alerts**

| Alert | What it means | What to do |
| --- | --- | --- |
| **Validation diff** | Incremental compare found mismatches. | If auto-repair is enabled, no action is needed; otherwise open the task and click **Repair**. |
| **Data-source node slow** | Source/target latency high. | If a lag alert also fired, treat it as “slow source reads” above; otherwise keep watching and involve the DBA if lag appears. |
| **Process node slow** | JS node is the bottleneck. | Optimise logic or open a ticket if lag follows. |
| **Validation job error** | Compare task crashed. | Doesn’t affect replication; restart the validation job. Escalate if it keeps failing. |
| **Count diff limit exceeded** | Row counts don’t match. | **Full-sync task**: switch to full-field compare to pinpoint rows. **Incremental task**: wait 1–2 lag cycles and re-validate; repair if the gap remains. |
| **Field diff limit exceeded** | Same as above but field-level. | Same playbook. |
| **Task retry limit** | Task retried and still failed. | Open the task, follow the error message; escalate if you can’t clear it. |
20 changes: 20 additions & 0 deletions docs/publish-apis/query/query-via-restful.md
@@ -57,3 +57,23 @@ If you'd prefer to use an external tool or automate API testing, [Postman](https
6. Click **Send**. You’ll get a real-time response from the API.

![Query Result](../../images/restful_api_query_result.png)


## Common response codes

| Code | Message | Meaning |
| --- | --- | --- |
| 200 | OK | Request succeeded |
| 401 | Unauthorized error: token expired | Token expired; generate a new one |
| 404 | Not Found error: endpoint not found | API does not exist or is not yet published—check the URL or wait for the publish to finish |
| 429 | Rate limit exceeded. Maximum \${api limit} requests per second allowed | You hit the rate limit; retry later or raise the limit in the API settings |
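A client can act on these codes mechanically. The sketch below shows one possible retry policy; the exponential backoff and attempt limit are assumptions for illustration, not behavior mandated by the API:

```python
# Illustrative handler for the response codes documented above.
# The backoff policy and max_attempts default are assumptions.
def next_action(status: int, attempt: int, max_attempts: int = 5) -> str:
    if status == 200:
        return "done"
    if status == 401:
        return "refresh token and retry"
    if status == 404:
        return "check the endpoint URL or wait for publishing"
    if status == 429 and attempt < max_attempts:
        # exponential backoff: 1 s, 2 s, 4 s, ...
        return f"retry after {2 ** attempt}s"
    return "give up"
```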

## FAQ

* Q: The API takes too long to return data or times out

A: Add indexes on every column used in `WHERE`, `ORDER BY`, or joins. If the delay persists, enable response caching or increase the query timeout in the API settings.

* Q: The payload doesn’t look right

A: Check the data-source model and the underlying table; make sure the data is current and that any field-merging logic matches what you expect.
1 change: 1 addition & 0 deletions sidebars.js
@@ -73,6 +73,7 @@ const sidebars = {
'connectors/warehouses-and-lake/gaussdb',
'connectors/warehouses-and-lake/greenplum',
'connectors/warehouses-and-lake/hudi',
'connectors/warehouses-and-lake/paimon',
'connectors/warehouses-and-lake/selectdb',
'connectors/warehouses-and-lake/starrocks',
'connectors/warehouses-and-lake/tablestore',