title	Key Metrics
category	operations

Key Metrics

If you use Ansible to deploy a TiDB cluster, you can deploy the monitoring system at the same time. See Overview of the Monitoring Framework for more information.

The Grafana dashboard is divided into four sub-dashboards: node_export, PD, TiKV, and TiDB. They contain many metrics to help you diagnose problems. For routine operations, some of the key metrics are also displayed on the Overview dashboard, so that you can see the status of each component and of the entire cluster at a glance. See the following section for their descriptions:

Key metrics description

Service	Panel Name	Description	Normal Range
PD	Storage Capacity	the total storage capacity of the TiDB cluster
PD	Current Storage Size	the occupied storage capacity of the TiDB cluster
PD	Store Status -- up store	the number of TiKV nodes that are up
PD	Store Status -- down store	the number of TiKV nodes that are down	`0`. If the number is bigger than `0`, it means some node(s) are down.
PD	Store Status -- offline store	the number of TiKV nodes that are manually offline
PD	Store Status -- Tombstone store	the number of TiKV nodes that are Tombstone
PD	Current storage usage	the storage occupancy rate of the TiKV cluster	If it exceeds 80%, you need to consider adding more TiKV nodes.
PD	99% completed cmds duration seconds	the 99th percentile duration to complete a pd-server request	less than 5ms
PD	average completed cmds duration seconds	the average duration to complete a pd-server request	less than 50ms
PD	leader balance ratio	the difference between the leader ratio of the node with the biggest leader ratio and that of the node with the smallest leader ratio	It is less than 5% for a balanced situation. It becomes bigger when a node is restarting.
PD	region balance ratio	the difference between the region ratio of the node with the biggest region ratio and that of the node with the smallest region ratio	It is less than 5% for a balanced situation. It becomes bigger when adding or removing a node.
TiDB	handle requests duration seconds	the response time to get TSO from PD	less than 100ms
TiDB	tidb server QPS	the QPS of the cluster	application specific
TiDB	connection count	the number of connections from application servers to the database	Application specific. If the number of connections changes abruptly, you need to find out the reason. If it drops to 0, check whether the network is broken; if it surges, check the application.
TiDB	statement count	the number of statements of each type within a given time	application specific
TiDB	Query Duration 99th percentile	the 99th percentile query time
TiKV	99% & 99.99% scheduler command duration	the 99th percentile and 99.99th percentile scheduler command duration	For 99%, it is less than 50ms; for 99.99%, it is less than 100ms.
TiKV	95% & 99.99% storage async_request duration	the 95th percentile and 99.99th percentile Raft command duration	For 95%, it is less than 50ms; for 99.99%, it is less than 100ms.
TiKV	server report failure message	There might be an issue with the network or the message might not come from this cluster.	If a large number of messages contain `unreachable`, there might be an issue with the network. If a message contains `store not match`, it does not come from this cluster.
TiKV	Vote	the frequency of the Raft vote	Usually, the value only changes when there is a split. If the value of Vote remains high for a long time, the system might have a severe issue and some nodes are not working.
TiKV	95% and 99% coprocessor request duration	the 95th percentile and the 99th percentile coprocessor request duration	Application specific. Usually, the value does not remain high.
TiKV	Pending task	the number of pending tasks	Except for the PD worker, a high value is not normal.
TiKV	stall	RocksDB stall time	If the value is bigger than 0, it means that RocksDB is too busy, and you need to pay attention to IO and CPU usage.
TiKV	channel full	The channel is full and the threads are too busy.	If the value is bigger than 0, the threads are too busy.
TiKV	95% send message duration seconds	the 95th percentile message sending time	less than 50ms
TiKV	leader/region	the number of leader/region per TiKV server	application specific
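
The fixed thresholds in the table above can be turned into a simple automated check. The following is a minimal sketch, not part of the TiDB tooling: the function name `check_readings`, the constant names, and the sample values are all hypothetical, and it assumes that you have already collected the current panel values by some other means, with durations in milliseconds and storage usage as a fraction between 0 and 1.

```python
# Upper bounds in milliseconds, copied from the "Normal Range" column above.
# Keys are panel names; a reading must be strictly below the bound.
LATENCY_LIMITS_MS = {
    "99% completed cmds duration seconds": 5,
    "average completed cmds duration seconds": 50,
    "handle requests duration seconds": 100,
    "95% send message duration seconds": 50,
}

# Panels for which the table treats any value bigger than 0 as a problem.
ZERO_EXPECTED = {"Store Status -- down store", "stall", "channel full"}

# "Current storage usage": consider adding TiKV nodes above 80%.
STORAGE_USAGE_LIMIT = 0.80


def check_readings(readings):
    """Return a list of warnings for readings outside the documented ranges.

    `readings` maps a panel name to its current value (hypothetical input
    format). Panels without a fixed threshold in the table are ignored.
    """
    warnings = []
    for panel, value in readings.items():
        if panel in LATENCY_LIMITS_MS:
            limit = LATENCY_LIMITS_MS[panel]
            if value >= limit:
                warnings.append(f"{panel}: {value}ms is not less than {limit}ms")
        elif panel in ZERO_EXPECTED:
            if value > 0:
                warnings.append(f"{panel}: expected 0, got {value}")
        elif panel == "Current storage usage":
            if value > STORAGE_USAGE_LIMIT:
                warnings.append(f"{panel}: {value:.0%} exceeds 80%")
    return warnings


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {
        "Store Status -- down store": 1,
        "Current storage usage": 0.85,
        "handle requests duration seconds": 20,
        "stall": 0,
    }
    for warning in check_readings(sample):
        print(warning)
```

Application-specific panels such as tidb server QPS or connection count have no fixed bound in the table, so this sketch leaves them out; they are better compared against a baseline for your own workload.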
