In this section we are going to cover two topics that often get grouped together: monitoring and observability. Monitoring is about detecting problems and anomalies. Observability is about being able to look inside the system and see how it operates. Most systems can be monitored, but not all are designed to be observed. If we use a microscope as an analogy, observability is the slide.

So let's take a look at monitoring first. With all the components of TiDB being highly available, there is clearly monitoring happening somewhere.

For both the PD servers and the TiKV servers, high availability is provided by Raft consensus. A Raft group uses a minimum of three nodes, so that a majority can still elect a leader if one node fails. I will link to a description of the algorithm on Wikipedia in the resources, but the point is that these cluster components provide high availability themselves.
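
To make the majority rule concrete, here is a minimal sketch of the arithmetic, assuming the standard Raft rule that a leader needs votes from a strict majority of the group. The function names are illustrative only and are not part of TiDB, PD, or TiKV.

```python
def quorum(nodes: int) -> int:
    """Smallest number of votes that forms a strict majority of `nodes`."""
    if nodes < 1:
        raise ValueError("a Raft group needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    # Print the majority size and failure tolerance for small groups.
    for n in range(1, 8):
        print(f"nodes={n} quorum={quorum(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this rule a group of three tolerates one failed node, while a group of four still tolerates only one, which is why three is the practical minimum.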

The TiDB server is stateless and has no concept of membership, so it does not require Raft for high availability. Instead, it relies on Kubernetes to provide it with replicas, and a TCP load balancer to route clients to a healthy server. Unhealthy servers can be replaced quickly without moving data around.
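
As a rough illustration of what "healthy" can mean at the TCP level, the sketch below checks whether a server accepts a connection and filters a list of backends accordingly. It is an assumption-laden toy, not how Kubernetes or any particular load balancer implements its checks, and the helper names and backend addresses are hypothetical.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def healthy_backends(backends, timeout: float = 1.0):
    """Keep only the (host, port) pairs that currently accept connections."""
    return [(h, p) for h, p in backends if is_reachable(h, p, timeout)]


if __name__ == "__main__":
    # Hypothetical TiDB server addresses; replace with real ones.
    candidates = [("tidb-0.example.internal", 4000),
                  ("tidb-1.example.internal", 4000)]
    print(healthy_backends(candidates))
```

A successful connect only shows that the port is open. A production check would usually also confirm that the server can answer a query.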

So the takeaways for monitoring are:

  • Always ensure that there are three or more instances of PD and TiKV.
  • It is also recommended that the number of instances be odd. An even count adds no extra fault tolerance over the next lower odd count, and it allows a vote to split evenly.
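
The two takeaways above can be turned into a small pre-deployment check. This is only a sketch: the function name and the warning wording are made up for illustration, and the thresholds simply restate the recommendations from this section.

```python
def check_instance_count(component: str, count: int) -> list[str]:
    """Return warnings when `count` violates the recommendations above."""
    warnings = []
    if count < 3:
        warnings.append(
            f"{component}: {count} instance(s); at least 3 are recommended")
    if count % 2 == 0:
        warnings.append(
            f"{component}: {count} is even; an odd count is recommended")
    return warnings


if __name__ == "__main__":
    # Hypothetical planned topology.
    for name, n in [("PD", 3), ("TiKV", 4)]:
        for message in check_instance_count(name, n):
            print(message)
```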

In the next video we will talk about instrumentation with Prometheus and Grafana.