From Li SHEN: email@example.com
From the first blog of TiDB internal, we know that TiKV cluster is the distributed KV storage engine of TiDB database. Data is replicated and managed in Regions and each Region has multiple Replicas distributed on different TiKV nodes. Among these replicas, Leader is in charge of read/write and Follower synchronizes the raft log sent by Leader. Now, please think about the following questions:
It is easy to solve the above questions one by one, but once mixed up, it becomes difficult. It seems that some questions just need to consider the internal situation of a single Raft Group, for example, whether to add replicas is determined by if the number is enough. But actually, where to add this replica needs a global view. The whole system is changing dynamically: situations like Region splitting, node joining, node failing and hotspot accessing changes occur constantly. The schedule system also needs to keep marching towards the best state. Without a component that can master, schedule and configure the global information, it is hard to meet these needs. Therefore, we need a central node to control and adjust the overall situation of the system. So here comes the Placement Driver (PD) module.
I want to categorize and sort out the previously listed questions. In general, there are two types:
A distributed and highly available storage system must meet the following requirements:
A good distributed system needs to have the following optimizations:
If the first type of requirements are met, the system supports multi-replica disaster recovery, dynamic scalability, tolerance of node failure and automatic disaster recovery. If the second type of requirements are met, the load of the system becomes more balanced and easier to manage. To meet these needs, we need to, first of all, collect enough information, such as the state of each node, information of each Raft Group and the statistics of business access and operation. Then we should set some policies for PD to formulate a schedule plan to meet the previous requirements according to this information and the schedule policy.
The basic operations of schedule are the simplest. In other word, what we can do to meet the schedule policy. This is the essence of the whole scheduler. The previous scheduler requirements seem to be complicated, but can be generalized into 3 operations:
The Raft protocol happens to meet these requirements: the
TransferLeader commands support the three basic operations.
Schedule depends on the information gathering of the whole cluster. Simply put, we need to know the state of each TiKV node and each Region. TiKV cluster reports two kinds of information to PD:
There are heartbeats between TiKV Store and PD. On the one hand, PD checks whether each Store is active or if there are newly-added Stores through heartbeats. On the other hand, heartbeats carry the state information of this Store, mainly including:
Leader of each Raft Group and PD are connected with heartbeats, which report the state of this Region, including:
Through these two kinds of heartbeats, PD gathers the information of the whole cluster and then makes decisions. What’s more, PD makes more accurate decisions by getting extra information through the management interface. For example, when the heartbeat of a Store is interrupted, PD has no idea whether it is temporarily or permanently. PD can only waits for a period of time (30 minutes by default); if there is still no heartbeat, PD considers that the Store has been offline and it needs to move all Regions on the Store away. However, if an Operations staff manually offline a machine, he needs to tell PD through its management interface that the Store is unavailable. In this case, PD will immediately move all Regions on the Store away.
After gathering information, PD needs some policies to draw up a concrete schedule plan.
When PD finds that the number of Replica for a Region doesn’t meet the requirement through the heartbeat of a Region Leader, it modifies the number through the Add/Remove Replica operations. This might occur when:
Please pay attention that it is the same place, not the same node. In general, PD can only guarantee that multiple Replicas would not be in the same node, so as to avoid the problem that many Replicas get lost when a node fails. In an actual deployment scenario, the following requirements may come out:
Essentially, what you need is a node that has the common location attribute and constitutes a minimum fault-tolerance unit. We hope that inside this unit, multiple Replicas of a Region will not co-exist. At this time, you can configure labels to nodes and location-labels in PD to designate which label to be the location identifier. When distributing Replicas, the node that stores multiple Replicas of a Region will not have the same location identifier.
Replicas are distributed evenly across Stores As the data storage capacity of each replica is fixed, if we maintain the balance of the number of replica on each node, the overall load will be more balanced.
The number of Leader is distributed evenly across Stores
The Raft protocol reads and writes through Leader, so the computational load is mainly placed on Leader. Therefore, PD manages to distribute Leader among different stores.
When submitting information, each Store and Region Leader carry the information of the current access load, such as the read/write speed of Key. PD checks the hotspots and distributes them across nodes.
Each Store specifies a Capacity parameter when starting, which indicates the limit of the storage space of this Store. PD considers the remaining space of the node when scheduling.
As scheduling operation consumes CPU, memory, disk I/O, and network bandwidth, we should not affect the online service. PD controls the number of ongoing operations and the default speed is conservative. If you want to speed up the scheduling (stop the service upgrade, add new nodes, wish to schedule as soon as possible, etc.), then you can manually accelerate it through pd-ctl.
When offlining a node manually through pd-ctl, PD will move the data on the node away within a certain rate control. After that, it will put the node offline.
Now let’s see the schedule process.
PD gets the detail data of the cluster by constantly gathering information through heartbeats of Store or Leader. Based on this information and the schedule policies, PD generates the operating sequence, and then checks whether there is an operation to be performed on this Region when receiving a heartbeat sent by the Region Leader. PD returns the upcoming operation to Region Leader through the reply message of the heartbeat and then monitors the execution result in the next heartbeat. These operations are just suggestions to Region Leader, which are not guaranteed to be executed. It is the Region Leader that decides to whether and when to execute according to its current state.
This blog discloses information you might not find elsewhere. We hope that you’ve had a better understanding about what needs to be considered to build a distributed storage system for scheduling and how to decouple policies and implementation to support a more flexible expansion of policy.
We hope these three blogs (Data Storage, Computing, and Scheduling) can help you understand the basic concepts and implementation principles of TiDB. In the future, more blogs about TiDB from code to architecture are on their way!