In my previous post, I introduced the evolution of disaster recovery (DR) technologies for databases. In this post, we’ll focus on TiDB’s DR solutions, analyze their pros and cons, and look at the deployment scenarios each one suits best.
TiDB’s storage architecture
Before we deep dive into TiDB’s DR solutions, let’s explore what TiDB is and how it processes and stores data.
TiDB is an open source, distributed SQL database with horizontal scalability, online transactional and analytical processing, and built-in high availability. It’s also fully MySQL compatible and features strong consistency.
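Because TiDB speaks the MySQL wire protocol, any standard MySQL client or driver can talk to it. Below is a minimal sketch using the Python PyMySQL driver; the host name and credentials are hypothetical placeholders, and the only TiDB-specific detail assumed is its default SQL port, 4000.

```python
# Minimal sketch: connecting to TiDB with a standard MySQL driver (PyMySQL).
# The host, user, and password below are placeholders; adjust for your cluster.
import pymysql

conn = pymysql.connect(
    host="tidb.example.internal",  # hypothetical TiDB server address
    port=4000,                     # TiDB's default MySQL-protocol port
    user="root",
    password="",
    database="test",
)

with conn.cursor() as cur:
    # TiDB reports a MySQL-compatible version string.
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())

conn.close()
```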
TiDB is made up of the following key components:
- Placement Driver (PD): the brain of the whole TiDB system, which stores and manages the metadata, and sends data scheduling commands to TiKV.
- TiDB server: a stateless SQL layer that parses SQL, performs computation, and forwards requests to TiKV or TiFlash.
- TiKV server: the row store.
- TiFlash server: the columnar store.
Fig 1. TiDB’s architecture
TiDB adopts the Raft consensus protocol. It stores multiple copies of the data (three by default) and distributes them evenly across different TiKV nodes. The smallest segment of data stored on TiKV nodes is called a TiDB Region.
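To see why three copies provide fault tolerance, it helps to spell out the Raft majority arithmetic. The sketch below is purely illustrative and assumes only the default replica counts mentioned above.

```python
# Illustrative Raft quorum arithmetic for one TiDB Region.
# With 3 replicas, a majority of 2 is required to commit a write or elect a leader.

def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still exists."""
    return replicas - raft_quorum(replicas)

for n in (3, 5):
    print(f"{n} replicas -> quorum {raft_quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 3 replicas -> quorum 2, tolerates 1 failure(s)
# 5 replicas -> quorum 3, tolerates 2 failure(s)
```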
TiDB’s DR solutions
To meet different customer needs, TiDB offers the following DR solutions:
- TiCDC DR with a 1:1 architecture
- Multi-replica DR with a 2-2-1 architecture
- TiCDC DR with a 2-2-1:1 architecture
What are these solutions? How do they achieve database failover? What scenarios are they most suitable for? We’ll explore each one in the following sections.
TiCDC DR with a 1:1 architecture
TiCDC is TiDB’s change data capture tool. It replicates TiDB’s incremental data. In this architecture, there are two TiDB clusters deployed in two separate Regions. (Note: In this post, a capitalized “Region” represents a geographical area, not a TiDB data Region.)
- Cluster 1, deployed in Region 1, has three data replicas and serves reads and writes.
- Cluster 2 is deployed in Region 2 and also has three data replicas. It serves as a DR cluster and can also handle read-only services with a slight delay. TiCDC synchronizes the data changes between the two clusters.
Fig 2. TiCDC DR with a 1:1 architecture
This is TiDB’s most recommended DR solution. It is highly available and reliable. It achieves single-region fault tolerance with a second-level recovery point objective (RPO) and a minute-level or even lower recovery time objective (RTO). If you only need single-region fault tolerance and can accept a non-zero RPO, this architecture is the more cost-effective choice. It also provides read and write scalability on top of disaster recovery.
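For illustration, replication in this topology is typically set up by creating a TiCDC changefeed that points at Cluster 2’s MySQL-compatible endpoint. The sketch below shells out to the `cdc cli` tool from Python; the PD address, sink URI, and changefeed ID are hypothetical placeholders, and the exact flags can vary between TiCDC versions, so treat this as a sketch rather than a copy-paste recipe.

```python
# Sketch: creating a TiCDC changefeed that replicates Cluster 1 to Cluster 2.
# All addresses and credentials below are hypothetical placeholders.
import subprocess

pd_addr = "http://pd-cluster1.example.internal:2379"  # PD endpoint of the primary cluster
sink_uri = "mysql://dr_user:dr_pass@tidb-cluster2.example.internal:4000/"  # DR cluster

subprocess.run(
    [
        "cdc", "cli", "changefeed", "create",
        f"--pd={pd_addr}",
        f"--sink-uri={sink_uri}",
        "--changefeed-id=dr-region1-to-region2",
    ],
    check=True,
)
```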
Multi-replica DR with a 2-2-1 architecture
The diagram below illustrates the 2-2-1 architecture. In it, a single TiDB cluster spans three regions. Regions 1 and 2 each contain two copies of the data, and Region 3 holds one. These data replicas are stored in separate availability zones (AZs). In general, the network between AZs is fast and stable.
Fig 3. Multi-replica DR with a 2-2-1 architecture
In this architecture:
- Region 1 handles read and write requests.
- Region 2 is used for failover after a disaster occurs in Region 1. It can also cover some read loads that are not sensitive to latency.
- Region 3 holds a single tie-breaking replica that guarantees a Raft majority can still be reached even when Region 1 is down.
This architecture provides single-region fault tolerance, with zero RPO and a minute-level or even shorter RTO. If you need RPO = 0, we recommend this DR solution.
The biggest disadvantage of this solution is that database response time is affected by the network latency between Region 1 and Region 2. This is because a transaction cannot commit until its data changes are acknowledged by a TiKV node in Region 2 (or Region 3): the two replicas in Region 1 alone cannot form a Raft majority.
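To make this trade-off concrete, consider the majority arithmetic for the five replicas in this topology. The sketch below is illustrative only; it simply counts replicas according to the 2-2-1 layout described above.

```python
# Illustrative majority arithmetic for the 2-2-1 layout (5 replicas total).
replicas_per_region = {"Region 1": 2, "Region 2": 2, "Region 3": 1}
total = sum(replicas_per_region.values())  # 5
quorum = total // 2 + 1                    # 3

# Zero RPO: losing any single region still leaves a Raft majority of voters.
for down, lost in replicas_per_region.items():
    surviving = total - lost
    print(f"{down} down -> {surviving} replicas left, "
          f"quorum {'kept' if surviving >= quorum else 'lost'}")

# Latency cost: Region 1 alone holds only 2 replicas, below the quorum of 3,
# so every commit must be acknowledged by at least one replica outside Region 1
# before it can return to the client.
print(quorum - replicas_per_region["Region 1"], "cross-region ack(s) required per commit")
```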
TiCDC DR with a 2-2-1:1 architecture
In this architecture, there are two TiDB clusters. Cluster 1 includes five data replicas and spans three regions:
- Region 1 contains two data replicas, which serve write workloads.
- Region 2 has two replicas for DR in case Region 1 goes down. It can also serve some latency-insensitive read requests.
- Region 3 stores the remaining data replica, which acts as a voter to help maintain the Raft quorum.
Fig 4. TiCDC DR with a 2-2-1:1 architecture
Cluster 2 is the DR cluster for Region 1 and 2. It contains three data replicas and runs in Region 3. Data changes are synchronized between the two clusters through TiCDC.
This deployment looks complex. However, it can raise the fault tolerance target to multiple regions, with second-level RPO and minute-level RTO. If you want your system to achieve multi-region fault tolerance and do not require zero RPO, this solution is a perfect fit.
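The sketch below summarizes, under the assumptions in this section, how the deployment reacts to different outages: Cluster 1 keeps serving as long as three of its five replicas survive, and TiCDC-based failover to Cluster 2 covers the case where both Region 1 and Region 2 are lost. It is illustrative only.

```python
# Illustrative failure analysis for the 2-2-1:1 deployment described above.
# Cluster 1: 5 replicas spread 2-2-1 across Regions 1, 2, and 3.
# Cluster 2: an independent 3-replica cluster in Region 3, fed by TiCDC.
cluster1 = {"Region 1": 2, "Region 2": 2, "Region 3": 1}
quorum = sum(cluster1.values()) // 2 + 1  # 3

def outcome(failed_regions):
    surviving = sum(n for r, n in cluster1.items() if r not in failed_regions)
    if surviving >= quorum:
        return "Cluster 1 keeps quorum (RPO = 0)"
    return "fail over to Cluster 2 via TiCDC (second-level RPO)"

for scenario in [{"Region 1"}, {"Region 2"}, {"Region 1", "Region 2"}]:
    print(sorted(scenario), "->", outcome(scenario))
```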
Summary of TiDB’s key DR solutions
The table below summarizes the advantages and disadvantages of each solution.
| TiDB’s DR solution | Advantages | Disadvantages |
|---|---|---|
| TiCDC DR with a 1:1 architecture | Single-region fault tolerance; second-level RPO; minute-level or even lower RTO; more resilient than the 2-2-1 architecture; read and write scalability; cost-effective | Cannot survive multi-region disasters |
| Multi-replica DR with a 2-2-1 architecture | Single-region fault tolerance; zero RPO; minute-level or even lower RTO; cost-effective | Database response time regression; cannot survive multi-region disasters |
| TiCDC DR with a 2-2-1:1 architecture | Multi-region fault tolerance; second-level RPO; minute-level RTO | Complex architecture; costly |
Table 1. The advantages and limitations of TiDB’s DR solutions
In addition to these DR solutions, TiDB provides a dual-site DR solution called DR Auto-Sync, which is based on Raft learner replicas, as well as a Backup & Restore (BR)-based, dual-domain backup redundancy solution on TiDB Cloud. For more information about these two DR solutions and the disaster scenarios they cover, refer to their online documentation.
Conclusion
In future articles, we’ll compare TiDB’s DR solutions with other distributed SQL databases. We’ll also discuss what distributed SQL databases have brought to DR from a consensus protocol perspective, TiDB’s unique approach, and TiDB’s future plans for DR development. Stay tuned!
If you have any questions or feedback about TiDB or its DR solutions, feel free to contact us through Twitter, LinkedIn, or our Slack Channel. If you’re interested in our products and services, you’re welcome to request a demo from one of our experts or start a free trial.