Data loss can occur while using a database for various reasons, such as operator error, malicious hacker attacks, and server hardware failures. Backup and recovery technology is the final line of defense to ensure data can still be restored and used after such losses.
TiDB, as a native distributed database, fully supports various backup and recovery capabilities. However, due to its unique architecture, the principles of its backup and recovery processes differ from those of traditional databases.
This article gives a comprehensive overview of TiDB’s backup and recovery capabilities.
TiDB’s Backup Capabilities
TiDB offers two types of backups: physical backups and logical backups.
Physical backups involve directly backing up physical files (.SST) and can be divided into full and incremental backups. Logical backups export data to binary or text files. Physical backups are typically used for cluster-level or database-level backups involving large amounts of data to ensure the consistency of the backed-up data.
Logical backups are primarily used for full backups of smaller data sets or fewer tables and do not guarantee data consistency during ongoing operations.
Physical Backups
Physical backups are divided into full backups and incremental backups. Full backups, also known as “snapshot backups,” ensure data consistency through snapshots. Incremental backups, referred to as “log backups” in the current TiDB version, back up the KV change logs over a recent period.
Snapshot Backups
Full Process of Snapshot Backup
- BR Receives Backup Command: BR receives the br backup full command and obtains the backup snapshot point and backup storage address.
- BR Schedules Backup Data: Specific steps include:
  - Pausing GC to prevent the backed-up data from being collected as garbage (see the note after this list).
  - Accessing PD to get information about the distribution of the Regions to be backed up and the TiKV node information.
  - Creating a backup request and sending it to the TiKV nodes, including backup ts, the Regions to be backed up, and the backup storage address.
- TiKV Accepts Backup Request and Initializes Backup Worker: TiKV nodes receive the backup request and initialize a backup worker.
- TiKV Backs Up Data: Specific steps include:
  - Reading data: the backup worker reads data corresponding to backup ts from the Region Leader.
  - Saving to SST files: the data is stored in memory as SST files.
  - Uploading the SST files to the backup storage.
- BR Retrieves Backup Results from Each TiKV: BR collects the backup results from each TiKV node. If there are changes in Regions, the process retries; if it is not possible to retry, the backup fails.
- BR Backs Up Metadata: BR backs up the table schema, calculates the table data checksum, generates the backup metadata, and uploads it to the backup storage.
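As a quick aside on the GC step above, the GC settings that BR must respect during a backup can be inspected from the SQL side. The following read-only query is only an illustration (output omitted; values vary by cluster) and lists the GC-related variables, including the current GC safe point and GC life time:
-- List the GC-related variables that backups interact with
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc_%';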
The recommended method is to use the br command-line tool provided by TiDB to perform a snapshot backup. You can install it using tiup install br. After installing br, you can use the related commands to perform a snapshot backup. Currently, snapshot backups support cluster-level, database-level, and table-level backups. Here is an example of using br for a cluster snapshot backup.
[tidb@tidb53 ~]$ tiup br backup full --pd "172.20.12.52:2679" --storage "local:///data1/backups" --ratelimit 128 --log-file backupfull.log
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br backup full --pd 172.20.12.52:2679 --storage local:///data1/backups --ratelimit 128 --log-file backupfull.log
Detail BR log in backupfull.log
[2024/03/05 10:19:27.437 +08:00] [WARN] [backup.go:311] ["setting `--ratelimit` and `--concurrency` at the same time, ignoring `--concurrency`: `--ratelimit` forces sequential (i.e. concurrency = 1) backup"] [ratelimit=134.2MB/s] [concurrency-specified=4]
Full Backup <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/03/05 10:20:29.456 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=207] [ranges-succeed=207] [ranges-failed=0] [backup-checksum=1.422780807s] [backup-fast-checksum=17.004817ms] [backup-total-ranges=161] [total-take=1m2.023929601s] [BackupTS=448162737288380420] [total-kv=25879266] [total-kv-size=3.587GB] [average-speed=57.82MB/s] [backup-data-size(after-compressed)=1.868GB] [Size=1867508767]
[tidb@tidb53 ~]$ ll /data1/backups/
total 468
drwxr-xr-x. 2 nfsnobody nfsnobody 20480 Mar 5 10:20 1
drwxr-xr-x. 2 tidb tidb 12288 Mar 5 10:20 4
drwxr-xr-x. 2 nfsnobody nfsnobody 12288 Mar 5 10:20 5
-rw-r--r--. 1 nfsnobody nfsnobody 78 Mar 5 10:19 backup.lock
-rw-r--r--. 1 nfsnobody nfsnobody 395 Mar 5 10:20 backupmeta
-rw-r--r--. 1 nfsnobody nfsnobody 50848 Mar 5 10:20 backupmeta.datafile.000000001
-rw-r--r--. 1 nfsnobody nfsnobody 365393 Mar 5 10:20 backupmeta.schema.000000002
drwxrwxrwx. 3 nfsnobody nfsnobody 4096 Mar 5 10:19 checkpoints
The --ratelimit parameter indicates the maximum speed at which each TiKV node can execute backup tasks, set here to 128 MB/s. The --log-file parameter specifies the target file to which the backup logs are written. The --pd parameter specifies the PD nodes. Additionally, the br command supports the --backupts parameter, which indicates the physical time point corresponding to the backup snapshot. If this parameter is not specified, the current time point is used as the snapshot time point.
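For illustration, a database-level snapshot backup pinned to an explicit snapshot time might look like the following sketch; the database name, storage path, and timestamp are placeholders, and the output is omitted:
# Back up a single database at a chosen snapshot point (placeholder values)
tiup br backup db --db test --pd "172.20.12.52:2679" --storage "local:///data1/backups/db_test" --backupts "2024-03-05 10:00:00+0800"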
If we want to determine when a completed snapshot backup was taken from a backup set, br also provides the corresponding command, br validate decode. This command's output is a TSO (Timestamp Oracle). We can use tidb_parse_tso to parse it into the physical time, as shown below.
[tidb@tidb53 ~]$ tiup br validate decode --field="end-version" --storage "local:///data1/backups" | tail -n1
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br validate decode --field=end-version --storage local:///data1/backups
Detail BR log in /tmp/br.log.2024-03-05T10.24.25+0800
448162737288380420
mysql> select tidb_parse_tso(448162737288380420);
+------------------------------------+
| tidb_parse_tso(448162737288380420) |
+------------------------------------+
| 2024-03-05 10:19:28.489000 |
+------------------------------------+
1 row in set (0.01 sec)
Log Backup
Full Process of Log Backup
- BR Receives Backup Command: BR receives the br log start command. It parses and obtains the checkpoint timestamp and backup storage address for the log backup task and registers the task in PD.
- TiKV Monitors Log Backup Task Creation and Updates: Each TiKV node's log backup observer listens for the creation and updates of log backup tasks in PD and backs up the data within the backup range on that node.
- TiKV Log Backup Observer Continuously Backs Up KV Change Logs: Specific steps include:
  - Reading KV data changes and saving them to a custom-format backup file.
  - Periodically querying the global checkpoint timestamp from PD.
  - Periodically generating local metadata.
  - Periodically uploading log backup data and local metadata to the backup storage.
  - Requesting PD to prevent unbacked-up data from being garbage collected.
- TiDB Coordinator Monitors Log Backup Progress: It polls all TiKV nodes to get the backup progress of each Region. Based on the Region checkpoint timestamps, the overall progress of the log backup task is calculated and uploaded to PD.
- PD Persists the Log Backup Task Status: The status of the log backup task can be queried using br log status.
Log Backup Method:
Snapshot backup commands start with br backup ..., while log backup commands start with br log .... To start a log backup, use the br log start command. After initiating a log backup task, use br log status to check the status of the log backup task.
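For example, a start command consistent with the status output shown below might look like the following sketch (the PD address and storage path follow the earlier examples; output omitted):
# Start a log backup task named pitr (placeholder addresses and paths)
tiup br log start --task-name=pitr --pd "172.20.12.52:2679" --storage "local:///data1/backups/pitr"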
In the br log start command, the --task-name parameter specifies the name of the log backup task, the --pd parameter specifies the PD nodes, and the --storage parameter specifies the log backup storage address. The br log start command also supports the --start-ts parameter, which specifies the log backup's start time; if it is not specified, the current time is used as the start-ts.
[tidb@tidb53 ~]$ tiup br log status --task-name=pitr --pd "172.20.12.52:2679"
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br log status --task-name=pitr --pd 172.20.12.52:2679
Detail BR log in /tmp/br.log.2024-03-05T10.56.28+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2024-03-05 10:50:52.939 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data1/backups/pitr
speed(est.): 0.00 ops/s
checkpoint[global]: 2024-03-05 10:55:42.69 +0800; gap=47s
[tidb@tidb53 ~]$ tiup br log status --task-name=pitr --pd "172.20.12.52:2679"
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br log status --task-name=pitr --pd 172.20.12.52:2679
Detail BR log in /tmp/br.log.2024-03-05T10.58.57+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2024-03-05 10:50:52.939 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data1/backups/pitr
speed(est.): 0.00 ops/s
checkpoint[global]: 2024-03-05 10:58:07.74 +0800; gap=51s
The above output shows that the log backup status is normal. Comparing the outputs at different times reveals that the log backup task is indeed being executed periodically in the background. The checkpoint[global] value indicates that all cluster data written before this checkpoint time has been saved to the backup storage; it represents the most recent time to which the backup data can be restored.
Logical Backup
A logical backup extracts data from TiDB using SQL statements or export tools. In addition to commonly used export statements, TiDB provides a tool called Dumpling, which can export data stored in TiDB or MySQL to SQL or CSV files. For detailed documentation on Dumpling, please refer to Export Data Using Dumpling | PingCAP Documentation Center. A typical Dumpling example is shown below: it exports all non-system table data from the target database in SQL file format, with 8 concurrent threads, an output directory of /tmp/test, intra-table concurrency enabled (-r 200000) to speed up the export, and a maximum single file size of 256 MiB.
dumpling -u root -P 4000 -h 127.0.0.1 --filetype sql -t 8 -o /tmp/test -r 200000 -F256MiB
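For illustration, a variant that exports a single database to CSV instead of SQL might look like the following sketch, using Dumpling's table filter; the database name test and the output directory are placeholders:
# Export only the tables of the test database as CSV files (placeholder names)
dumpling -u root -P 4000 -h 127.0.0.1 --filetype csv -t 8 -o /tmp/test_csv -r 200000 -F 256MiB --filter "test.*"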
TiDB’s Recovery Capabilities
TiDB recovery can be divided into physical backup-based recovery and logical backup-based recovery. Physical backup-based recovery refers to using the br restore command line to restore data, typically for large-scale complete data restoration. Logical backup-based recovery involves importing data, such as files exported by Dumpling, into the cluster, usually for small data sets or a few tables.
Physical Recovery
Physical recovery can be categorized into direct snapshot backup recovery and Point-in-Time Recovery (PITR). Snapshot backup recovery only requires specifying the backup storage path of the snapshot backup. PITR requires specifying the backup storage paths (covering both snapshot and log backup data) and the point in time you want to restore to.
Snapshot Backup Recovery
The complete process of snapshot recovery is as follows (the backup restored here is the snapshot backup created in the example above):
- BR Receives Restore Command: BR receives the br restore command, obtains the snapshot backup storage address and the objects to be restored, and checks whether the objects to be restored exist and meet the requirements.
- BR Schedules Data Restoration: Specific steps include:
  - Requesting PD to disable automatic Region scheduling.
  - Reading and restoring the schema of the backup data.
  - Requesting PD to allocate Regions based on the backup data information and distribute the Regions to TiKV.
  - Sending restore requests to TiKV based on the Regions allocated by PD.
- TiKV Accepts Restore Request and Initializes Restore Worker: TiKV nodes receive the restore request and initialize a restore worker.
- TiKV Restores Data: Specific steps include:
  - Downloading data from the backup storage to the local machine.
  - Rewriting the key-value pairs in the backup data (replacing the table IDs and index IDs).
  - Ingesting the processed SST files into RocksDB.
  - Returning the restoration result to BR.
- BR Retrieves Restoration Results from Each TiKV.
Method of Snapshot Backup Recovery
Snapshot backup recovery can be performed at the cluster, database, and table levels. It is recommended to restore to an empty cluster; if the objects to be restored already exist in the cluster, it will cause a restoration error (except for system tables). Below is an example of a cluster restoration:
[tidb@tidb53 ~]$ tiup br restore full --pd "172.20.12.52:2679" --storage "local:///data1/backups"
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br restore full --pd 172.20.12.52:2679 --storage local:///data1/backups
Detail BR log in /tmp/br.log.2024-03-05T13.08.08+0800
Full Restore <---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/03/05 13:08:27.918 +08:00] [INFO] [collector.go:77] ["Full Restore success summary"] [total-ranges=197] [ranges-succeed=197] [ranges-failed=0] [split-region=786.659µs] [restore-ranges=160] [total-take=19.347776543s] [RestoreTS=448165390238351361] [total-kv=25811349] [total-kv-size=3.561GB] [average-speed=184MB/s] [restore-data-size(after-compressed)=1.847GB] [Size=1846609490] [BackupTS=448162737288380420]
If you want to restore a single database, you just need to add the --db parameter to the restore command. To restore a single table, you must add both the --db and --table parameters to the restore command.
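For illustration, restoring a single database or a single table from the cluster backup above might look like the following sketches, using the db and table subcommands; the database test and table t1 are placeholders, and the output is omitted:
# Restore one database, then one table, from the earlier snapshot backup (placeholder names)
tiup br restore db --db test --pd "172.20.12.52:2679" --storage "local:///data1/backups"
tiup br restore table --db test --table t1 --pd "172.20.12.52:2679" --storage "local:///data1/backups"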
PITR (Point-in-Time Recovery)
The command for PITR is br restore point .... When initializing the restoration cluster, you must use the --full-backup-storage parameter to indicate the storage address of the snapshot backup. The --restored-ts parameter specifies the point in time you want to restore to; if this parameter is not specified, the restoration is done to the latest recoverable time point. Additionally, if you only want to restore log backup data, you need to use the --start-ts parameter to specify the starting time point for the log backup restoration; a sketch of this usage follows the example below.
Here is an example of a point-in-time recovery that includes snapshot recovery:
[tidb@tidb53 ~]$ tiup br restore point --pd "172.20.12.52:2679" --full-backup-storage "local:///data1/backups/fullbk" --storage "local:///data1/backups/pitr" --restored-ts "2024-03-05 13:38:28+0800"
tiup is checking updates for component br ...
Starting component `br`: /home/tidb/.tiup/components/br/v7.6.0/br restore point --pd 172.20.12.52:2679 --full-backup-storage local:///data1/backups/fullbk --storage local:///data1/backups/pitr --restored-ts 2024-03-05 13:38:28+0800
Detail BR log in /tmp/br.log.2024-03-05T13.45.02+0800
Full Restore <---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/03/05 13:45:24.620 +08:00] [INFO] [collector.go:77] ["Full Restore success summary"] [total-ranges=111] [ranges-succeed=111] [ranges-failed=0] [split-region=644.837µs] [restore-ranges=75] [total-take=21.653726346s] [BackupTS=448165866711285765] [RestoreTS=448165971332694017] [total-kv=25811349] [total-kv-size=3.561GB] [average-speed=164.4MB/s] [restore-data-size(after-compressed)=1.846GB] [Size=1846489912]
Restore Meta Files <......................................................................................................................................................................................> 100%
Restore KV Files <........................................................................................................................................................................................> 100%
[2024/03/05 13:45:26.944 +08:00] [INFO] [collector.go:77] ["restore log success summary"] [total-take=2.323796546s] [restore-from=448165866711285765] [restore-to=448165867159552000] [restore-from="2024-03-05 13:38:26.29 +0800"] [restore-to="2024-03-05 13:38:28 +0800"] [total-kv-count=0] [skipped-kv-count-by-checkpoint=0] [total-size=0B] [skipped-size-by-checkpoint=0B] [average-speed=0B/s]
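If the snapshot data has already been restored and you only want to replay a subsequent range of log backup data, a log-only restore might look like the following sketch; --full-backup-storage is omitted, --start-ts marks where the replay begins, and both timestamps here are placeholders:
# Replay only log backup data between two points in time (placeholder timestamps)
tiup br restore point --pd "172.20.12.52:2679" --storage "local:///data1/backups/pitr" --start-ts "2024-03-05 13:38:28+0800" --restored-ts "2024-03-05 14:00:00+0800"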
Logical Recovery
Logical recovery can also be understood as data import. Besides general SQL import, TiDB supports importing data using the Lightning tool. Lightning is a tool for importing data from static files into the TiDB cluster, commonly used for initial data import.
For more details, please refer to the official TiDB Lightning Overview | PingCAP Documentation Center.
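For illustration, a minimal sketch of importing the Dumpling output from the earlier example with TiDB Lightning might look like this; the flags shown are Lightning's commonly documented command-line parameters, the tidb backend is chosen here only for simplicity (the local backend, configured through a TOML file, is usually preferred for large imports), and the addresses are placeholders, so check the linked documentation before relying on them:
# Import the files exported by Dumpling into the cluster (placeholder addresses)
tiup tidb-lightning --backend tidb -d /tmp/test --tidb-host 172.20.12.52 --tidb-port 4000 --tidb-user root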
Conclusion
This article provides a detailed summary of TiDB’s backup and restoration capabilities and basic usage methods. TiDB’s backup and restoration mechanisms ensure data security and consistency and meet data protection needs at various scales and scenarios through flexible strategies and tool support. For more in-depth technical details and operational guidance, please refer to the PingCAP official documentation center.