TiDB Lightning is a tool for fast, full import of large amounts of data into a TiDB cluster. Currently, TiDB Lightning supports reading SQL dumps exported via Mydumper, or CSV data sources. You can use it in the following two scenarios:
The TiDB Lightning tool set consists of two components:
tidb-lightning (the “front end”) reads the data source and imports the database structure into the TiDB cluster. It also transforms the data into Key-Value (KV) pairs and sends them to tikv-importer.
tikv-importer (the “back end”) combines and sorts the KV pairs and then imports these sorted pairs as a whole into the TiKV cluster.
The complete import process is as follows:
tidb-lightning switches the TiKV cluster to “import mode”, which optimizes the cluster for writing and disables automatic compaction.
tidb-lightning creates the skeleton of all tables from the data source.
Each table is split into multiple continuous batches, so that data from a huge table (200 GB+) can be delivered incrementally.
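The batching in this step can be pictured with a minimal Python sketch. It assumes a table's data files are grouped, in order, into contiguous batches that stay under a size limit. The function name `split_into_batches`, the file names, and the size limit are hypothetical and chosen only for illustration. The actual batch sizing rules are not described here.

```python
from typing import List, Tuple


def split_into_batches(
    files: List[Tuple[str, int]], max_batch_bytes: int
) -> List[List[str]]:
    """Group (file_name, size_in_bytes) pairs, in order, into contiguous batches."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        # Close the current batch when adding this file would exceed the limit.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Hypothetical data files of one table, with sizes in arbitrary units.
files = [("t.001.sql", 60), ("t.002.sql", 50), ("t.003.sql", 40), ("t.004.sql", 90)]
print(split_into_batches(files, max_batch_bytes=100))
# [['t.001.sql'], ['t.002.sql', 't.003.sql'], ['t.004.sql']]
```

Because the batches are contiguous and ordered, each one can be encoded and delivered on its own without waiting for the rest of the table.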
For each batch, tidb-lightning instructs tikv-importer via gRPC to create engine files to store KV pairs.
tidb-lightning then reads the data source in parallel, transforms each row into KV pairs according to the TiDB rules, and sends them to tikv-importer's engine files.
Once a complete engine file is written, tikv-importer divides and schedules the data and imports it into the target TiKV cluster.
There are two kinds of engine files, data engines and index engines, corresponding respectively to the two kinds of KV pairs: row data and secondary indices. Normally, the row data are entirely sorted in the data source, while the secondary indices are out of order. Because of this, the data engines are uploaded as soon as a batch is completed, while the index engines are imported only after all batches of the entire table are encoded.
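To see why the two engine kinds are handled differently, consider the following Python sketch. It is an illustration only: the two-column table, the `row/` and `idx/` key layouts, and the `encode_batch` helper are hypothetical and far simpler than TiDB's real key encoding.

```python
from typing import List, Tuple

Row = Tuple[int, str]   # (row_id, email): a hypothetical two-column table
KV = Tuple[str, str]    # (key, value)


def encode_batch(rows: List[Row]) -> Tuple[List[KV], List[KV]]:
    """Return (data_kvs, index_kvs) for one batch of rows."""
    data_kvs = [(f"row/{row_id:08d}", email) for row_id, email in rows]
    index_kvs = [(f"idx/{email}", f"{row_id:08d}") for row_id, email in rows]
    return data_kvs, index_kvs


batches = [
    [(1, "zoe@example.com"), (2, "bob@example.com")],
    [(3, "amy@example.com"), (4, "max@example.com")],
]

imported: List[KV] = []       # stands in for the target cluster
index_engine: List[KV] = []   # accumulates index pairs across all batches

for rows in batches:
    data_kvs, index_kvs = encode_batch(rows)
    # Row keys follow the source order, so each batch's data engine is
    # already a sorted, self-contained range and can be imported right away.
    imported.extend(sorted(data_kvs))
    # Index keys from different batches interleave, so they are only collected here.
    index_engine.extend(index_kvs)

# Only after every batch is encoded can the index pairs be sorted as a whole.
imported.extend(sorted(index_engine))

for key, value in imported:
    print(key, value)
```

In this toy data the index keys arrive as zoe, bob, amy, max. No single batch holds a contiguous key range, which is why the index engine waits until the whole table is encoded.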
After all engines associated with a table are imported, tidb-lightning performs a checksum comparison between the local data source and the data in the cluster, to ensure there is no data corruption in the process; tells TiDB to ANALYZE all imported tables, to prepare for optimal query planning; and adjusts the AUTO_INCREMENT value so future insertions will not cause conflicts.
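The checksum comparison in this step can be sketched in Python as follows. The sketch assumes an order-independent checksum built by XOR-ing a per-pair CRC, together with a pair count and a byte count. `zlib.crc32` is used purely for illustration, and the `kv_checksum` helper and sample pairs are hypothetical. The text above does not specify the actual algorithm.

```python
import zlib
from typing import Iterable, Tuple


def kv_checksum(pairs: Iterable[Tuple[bytes, bytes]]) -> Tuple[int, int, int]:
    """Return (xor_of_crcs, total_kvs, total_bytes) for a set of KV pairs."""
    checksum = total_kvs = total_bytes = 0
    for key, value in pairs:
        # XOR is commutative, so the result does not depend on pair order.
        checksum ^= zlib.crc32(key + value)
        total_kvs += 1
        total_bytes += len(key) + len(value)
    return checksum, total_kvs, total_bytes


local = [(b"row/1", b"alice"), (b"row/2", b"bob")]
remote_ok = list(reversed(local))                          # same pairs, different order
remote_bad = [(b"row/1", b"alice"), (b"row/2", b"bot")]    # one corrupted value

print(kv_checksum(local) == kv_checksum(remote_ok))    # True
print(kv_checksum(local) == kv_checksum(remote_bad))   # False
```

An order-independent checksum suits this setting because the source is read in parallel and the cluster stores pairs in key order, so the two sides never see the pairs in the same sequence.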
The auto-increment ID of a table is computed from an estimated upper bound on the number of rows, which is proportional to the total size of the table's data files. Therefore, the final auto-increment ID is often much larger than the actual number of rows. This is expected, since in TiDB auto-increment IDs are not necessarily allocated sequentially.
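A small worked example shows how such an estimate overshoots. This Python sketch assumes the upper bound is the total file size divided by a minimum possible row size. The `estimate_row_upper_bound` helper and the `min_bytes_per_row` parameter are hypothetical, because the exact proportionality constant is not given here.

```python
def estimate_row_upper_bound(total_file_bytes: int, min_bytes_per_row: int) -> int:
    """Upper bound on the row count, assuming no row is smaller than min_bytes_per_row."""
    if min_bytes_per_row <= 0:
        raise ValueError("min_bytes_per_row must be positive")
    # Ceiling division, so a partial trailing row still counts as one row.
    return -(-total_file_bytes // min_bytes_per_row)


total_file_bytes = 1_000_000   # combined size of the table's data files
actual_rows = 20_000           # for example, rows averaging 50 bytes each

next_auto_increment = estimate_row_upper_bound(total_file_bytes, min_bytes_per_row=4)

print(next_auto_increment)                  # 250000
print(next_auto_increment >= actual_rows)   # True
```

The bound only has to be safe, not tight. Any value at or above the real row count avoids collisions with imported rows, at the cost of a gap in the ID sequence.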
tidb-lightning switches the TiKV cluster back to “normal mode”, so the cluster resumes normal services.