tidb2dw

TiDB is an advanced open-source, distributed SQL database engineered to support Hybrid Transactional and Analytical Processing (HTAP) workloads. However, there are scenarios where you may want to weave TiDB into an existing big-data platform, for example, by replicating data from TiDB to Snowflake. This could be for streamlined migrations or to speed up integration with pre-existing tools.

To meet these needs, we’re excited to introduce tidb2dw, an open-source tool designed to synchronize TiDB data effortlessly with third-party data warehouses such as Snowflake and Amazon Redshift. 

In this post, we’ll guide you through establishing a data replication pipeline from TiDB to Snowflake using tidb2dw. With this configuration, you can tap into the Snowflake ecosystem to work with your TiDB cluster data. To help you better understand the tool, we will also dive into how tidb2dw works.

Preparing your environment

Before diving into the guide, ensure you have the following prerequisites in place:

  • A TiDB cluster with TiCDC
    • To add TiCDC to an existing cluster, refer to the TiCDC documentation.
  • An AWS account
    • An S3 bucket (e.g., s3://my-demo-bucket). This serves as a data bridge between TiDB and Snowflake.
    • IAM Key credentials (both Access Key and Secret Key) for programmatic access to the S3 bucket from TiDB and Snowflake.
  • A Snowflake account:
    • Ensure you have your Snowflake Organization ID and Account ID. You can locate these IDs in the Snowflake Accounts page.
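
The examples in this post replicate data into a tpcc database and tpcc schema in Snowflake. If these objects do not already exist in your account, you may need to create them up front; here is a minimal sketch (it assumes tidb2dw does not create the target database and schema for you):

CREATE DATABASE IF NOT EXISTS tpcc;      -- target database used in the example below
CREATE SCHEMA IF NOT EXISTS tpcc.tpcc;   -- target schema used in the example below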

Replicating Data to Snowflake

Take the steps below to get started with tidb2dw.

Step 1: Install tidb2dw

Download the release binary that matches your platform (choose one of the following), then rename the downloaded file to tidb2dw and make it executable (chmod +x tidb2dw):

wget https://github.com/pingcap-inc/tidb2dw/releases/download/v0.0.2/tidb2dw-v0.0.2-linux-amd64
wget https://github.com/pingcap-inc/tidb2dw/releases/download/v0.0.2/tidb2dw-v0.0.2-linux-arm64
wget https://github.com/pingcap-inc/tidb2dw/releases/download/v0.0.2/tidb2dw-v0.0.2-darwin-amd64

Step 2: Run tidb2dw

tidb2dw automates both snapshot data loading and incremental data merging. As an example, we will run the following script to replicate a single table, test.orders, to the tpcc.tpcc schema in Snowflake:

export AWS_ACCESS_KEY_ID=<ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<SECRET_KEY>
export AWS_SESSION_TOKEN=<SESSION_TOKEN>  # Optional

./tidb2dw snowflake \
    --storage s3://my-demo-bucket/dump20230628 \
    --table test.orders \
    --snowflake.account-id .... \
    --snowflake.user .... \
    --snowflake.pass .... \
    --snowflake.database tpcc \
    --snowflake.schema tpcc

Note: Additional parameters such as --cdc.host, --tidb.host, --tidb.user, and --tidb.pass, might be necessary. For further information, use the --help option.

If you want to replicate multiple tables, start one tidb2dw instance per table, each with a different --table value.

To keep loading incremental data continuously, tidb2dw must remain running. If tidb2dw restarts, the replication process resumes.

Step 3: Verify replication

After starting the replication, it’s recommended that you verify it. Continuing with the example in Step 2, you can verify the operation by:

  • Checking the log. If you see the log entry “Successfully replicated snapshot from TiDB to Snowflake”, the snapshot replication is finished, and tidb2dw is continuously replicating incremental data.
  • Querying the data in Snowflake, as shown in the example below.
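
A simple check is to compare row counts between TiDB and Snowflake. The query below is a minimal sketch against the example’s target schema; it assumes tidb2dw creates the table with a quoted, lowercase name, as in the COPY INTO example later in this post:

USE SCHEMA tpcc.tpcc;
SELECT COUNT(*) FROM "orders";   -- should match SELECT COUNT(*) FROM test.orders in TiDB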

How tidb2dw replicates data from TiDB to Snowflake  

Now that you have seen tidb2dw in action, you may be wondering how it works. In this section, let’s dive a little deeper into the tool. tidb2dw replicates three kinds of data: snapshot data, incremental data, and Data Definition Language (DDL) changes.

Replicate snapshot data

Snapshot data refers to the data that already exists in the TiDB table when replication starts. tidb2dw replicates snapshot data in two steps:

  1. Export the TiDB table data into CSV files using Dumpling, TiDB’s data export tool. For example:
    $ tiup dumpling -u root -P <tidb_port> -h <tidb_host> --filetype csv \
        -t <thread_concurrency> -o s3://<bucket>/dump \
        --s3.region <s3_region> \
        --no-header
    
  2. Load the CSV data into a Snowflake table via the COPY INTO command:
    CREATE STORAGE INTEGRATION my_s3
        TYPE = EXTERNAL_STAGE
        STORAGE_PROVIDER = 'S3'
        ENABLED = TRUE
        STORAGE_AWS_ROLE_ARN = ''
        STORAGE_ALLOWED_LOCATIONS = ('s3:///');

    CREATE OR REPLACE FILE FORMAT my_csv_format
        TYPE = CSV
        EMPTY_FIELD_AS_NULL = FALSE
        NULL_IF = ('\N')
        FIELD_OPTIONALLY_ENCLOSED_BY = '"';

    CREATE STAGE my_s3_stage
        STORAGE_INTEGRATION = my_s3
        URL = 's3:///dump'
        FILE_FORMAT = my_csv_format;

    COPY INTO "stock"
        FROM '@tpcc.tpcc.my_s3_stage/tpcc.stock.000000000.csv'
        ON_ERROR = CONTINUE;

Replicate incremental data

Data written after the snapshot timestamp is called incremental data. tidb2dw replicates incremental data in two steps:

  1. TiCDC, the change data capture tool of TiDB, tracks any changes (INSERT, UPDATE or DELETE) and writes the changed rows into S3.
    $ tiup cdc cli changefeed create \
        --server=http://: \
        --sink-uri="s3:///cdc?protocol=csv&flush-interval=5m&file-size=268435456" \
        --start-ts=

    With the parameters above, TiCDC flushes changes into a CSV file in S3 at a tunable interval (5 minutes in this example), or immediately once the accumulated changes exceed 256 MB (268435456 bytes). Usually, a shorter flush interval requires more compute resources in Snowflake.

  2. Apply these changes to the corresponding Snowflake table using Snowflake’s MERGE INTO command. The first field of each row in a TiCDC CSV file indicates the operation: U for UPDATE, I for INSERT, and D for DELETE. This allows tidb2dw to easily determine which SQL statements to execute in Snowflake.
Type  Table        Database  PK   Col…
I     chbenchmark  customer  …    …
U     chbenchmark  customer  …    …
D     chbenchmark  customer  …    …
U     chbenchmark  customer  …    …

The following example is a Snowflake SQL statement that merges a CSV into a Snowflake table:

MERGE INTO "stock" AS T USING
(
  SELECT 
  $1 AS "METADATA$FLAG"
  , $5 AS "s_i_id"
  , $6 AS "s_w_id"
  , $7 AS "s_quantity"
  , $8 AS "s_dist_01"
   …
  , $20 AS "s_remote_cnt"
  , $21 AS "s_data"
  FROM '@"TPCC"."TPCC"."MY_S3_STAGE_STOCK"/stock/441876501249130510/CDC00000000000000000001.csv'
  QUALIFY row_number() over (partition by "s_w_id", "s_i_id" order by $4 desc) = 1 -- get the last dml events of the same key
) AS S
ON
(
  T."s_w_id" = S."s_w_id" AND T."s_i_id" = S."s_i_id"
)
WHEN MATCHED AND S.METADATA$FLAG = 'U' THEN 
  UPDATE SET "s_quantity" = S."s_quantity", "s_dist_01" = S."s_dist_01", …, "s_remote_cnt" = S."s_remote_cnt", "s_data" = S."s_data"
WHEN MATCHED AND S.METADATA$FLAG = 'D' THEN
  DELETE
WHEN NOT MATCHED AND S.METADATA$FLAG != 'D' THEN
  INSERT ("s_i_id", "s_w_id", "s_quantity", "s_dist_01", …, "s_remote_cnt", "s_data")
  VALUES (S."s_i_id", S."s_w_id", S."s_quantity", S."s_dist_01", …, S."s_remote_cnt", S."s_data")
;

Replicate DDL 

tidb2dw also supports replicating DDL changes to the downstream, in two steps:

  1. When the structure of a TiDB table changes, TiCDC records the change in schema files. For example, the following DDL statement renames a column of the customer table:
    mysql> alter table customer rename column c_id to c_id_tmp;
    Query OK, 0 rows affected (0.08 sec)

    TiCDC then generates one data file and two schema files, one describing the table schema before the change and one describing it after.

  2. Apply the change in Snowflake. tidb2dw compares the two schema files to identify the change, then applies it in Snowflake with the corresponding DDL statement:
    Diff
         "TableColumns": [
             {
                 "ColumnId": "1",
    -            "ColumnName": "c_id",
    +            "ColumnName": "c_id_tmp",
                 "ColumnType": "INT",
                 "ColumnPrecision": "11",
                 "ColumnNullable": "false", 
    
    
    SQL
    ALTER TABLE customer RENAME COLUMN c_id TO c_id_tmp;
    

Known limitations of tidb2dw 

tidb2dw supports a range of DDL statements that alter a table’s schema (excluding index-related statements). These include the following; illustrative examples appear after the list:

  • Adding columns
  • Dropping columns
  • Renaming columns
  • Modifying column types
  • Dropping tables
  • Truncating tables
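
For reference, the following TiDB statements illustrate the supported DDL categories above. The table and column names are hypothetical:

ALTER TABLE customer ADD COLUMN c_note VARCHAR(64);              -- add a column
ALTER TABLE customer DROP COLUMN c_note;                         -- drop a column
ALTER TABLE customer RENAME COLUMN c_id TO c_id_tmp;             -- rename a column
ALTER TABLE customer MODIFY COLUMN c_credit_lim DECIMAL(14, 2);  -- modify a column type
TRUNCATE TABLE customer;                                         -- truncate a table
DROP TABLE customer;                                             -- drop a table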

However, users must be aware of the following limitations and caveats:

  • Partitioned tables: Snowflake has no concept of partitioned tables. In the eyes of tidb2dw, a TiDB table with multiple partitions is treated as a single standard table.
  • Column type modifications: While TiDB is flexible about modifying column types, Snowflake imposes numerous restrictions. For example, Snowflake doesn’t permit updating the default value of a column. Consult the Snowflake documentation for a comprehensive list of such constraints.
  • DDL-DML dependency: Before executing any DDL on a table, ensure at least one DML has been executed on that table; failing to do so can result in errors. See the sketch after this list.
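
To illustrate the DDL-DML caveat, the sketch below (using the same hypothetical table and column names as above) shows the intended ordering:

INSERT INTO customer (c_id, c_name) VALUES (1, 'placeholder');  -- at least one DML on the table first
ALTER TABLE customer ADD COLUMN c_note VARCHAR(64);             -- then the DDL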

Conclusion

With tidb2dw, you can effortlessly replicate data from your TiDB cluster to Snowflake or other third-party data warehouses. This is just the beginning: our engineering team is moving fast to bring even more features to tidb2dw, including:

  • Support for more warehouses such as Redshift and BigQuery.
  • Available as a cloud service: To further simplify the user experience, we plan to make tidb2dw available on TiDB Dedicated, our fully managed cloud DBaaS for predictable workloads. You can then deploy tidb2dw with just a few clicks.

Do note that tidb2dw is in its early stages of development, and we appreciate the community’s patience and feedback. If you stumble upon any issues, please don’t hesitate to open an issue or file a PR. The journey ahead is promising; stay tuned!

