TiDB + Apache Spark


The TiDB Server and Apache Spark can both execute SQL queries, so the overlap between them can be confusing. Here is a general rule to differentiate the two:

  • TiDB aims to handle 100% of OLTP queries.
  • TiDB aims to handle 80% of ad hoc OLAP queries.

For very long-running queries, Spark has some advantages that come from its managed execution environment. Jobs can be automatically re-run on failure. Additionally, a job that contains multiple queries in a MapReduce-style form can be scheduled across a pool of workers, and if an individual worker fails, its work can be rescheduled on another node. For OLAP queries on TiDB, by contrast, the TiDB server needs to remain healthy for the duration of query execution, and failed queries will not be retried.
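To make the rescheduling idea concrete, here is a toy sketch in plain Python. It is not how Spark's scheduler is implemented, and every name in it (WorkerFailure, run_with_rescheduling, healthy_worker, broken_worker) is hypothetical. It only illustrates the behaviour described above: when a worker fails, the task is handed to another worker instead of failing the whole job.

```python
class WorkerFailure(Exception):
    """Raised when a simulated worker dies while running a task."""


def run_with_rescheduling(tasks, workers, max_attempts=3):
    """Run every task, moving it to the next worker whenever one fails.

    `tasks` is a list of zero-argument callables; `workers` is a list of
    callables that take a task and return its result (both hypothetical).
    """
    results = {}
    for task_id, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Rotate through the pool so a retry lands on a different worker.
            worker = workers[(task_id + attempt) % len(workers)]
            try:
                results[task_id] = worker(task)
                break  # task succeeded, move on to the next one
            except WorkerFailure:
                continue  # reschedule the same task on another worker
        else:
            # Every attempt failed: only now does the whole job fail.
            raise RuntimeError(
                f"task {task_id} failed after {max_attempts} attempts"
            )
    return results


def healthy_worker(task):
    return task()


def broken_worker(task):
    raise WorkerFailure("simulated node failure")


tasks = [lambda: 1 + 1, lambda: 2 * 3, lambda: 10 - 4]
results = run_with_rescheduling(tasks, [broken_worker, healthy_worker])
```

With one broken and one healthy worker, all three tasks still complete because each failed attempt is retried elsewhere. If the pool contains only broken_worker, the call raises RuntimeError, which is closer to the single-server case where a failed query is not retried.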

Over time, the engineering team plans to increase the percentage of queries that the TiDB Server can handle directly, while also enhancing Spark support through the TiSpark driver. Why both?

Spark is a unified analytics engine for large-scale data processing. Looking beyond SQL, there are use cases, such as machine learning and graph processing, for which Apache Spark is well suited. By using the TiSpark driver, you can make use of these capabilities without having to ETL data between systems.