The TiDB Connector for Spark is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer the complex OLAP queries. It takes advantages of both the Spark platform and the distributed TiKV cluster and seamlessly glues to TiDB, the distributed OLTP database, to provide a Hybrid Transactional/Analytical Processing (HTAP) solution to serve as a one-stop solution for both online transactions and analysis.
The TiDB Connector for Spark depends on the TiKV cluster and the PD cluster. You also need to set up a Spark cluster. This document provides a brief introduction to how to setup and use the TiDB Connector for Spark. It requires some basic knowledge of Apache Spark. For more information, see Spark website.
The TiDB Connector for Spark is an OLAP solution that runs Spark SQL directly on TiKV, the distributed storage engine.
For independent deployment of TiKV and the TiDB Connector for Spark, it is recommended to refer to the following recommendations
TiKV parameters (default)
[server] end-point-concurrency = 8 # For OLAP scenarios, consider increasing this parameter [raftstore] sync-log = false [rocksdb] max-background-compactions = 6 max-background-flushes = 2 [rocksdb.defaultcf] block-cache-size = "10GB" [rocksdb.writecf] block-cache-size = "4GB" [rocksdb.raftcf] block-cache-size = "1GB" [rocksdb.lockcf] block-cache-size = "1GB" [storage] scheduler-worker-pool-size = 4
See the Spark official website for the detail hardware recommendations.
The following is a short overview of the TiDB Connector for Spark configuration.
It is recommended to allocate 32G memory for Spark. Please reserve at least 25% of the memory for the operating system and buffer cache.
It is recommended to provision at least 8 to 16 cores on per machine for Spark. Initially, you can assign all the CPU cores to Spark.
See the official configuration on the Spark website. The following is an example based on the
SPARK_EXECUTOR_MEMORY = 32g SPARK_WORKER_MEMORY = 32g SPARK_WORKER_CORES = 8
For the hybrid deployment of the TiDB Connector for Spark and TiKV, add the TiDB Connector for Spark required resources to the TiKV reserved resources, and allocate 25% of the memory for the system.
Download the TiDB Connector for Spark’s jar package here.
Running TiDB Connector for Spark on an existing Spark cluster does not require a reboot of the cluster. You can use Spark’s
--jars parameter to introduce the TiDB Connector for Spark as a dependency:
spark-shell --jars $PATH/tispark-0.1.0.jar
If you want to deploy TiDB Connector for Spark as a default component, simply place the TiDB Connector for Spark jar package into the jars path for each node of the Spark cluster and restart the Spark cluster:
In this way, you can use either
Spark-Shell to use the TiDB Connector for Spark directly.
If you do not have a Spark cluster, we recommend using the standalone mode. To use the Spark Standalone model, you can simply place a compiled version of Spark on each node of the cluster. If you encounter problems, see its official website. And you are welcome to file an issue on our GitHub.
You can download Apache Spark
For the Standalone mode without Hadoop support, use Spark 2.1.x and any version of Pre-build with Apache Hadoop 2.x with Hadoop dependencies. If you need to use the Hadoop cluster, please choose the corresponding Hadoop version. You can also choose to build from the source code to match the previous version of the official Hadoop 2.6. Please note that the TiDB Connector for Spark currently only supports Spark 2.1.x version.
Suppose you already have a Spark binaries, and the current PATH is
SPARKPATH, please copy the TiDB Connector for Spark jar package to the
Execute the following command on the selected Spark Master node:
cd $SPARKPATH ./sbin/start-master.sh
After the above step is completed, a log file will be printed on the screen. Check the log file to confirm whether the Spark-Master is started successfully. You can open the http://spark-master-hostname:8080 to view the cluster information (if you does not change the Spark-Master default port number). When you start Spark-Slave, you can also use this panel to confirm whether the Slave is joined to the cluster.
Similarly, you can start a Spark-Slave node with the following command:
After the command returns, you can see if the Slave node is joined to the Spark cluster correctly from the panel as well. Repeat the above command at all Slave nodes. After all Slaves are connected to the master, you have a Standalone mode Spark cluster.
If you want to use JDBC server and interactive SQL shell, please copy
start-tithriftserver.sh stop-tithriftserver.sh to your Spark’s sbin folder and
tispark-sql to the bin folder.
To start interactive shell:
To use Thrift Server, you can start it similar way as default Spark Thrift Server:
And stop it like below:
Assuming that you have successfully started the TiDB Connector for Spark cluster as described above, here’s a quick introduction to how to use Spark SQL for OLAP analysis. Here we use a table named
lineitem in the
tpch database as an example.
Assuming that your PD node is located at
2379, add the following command to
And then enter the following command in the Spark-Shell:
import org.apache.spark.sql.TiContext val ti = new TiContext(spark) ti.tidbMapDatabase ("tpch")
After that you can call Spark SQL directly:
spark.sql("select count(*)from lineitem").show
The result is:
+-------------+ | Count (1) | +-------------+ | 600000000 | +-------------+
TiSpark’s SQL Interactive shell is almost the same as the spark-SQL shell.
tispark-sql> use tpch; Time taken: 0.015 seconds tispark-sql> select count(*) from lineitem; 2000 Time taken: 0.673 seconds, Fetched 1 row(s)
For JDBC connection with Thrift Server, you can try it with various JDBC supported tools including SQuirreLSQL and hive-beeline. For example, to use it with beeline:
./beeline Beeline version 1.2.2 by Apache Hive beeline> !connect jdbc:hive2://localhost:10000 1: jdbc:hive2://localhost:10000> use testdb; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (0.013 seconds) select count(*) from account; +-----------+--+ | count(1) | +-----------+--+ | 1000000 | +-----------+--+ 1 row selected (1.97 seconds)
TiSparkR is a thin layer built to support the R language with TiSpark. Refer to this document for usage.
TiSpark on PySpark is a Python package build to support the Python language with TiSpark. Refer to this document for usage.
Q: What are the pros/cons of independent deployment as opposed to a shared resource with an existing Spark / Hadoop cluster?
A: You can use the existing Spark cluster without a separate deployment, but if the existing cluster is busy, TiDB Connector for Spark will not be able to achieve the desired speed.
Q: Can I mix Spark with TiKV?
A: If TiDB and TiKV are overloaded and run critical online tasks, consider deploying the TiDB Connector for Spark separately. You also need to consider using different NICs to ensure that OLTP’s network resources are not compromised and affect online business. If the online business requirements are not high or the loading is not large enough, you can consider mixing the TiDB Connector for Spark with TiKV deployment.