Welcome back. In the previous video, I compared query optimization to GPS navigation and an address to being the same as an SQL query. To extend that analogy just a little bit further: a GPS navigation system may have multiple routes to navigate you to the same address. In query optimization parlance, we would say that there are multiple possible execution plans for a query. The database system chooses what it determines to be the most efficient, or lowest cost plan. And we can see what that is with the EXPLAIN command.

TiDB and MySQL both have query optimizers, so coming from a MySQL background this may not be new. But both the format of EXPLAIN, and the types of optimizations (or routes possible) in TiDB do differ from MySQL. TiDB is also a distributed system, so in determining the lowest cost plan it needs to estimate the CPU cost, memory cost and network transfer cost.

So let's look at the current best plan for this query. To do that, we just add the word EXPLAIN, so EXPLAIN SELECT count(*) FROM trips WHERE start_date BETWEEN '2017-07-01 00:00:00' AND '2017-07-01 23:59:59'.

MySQL [bikeshare]> EXPLAIN SELECT count(*) FROM trips WHERE start_date BETWEEN '2017-07-01 00:00:00' AND '2017-07-01 23:59:59';
| id                       | count      | task | operator info                                                                                                          |
| StreamAgg_20             | 1.00       | root | funcs:count(col_0)                                                                                                     |
| └─TableReader_21         | 1.00       | root | data:StreamAgg_9                                                                                                       |
|   └─StreamAgg_9          | 1.00       | cop  | funcs:count(1)                                                                                                         |
|     └─Selection_19       | 40444.43   | cop  | ge(bikeshare.trips.start_date, 2017-07-01 00:00:00.000000), le(bikeshare.trips.start_date, 2017-07-01 23:59:59.000000) |
|       └─TableScan_18     | 1617777.00 | cop  | table:trips, range:[-inf,+inf], keep order:false                                                                       |
5 rows in set (0.00 sec)

Here we can see that the coprocessor (cop) needs to scan the table trips to find rows that match the criteria of start_date. Coprocessor is just a way of saying “tasks that happen in TiKV, not TiDB”. We will cover what that means soon, but for performance the TiDB Platform pushes parts of query execution into TiKV. This helps reduce the amount of data that needs to be copied around.

Rows that meet this criteria are determined in Selection_19 and passed to StreamAgg_9, all still within the coprocessor (i.e. all still inside of TiKV). The count column shows an approximate number of rows that will be processed, which is estimated with the help of table statistics. In this query, it is estimated that each of the TiKV nodes will return 1.00 row to TiDB (as TableReader_21), which are then aggregated as StreamAgg_20 to return an estimated 1.00 row to the client.

The good news with this query is that most of the work is pushed down to the coprocessor. This means that minimal data transfer way required for query execution. However, what reading EXPLAIN has told me is that there is no index to speed up filtering the predicate start_date. We can see in TableScan_18 that all the rows in the table are being scanned. Ouch!

So let's fix that with an index, which takes about 7 minutes. So I'll fast forward:

MySQL [bikeshare]>  ALTER TABLE trips ADD INDEX (start_date);
Query OK, 0 rows affected (6 min 51.40 sec)
MySQL [bikeshare]> EXPLAIN SELECT count(*) FROM trips WHERE start_date BETWEEN '2017-07-01 00:00:00' AND '2017-07-01 23:59:59';
| id                     | count    | task | operator info                                                                                    |
| StreamAgg_25           | 1.00     | root | funcs:count(col_0)                                                                               |
| └─IndexReader_26       | 1.00     | root | index:StreamAgg_9                                                                                |
|   └─StreamAgg_9        | 1.00     | cop  | funcs:count(1)                                                                                   |
|     └─IndexScan_24     | 40444.43 | cop  | table:trips, index:start_date, range:[2017-07-01 00:00:00,2017-07-01 23:59:59], keep order:false |
4 rows in set (0.71 sec)

In the revisited EXPLAIN, we can see the count of rows scanned has reduced via the use of an index. On my Kubernetes cluster, this reduced query execution time reduced from 3.38 seconds to 0.02 seconds!

So a quick rundown of what's in EXPLAIN:

  • It shows the plan that the query optimization determined to be the cheapest.
  • Each of the rows is a set of steps that need to be completed, represented in a tree.
  • Each of the counts are estimates; they are built by keeping statistics on the data distribution such as histograms.
  • Running EXPLAIN doesn't run the actual query.
  • Like with GPS, it's possible that statistics are incorrect and there might be cases where they need updating.
  • Since you can't sight check a query (it's declarative), your job in running EXPLAIN is to make sure that indexes are being used to match various predicates (in this case an index on start_date makes sense!) Try and think WHERE work can be eliminated or filtered out - this is the first place to start.

It's time to try optimizing a few queries for yourself in our next exercise.