## Subquery Optimization in TiDB

## Introduction to subqueries

Subquery is a query within another SQL query. A common subquery is embedded within the `FROM`

clause, for example：

`SELECT ID FROM (SELECT * FROM SRC) AS T`

The subexpressions in the `FROM`

clauses can be processed very well by the general SQL optimizers. But when it comes to subqueries in the `WHERE`

clause or the `SELECT`

lists, it becomes very difficult to optimize because subqueries can be anywhere in the expression, e.g. in the `CASE...WHEN...`

clauses.

The subqueries that are not in the `FROM`

clause are categorized as "correlated subquery" and "uncorrelated subquery". Correlated subquery refers to a subquery with columns from outer references, for example:

```
SELECT * FROM SRC WHERE
EXISTS(SELECT * FROM TMP WHERE TMP.id = SRC.id)
```

Uncorrelated subqueries can be pre-processed in the plan phase and be re-written to a constant. Therefore, this article is mainly focused on the optimization of correlated subqueries.

Generally speaking, there are following three types of subqueries:

- Scalar Subquery like (
`SELECT...`

) + (`SELECT...`

) - Quantified Comparison like
`T.a = ANY(SELECT...)`

- Existential Test like
`NOT EXISTS(SELECT...)`

,`T.a IN (SELECT...)`

For the simple subqueries like Existential Test, the common practice is to rewrite them to `SemiJoin`

. But it is barely explored in the literature about the generic algorithm and what kind of subqueries need to remove the correlation. For those subqueries whose correlation cannot be removed, the common practice in databases is to execute in Nested Loop, which is called correlated execution.

TiDB inherits the subquery strategy in SQL Server [1]. It introduces the `Apply`

operator to use algebraic representation for subqueries which is called normalization, and then removes the correlation based on the cost information.

## The `Apply`

operator

The reason why subqueries are difficult to optimize is that a subquery cannot be represented as a logic operator like `Projection`

or `Join`

, which makes it difficult to find a generic algorithm for subquery transformation. So the first thing is to introduce a logical operation that can represent the subqueries: the `Apply`

operator, which is also called `d-Join`

[2].
The semantics of the `Apply`

operator is:

\[ R\ A^{\otimes}\ E = \bigcup\limits_{r\in R} (\{r\}\otimes E(r)) \]

where `E`

represents a parameterized subquery. In every execution, the `Apply`

operator gets an `r`

record from the `R`

relation and sends `r`

to `E`

as a parameter for the ⊗ operation of `r`

and `E(r)`

. ⊗ is different based on different query types, usually it's `SemiJoin`

`∃`

.

For the following SQL statement:

```
SELECT * FROM SRC WHERE
EXISTS(SELECT * FROM TMP WHERE TMP.id = SRC.id)
```

the `Apply`

operator representation is as follows:

Because the operator above `Apply`

is `Selection`

, formally, it is:

\[ {SRC}\ A^\exists\ \sigma_{SRC.id=TMP.id}{TMP} \]

For the `EXISTS`

subquery in the `SELECT`

list, and the data that cannot pass through the `SRC.id=TMP.id`

equation, the output should be false. So `OuterJoin`

should be used:

\[ \pi_C({SRC}\ A^{LOJ}\ \sigma_{SRC.id=TMP.id}{TMP}) \]

The `C`

Projection is to transform NULL to false. But the more common practice is: If the output of the `Apply`

operator is directly used by the query predicate, it is converted to `SemiJoin`

.

## Removing the correlation

The introduction of the `Apply`

operator enables us to remove the correlation of the subqueries. The two examples in the previous section can be transformed to:

\[ {SRC}\ \exists_{\sigma_{SRC.id = TMP.id}}\ {TMP} \]

and

\[ {SRC}\ LOJ_{\sigma_{SRC.id = TMP.id}}\ {TMP} \]

Other rules to remove correlation can be formally represented as:

\(R\ A^{\otimes} E= R\ {\otimes}_{true}\ E\), if no parameters in `E`

resolved from `R`

(1)

\(R\ A^{\otimes} (\sigma_pE) = R\ {\otimes}_p\ E\), if no parameters in `E`

resolved from `R`

(2)

\(R\ A^\times\ (\sigma_pE)=\sigma_p(R\ A^\times\ E) \) (3)

\(R\ A^\times\ (\pi_vE) = \pi_{v\bigcup\mathrm{cols}(R)}(R\ A^\times\ E) \) (4)

\(R\ A^\times\ (E_1\ \bigcup\ E_2) = (R\ A^\times\ E_1)\ \bigcup\ (R\ A^\times\ E_2) \) (5)

\(R\ A^\times\ (E_1\ - \ E_2) = (R\ A^\times\ E_1)\ - \ (R\ A^\times\ E_2) \) (6)

\(R\ A^\times\ (E_1\ \times \ E_2) = (R\ A^\times\ E_1)\ \Join_{R.key}\ (R\ A^\times\ E_2) \) (7)

\(R\ A^\times\ (\mathcal{G}_{A,F}E) = \mathcal{G}_{A\bigcup \mathrm{attr}(R),F} (R\ A^{\times}\ E) \) (8)

\(R\ A^\times\ (\mathcal{G}^1_FE) = \mathcal{G}_{A\bigcup \mathrm{attr}(R),F'} (R\ A^{LOJ}\ E) \) (9)

Based on the above rules, the correlation among all the SQL subqueries can be removed [3]. But the (5), (6), and (7) rules are seldom used because the the query cost is increased as a result of the rules about common expression. Take the following SQL statement as an example:

```
SELECT C_CUSTKEY
FROM CUSTOMER WHERE 1000000 <
(SELECT SUM(O_TOTALPRICE)
FROM ORDER WHERE O_CUSTKEY = C_CUSTKEY)
```

The two “CUSTKEY”s are the primary keys. When the statement is transformed to `Apply`

, it is represented as:

\[ \sigma_{1000000<X}(CUSTOMER\ A^\times\ \mathcal{G}^1_{X=SUM(O\_PRICE)}(\sigma_{O\_CUSTKEY=C\_CUSTKEY}ORDERS)) \]

Because of the primary keys, according to rule (9), it can be transformed to the following:

\[ \sigma_{1000000<X}\ \mathcal{G}_{C\_CUSTKEY,X = SUM(O\_PRICE)}(CUSTOMER\ A^{LOJ}\ \sigma_{O\_CUSTKEY=C\_CUSTKEY}ORDERS) \]

**Note:**

- If there are no primary keys in
`ORDERS`

, the \(\pi\) operator should be added to allocate a unique key. - Pay attention to the difference between rule (8) and rule (9). For the \(\mathcal{G}^1_F\) aggregation function without the aggregation column, when the input is NULL, the output should be the default value of the
`F`

aggregation function. Therefore, the`LeftOuterJoin`

should be used and a NULL record should be the output when the right table is NULL. In this case, based on rule (2),`Apply`

can be completely removed. The statement can be transformed to a SQL statement with join:

\[ \sigma_{1000000<X}\mathcal{G}_{C\_CUSTKEY,X=SUM(O\_PRICE)}(CUSTOMER\ LOJ_{O\_CUSTKEY=C\_CUSTKEY}ORDERS) \]

Furthermore, based on the simplification of `OuterJoin`

, the statement can be simplified to:

\[ \sigma_{1000000<X}\mathcal{G}_{C\_CUSTKEY,X=SUM(O\_PRICE)}(CUSTOMER\ \Join_{O\_CUSTKEY=C\_CUSTKEY}ORDERS) \]

Theoretically, the above 9 rules have solved the correlation removal problem. But is correlation removal the best solution for all the scenarios? The answer is no. If the results of the SQL statement are small and the subquery can use the index, then the best solution is to use correlated execution. The `Apply`

operator can be optimized to `Segment Apply`

, which is to sort the data of the outer table according to the correlated key. In this case, the keys that are within one group won't have to be executed multiple times. Of course, this is strongly related to the number of distinct values (NDV) of the correlated keys in the outer table. Therefore, the decision about whether to use correlation removal also depends on statistics. When it comes to this point, the regular optimizer is no longer applicable. Only the optimizer with the Volcano or Cascade Style can take both the logic equivalence rules and the cost-based optimization into consideration. Therefore, a perfect solution for subquery depends on an excellent optimizer framework.

## Aggregation and subquery

In the previous section, the final statement is not completely optimized. The aggregation function above `OuterJoin`

and `InnerJoin`

can be pushed down[4]. If `OutJoin`

cannot be simplified, the formal representation of the push-down rule is:

\[ \mathcal{G_{A,F}}(S\ LOJ_p\ R)=\pi_C(S\ LOJ_p(\mathcal{G}_{A-attr(S),F}R)) \]

The \(\pi_C\) above `Join`

is to convert NULL to the default value when the aggregation function accepts empty values. It is worth mentioning that the above formula can be applied only when the following three conditions are met:

- All the columns that are related to
`R`

within the`p`

predicate are in the`Group by`

column. - The key of the
`S`

relation is in the`Group by`

column. - The aggregations in the \(\mathcal{G}\) function only uses the columns in
`R`

.

It is very common to use aggregation functions together with subqueries. The general solution is to use the formal representation of `Apply`

, and remove the correlation based on the rules, then apply the push-down rules of the aggregation function for further optimization.

## References

[1] C. Galindo-Legaria and M. Joshi. “Orthogonal optimization of subqueries and aggregation”. In: *Proc. of the ACM SIGMOD Conf. on Management of Data (2001)*, pp. 571–581.

[2] D. Maier, Q. Wang and L. Shapiro. *Algebraic unnesting of nested object queries*. Tech. rep. CSE-99-013. Oregon Graduate Institute, 1999.

[3] C. A. Galindo-Legaria. *Parameterized queries and nesting equivalences*. Tech. rep. MSR-TR-2000-31. Microsoft, 2001.

[4] W. Yan and P.-A. Larson. “Eager aggregation and lazy aggregation”. In: *Proc. Int. Conf. on Very Large Data Bases (VLDB)* (1995), pp. 345–357.