Introduction to subqueries
Subquery is a query within another SQL query. A common subquery is embedded within the FROM
clause, for example:
SELECT ID FROM (SELECT * FROM SRC) AS T
The subexpressions in the FROM
clauses can be processed very well by the general SQL optimizers. But when it comes to subqueries in the WHERE
clause or the SELECT
lists, it becomes very difficult to optimize because subqueries can be anywhere in the expression, e.g. in the CASE...WHEN...
clauses.
The subqueries that are not in the FROM
clause are categorized as “correlated subquery” and “uncorrelated subquery”. Correlated subquery refers to a subquery with columns from outer references, for example:
SELECT * FROM SRC WHERE
EXISTS(SELECT * FROM TMP WHERE TMP.id = SRC.id)
Uncorrelated subqueries can be pre-processed in the plan phase and be re-written to a constant. Therefore, this article is mainly focused on the optimization of correlated subqueries.
Generally speaking, there are following three types of subqueries:
- Scalar Subquery like (
SELECT...
) + (SELECT...
) - Quantified Comparison like
T.a = ANY(SELECT...)
- Existential Test like
NOT EXISTS(SELECT...)
,T.a IN (SELECT...)
For the simple subqueries like Existential Test, the common practice is to rewrite them to SemiJoin
. But it is barely explored in the literature about the generic algorithm and what kind of subqueries need to remove the correlation. For those subqueries whose correlation cannot be removed, the common practice in databases is to execute in Nested Loop, which is called correlated execution.
TiDB inherits the subquery strategy in SQL Server [^1]. It introduces the Apply
operator to use algebraic representation for subqueries which is called normalization, and then removes the correlation based on the cost information.
The Apply
operator
The reason why subqueries are difficult to optimize is that a subquery cannot be represented as a logic operator like Projection
or Join
, which makes it difficult to find a generic algorithm for subquery transformation. So the first thing is to introduce a logical operation that can represent the subqueries: the Apply
operator, which is also called d-Join
[^2].
The semantics of the Apply
operator is:
$$
R A^{otimes} E = bigcuplimits_{rin R} ({r}otimes E(r))
$$
where E
represents a parameterized subquery. In every execution, the Apply
operator gets an r
record from the R
relation and sends r
to E
as a parameter for the ⊗ operation of r
and E(r)
. ⊗ is different based on different query types, usually it’s SemiJoin
∃
.
For the following SQL statement:
SELECT * FROM SRC WHERE
EXISTS(SELECT * FROM TMP WHERE TMP.id = SRC.id)
the Apply
operator representation is as follows:
Because the operator above Apply
is Selection
, formally, it is:
$$
SRC A^exists sigma_{SRC.id=TMP.id}TMP
$$
For the EXISTS
subquery in the SELECT
list, and the data that cannot pass through the SRC.id=TMP.id
equation, the output should be false. So OuterJoin
should be used:
$$
pi_C({SRC} A^{LOJ} sigma_{SRC.id=TMP.id}TMP)
$$
The C
Projection is to transform NULL to false. But the more common practice is: If the output of the Apply
operator is directly used by the query predicate, it is converted to SemiJoin
.
Removing the correlation
The introduction of the Apply
operator enables us to remove the correlation of the subqueries. The two examples in the previous section can be transformed to:
$$
SRC exists sigma_{SRC.id = TMP.id} TMP
$$
and
$$
SRC LOJ sigma_{SRC.id = TMP.id} TMP
$$
Other rules to remove correlation can be formally represented as:
$R A^{otimes} E= R {otimes}_{true} E$, if no parameters in E
resolved from R
(1)
$R A^{otimes} (sigma_pE) = R {otimes}_p E$, if no parameters in E
resolved from R
(2)
$R A^times (sigma_pE)=sigma_p(R A^times E)$ (3)
$R A^times (pi_vE) = pi_{vbigcupmathrm{cols}(R)}(R A^times E)$ (4)
$R A^times (E_1 bigcup E_2) = (R A^times E_1) bigcup (R A^times E_2)$ (5)
$R A^times (E_1 – E_2) = (R A^times E_1) – (R A^times E_2)$ (6)
$R A^times (E_1 times E_2) = (R A^times E_1) Join_{R.key} (R A^times E_2)$ (7)
$R A^times (mathcal{G}{A,F}E) = mathcal{G}{Abigcup mathrm{attr}(R),F} (R A^{times} E)$ (8)
$R A^times (mathcal{G}^1_FE) = mathcal{G}_{Abigcup mathrm{attr}(R),F’} (R A^{LOJ} E)$ (9)
Based on the above rules, the correlation among all the SQL subqueries can be removed [^3]. But the (5), (6), and (7) rules are seldom used because the the query cost is increased as a result of the rules about common expression.
Take the following SQL statement as an example:
SELECT C_CUSTKEY
FROM CUSTOMER WHERE 1000000 <
(SELECT SUM(O_TOTALPRICE)
FROM ORDER WHERE O_CUSTKEY = C_CUSTKEY)
The two “CUSTKEY”s are the primary keys. When the statement is transformed to Apply
, it is represented as:
$$
sigma_{1000000<X}(CUSTOMER A^times mathcal{G}^1_{X=SUM(O_PRICE)}(sigma_{O_CUSTKEY=C_CUSTKEY}ORDERS))
$$
Because of the primary keys, according to rule (9), it can be transformed to the following:
$$
sigma_{1000000<X} mathcal{G}{C_CUSTKEY,X = SUM(O_PRICE)}(CUSTOMER A^{LOJ} sigma{O_CUSTKEY=C_CUSTKEY}ORDERS)
$$
Note:
- If there are no primary keys in
ORDERS
, the $pi$ operator should be added to allocate a unique key. - Pay attention to the difference between rule (8) and rule (9). For the $mathcal{G}^1_F$ aggregation function without the aggregation column, when the input is NULL, the output should be the default value of the
F
aggregation function. Therefore, theLeftOuterJoin
should be used and a NULL record should be the output when the right table is NULL. In this case, based on rule (2),Apply
can be completely removed. The statement can be transformed to a SQL statement with join:
$$
sigma_{1000000<X}mathcal{G}{C_CUSTKEY,X=SUM(O_PRICE)}(CUSTOMER LOJ{O_CUSTKEY=C_CUSTKEY}ORDERS)
$$
Furthermore, based on the simplification of OuterJoin
, the statement can be simplified to:
$$
sigma_{1000000<X}mathcal{G}{C_CUSTKEY,X=SUM(O_PRICE)}(CUSTOMER Join{O_CUSTKEY=C_CUSTKEY}ORDERS)
$$
Theoretically, the above 9 rules have solved the correlation removal problem. But is correlation removal the best solution for all the scenarios? The answer is no. If the results of the SQL statement are small and the subquery can use the index, then the best solution is to use correlated execution. The Apply
operator can be optimized to Segment Apply
, which is to sort the data of the outer table according to the correlated key. In this case, the keys that are within one group won’t have to be executed multiple times. Of course, this is strongly related to the number of distinct values (NDV) of the correlated keys in the outer table. Therefore, the decision about whether to use correlation removal also depends on statistics. When it comes to this point, the regular optimizer is no longer applicable. Only the optimizer with the Volcano or Cascade Style can take both the logic equivalence rules and the cost-based optimization into consideration. Therefore, a perfect solution for subquery depends on an excellent optimizer framework.
Aggregation and subquery
In the previous section, the final statement is not completely optimized. The aggregation function above OuterJoin
and InnerJoin
can be pushed down[^4]. If OutJoin
cannot be simplified, the formal representation of the push-down rule is:
$$
mathcal{G_{A,F}}(S LOJ_p R)=pi_C(S LOJ_p(mathcal{G}_{A-attr(S),F}R))
$$
The $pi_C$ above Join
is to convert NULL to the default value when the aggregation function accepts empty values. It is worth mentioning that the above formula can be applied only when the following three conditions are met:
- All the columns that are related to
R
within thep
predicate are in theGroup by
column. - The key of the
S
relation is in theGroup by
column. - The aggregations in the $mathcal{G}$ function only uses the columns in
R
.
It is very common to use aggregation functions together with subqueries. The general solution is to use the formal representation of Apply
, and remove the correlation based on the rules, then apply the push-down rules of the aggregation function for further optimization.
References
[^1]: C. Galindo-Legaria and M. Joshi. “Orthogonal optimization of subqueries and aggregation”. In: Proc. of the ACM SIGMOD Conf. on Management of Data (2001), pp. 571–581.
[^2]: D. Maier, Q. Wang and L. Shapiro. Algebraic unnesting of nested object queries. Tech. rep. CSE-99-013. Oregon Graduate Institute, 1999.
[^3]: C. A. Galindo-Legaria. Parameterized queries and nesting equivalences. Tech. rep. MSR-TR-2000-31. Microsoft, 2001.
[^4]: W. Yan and P.-A. Larson. “Eager aggregation and lazy aggregation”. In: Proc. Int. Conf. on Very Large Data Bases (VLDB) (1995), pp. 345–357.
Experience modern data infrastructure firsthand.
TiDB Cloud Dedicated
A fully-managed cloud DBaaS for predictable workloads
TiDB Cloud Serverless
A fully-managed cloud DBaaS for auto-scaling workloads