Runaway Query Management: Enhancing Stability in TiDB

Maintaining database stability is paramount for any business, as unexpected dips in performance for critical systems can cause substantial losses. By streamlining change testing processes or employing new technologies, you can limit unforeseen incidents within a certain range. However, sudden SQL performance issues, such as drastic data volume changes, increasingly complicated queries, and partially validated SQLs, may still take you by surprise occasionally.

These incidents can have severe consequences for latency-sensitive applications that are hard to mitigate. As a solution, TiDB 7.2 introduced runaway query management to manage unexpected queries and address these issues systematically.

What are Runaway Queries?

Runaway queries are queries that exceed expected execution times or resource usage. Runaway query management provides an efficient, controllable, and automated resource management mechanism. By mitigating the adverse effects of unexpected SQL performance issues it enhances TiDB’s stability in complex workload scenarios.

What are the Use Cases of Runaway Query Management?

Runaway query management is useful in situations where:

It’s essential for automatically identifying and addressing abnormal SQL performance issues to preserve the service quality of crucial systems
Sudden SQL performance issues may occur with no immediate and effective fixes available
It’s necessary to blacklist or rate-limit specific SQLs known to have security or performance concerns

How Does Runaway Query Management Work?

Runaway query management’s magic lies in two essential capabilities: identifying and handling queries.

Identifying Runaway Queries

TiDB’s resource control module provides two identification methods: Dynamic Identification and Static Identification.

Dynamic Identification

Dynamic Identification identifies runaway queries automatically based on real-time performance metrics defined in resource groups. Currently, this method uses the EXEC_ELAPSED setting to determine the actual execution time of the SQL commands. If a query takes longer than the limit specified by EXEC_ELAPSED, it is identified as a runaway query.

In the following SQL command, any query in the “default” resource group that takes over 5 seconds to execute will be marked as a runaway query.


ALTER RESOURCE GROUP default 
QUERY_LIMIT=(EXEC_ELAPSED='5s', ACTION=KILL);

Once there is an identified runaway query, you can configure WATCH rules in the resource group to watch for SQL traits of the identified query within a specified time frame. This approach allows for direct recognition of runaway queries based on SQL traits, eliminating the wait for rule-based identification.

In the following example, we added a WATCH rule to monitor queries with similar traits to the identified runaway query within a ten-minute time frame. Therefore, in the next ten minutes, the system will identify runaway queries directly without executing the five-second rule. After the specified duration, if the query performance becomes normal, the watch rule expires.


ALTER RESOURCE GROUP default 
QUERY_LIMIT=(EXEC_ELAPSED='5s', ACTION=KILL, WATCH=SIMILAR DURATION='10m');

Static Identification

Automated rules cannot flag all problematic queries precisely, so we introduced a manual approach – Static Identification – that recognizes queries by distinct SQL traits. Administrators can use the QUERY WATCH command to define rules for identifying and handling specific SQL traits. This essentially creates a blacklist for database queries. The static identification methods include:

SQL Text: This involves exact matches based on the SQL text itself.
SQL Digest: This method matches queries with the same SQL digest, identifying queries with similar structures but minor differences. For example, select c from t1 where a=1 and select c from t1 where a=2 have the same digest values.
Plan Digest: This method matches queries that share identical execution plans, targeting specific plans that are often behind performance issues.

You can collect SQL traits through slow queries. For example:


SELECT count(1)
FROM  sbtest.sbtest1 AS S
,sbtest.sbtest2 AS S2        
,sbtest.sbtest3 AS S3  WHERE S1.c=S2.c AND S1.c=S3.c;
# Time: 2023-09-19T17:16:56.640436+08:00
...
# Digest: d3c7846bb8f6b817ae395db30eadedec57af08f7983466f68db93d9ce1ac5872
...
# Plan_digest: 41fee801f07e06aa4aba4c0142ce4c624e8dc932c9e14d49854b8ce57366b443

You can select one of the recognition methods based on your experience. The following example below uses the SQL DIGEST clause to add similar queries to the monitoring queue.


mysql> QUERY WATCH ADD ACTION KILL SQL DIGEST 'd3c7846bb8f6b817ae395db30eadedec57af08f7983466f68db93d9ce1ac5872';
Query OK, 0 rows affected (0.01 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.RUNAWAY_WATCHES ORDER BY id\G
*************************** 1. row ***************************
                ID: 54
RESOURCE_GROUP_NAME: default
        START_TIME: 2023-09-20 01:59:14
          END_TIME: UNLIMITED
              WATCH: Similar
        WATCH_TEXT: d3c7846bb8f6b817ae395db30eadedec57af08f7983466f68db93d9ce1ac5872
            SOURCE: manual
            ACTION: Kill
1 row in set (0.04 sec)

Handling Runaway Queries

Handling refers to the methods to manage identified runaway queries. The supported methods include:

DRYRUN: Identifies issues without any action and displays the findings in logs and views. It’s useful for initial testing to detect false positives.
COOLDOWN: Reduces the priority of the identified query within the resource group, slowing its processing.
KILL: Terminates the identified query to protect database performance.

In TiDB 7.5, COOLDOWN in complex situations is limited. We recommend using KILL to maintain high service quality.

In the following sample scenario, runaway queries are automatically terminated upon detection.


ALTER RESOURCE GROUP default QUERY_LIMIT=(EXEC_ELAPSED='5s', ACTION=KILL, WATCH=SIMILAR DURATION='10m');

Logging and Observability

TiDB provides a series of system tables to log and access all previously mentioned settings and historical data related to the identification and management of issues:

INFORMATION_SCHEMA.RESOURCE_GROUPS: This table outlines the definitions of resource groups, including the rules for runaway query identification and their corresponding handling settings.
INFORMATION_SCHEMA.RUNAWAY_WATCHES: This table lists the rules established in the monitoring queue.
MYSQL.TIDB_RUNAWAY_QUERIES: This table maintains a log of the historical instances of runaway queries detected and addressed.

Runaway Query Management in Action

Let’s look at how Runaway Query Management works in the following simulated scenarios:

Phase	Workload	Resource Group Setting	QPS	P99
1	Normal	N/A	11K	50ms
2	Normal + Continuous highly-consuming query	N/A	3K	200ms
3	Normal + Continuous highly-consuming query	QUERY_LIMIT=(EXEC_ELAPSED=’1s’, ACTION=KILL)	7.5K	70ms ~ 80ms
4	Normal + Continuous highly-consuming query	QUERY_LIMIT=(EXEC_ELAPSED=’1s’, ACTION=KILL, WATCH=EXACT DURATION=’5m’)	11K	50ms

Here is what happened through the phases:

As the initial state, phase 1 started with a normal load scenario. The overall QPS is nearly 11k, with a P999 latency of approximately 50ms.
As we submitted an anomalous query in phase 2 once every second for 3 to 8 seconds, there was a drastic QPS reduction from 11k to around 3k and an increase in P999 latency from 60ms to 200ms.

To address this issue, we implemented a runaway query identification rule in the default resource group to terminate queries exceeding a 1-second execution time. This adjustment improves the QPS to 7.5k, and the P999 latency decreases.


mysql> alter resource group default QUERY_LIMIT=(EXEC_ELAPSED='1s', ACTION=KILL);

To observe the status, you can query the MYSQL.TIDB_RUNAWAY_QUERIES system table. As shown below, the runaway query management system started to involve actively and consistently identifying and handling problematic SQL queries.


mysql> select * from mysql.tidb_runaway_queries limit 1 \G
*************************** 1. row ***********************
resource_group_name: default
               time: 2023-09-19 15:18:10
         match_type: identify
             action: kill
       original_sql: SELECT count(1)
  FROM  sbtest.sbtest1 AS S1
       ,sbtest.sbtest2 AS S2
       ,sbtest.sbtest3 AS S3
 WHERE S1.c=S2.c
    AND S1.c=S3.c
        plan_digest: 41fee801f07e06aa4aba4c0142ce4c624e8dc932c9e14d49854b8ce57366b443
        tidb_server: 127.0.0.1:4000


mysql> select count(*) from mysql.tidb_runaway_queries;
+----------+
| count(*) |
+----------+
|       56 |
+----------+
1 row in set (0.02 sec)

However, the QPS hadn’t recovered to its original state because, even though the queries lasting over one second were terminated, they still occupied the system for that duration.

In phase 4, we added the Watch rule in the resource group to include the texts of queries that match the runaway criteria in the monitoring list for five minutes. Consequently, queries matching the runaway criteria are instantly terminated, eliminating the one-second wait. TiDB will reassess the query performance after 5 minutes and remove restrictions If the performance is restored. At this time, the system’s QPS and P999 will revert to their initial stage levels.

 mysql> alter resource group default QUERY_LIMIT=(EXEC_ELAPSED='1s', ACTION=KILL, WATCH=EXACT DURATION='5m');

We can get the WATCH rule by querying the INFORMATION_SCHEMA.RUNAWAY_WATCHES table:


mysql> SELECT * FROM INFORMATION_SCHEMA.RUNAWAY_WATCHES ORDER BY id\G
***************** 1. row ****************
                 ID: 50
RESOURCE_GROUP_NAME: default
         START_TIME: 2023-09-19 16:58:20
           END_TIME: 2023-09-19 17:03:20
              WATCH: Exact
         WATCH_TEXT: SELECT count(1)
  FROM  sbtest.sbtest1 AS S1
       ,sbtest.sbtest2 AS S2
       ,sbtest.sbtest3 AS S3
 WHERE S1.c=S2.c
    AND S1.c=S3.c
             SOURCE: 127.0.0.1:4000
             ACTION: Kill
1 row in set (0.01 sec)

So far, you should have a basic understanding of how runaway query nanagement limits resource consumption of individual SQL queries and mitigates their impact on the overall performance.

NOTE: In the given demos above, we can address performance issues by identifying problematic queries using “slow logs” or system tables even if resource groups don’t automatically detect queries. These queries can then be manually added to a monitoring list with QUERY WATCH, effectively creating a blacklist.

Conclusion

A key benefit of TiDB’s runaway query management lies in improved user experience.

Users can effortlessly monitor and manage runaway queries within the database through automated and manual approaches, thus minimizing their interference with regular business operations.

Moving forward, we will continue to iterate this feature with more sophisticated identification rules, diverse processing strategies, and enhanced observability. In addition, we plan to introduce graphical management to elevate the user experience further, advancing TiDB as a top-tier enterprise database platform.

You can try runaway query management by downloading our latest production-ready release, TiDB 7.5. And don’t forget to join the TiDB Slack Community for interactions with other TiDB practitioners, as well as real-time discussions with our engineering teams.

Book a Demo