HTAP Summit 2024 session replays are now live!Access Session Replays
tidb_feature_1800x600 3

In today’s data-driven world, efficiently handling complex queries on large datasets is critical to keeping applications responsive and performant. For TiDB, a distributed SQL database designed to manage high-scale, high-demand environments, optimizing data access paths is essential to delivering smooth, efficient queries. This post dives into TiDB’s advanced approach to optimizing multi-column indexes. This is a powerful feature that significantly enhances the speed and accuracy of data retrieval across distributed clusters.

TiDB’s query optimizer leverages multi-column indexes to intelligently filter data, handling complex query conditions that traditional databases like MySQL can’t process as effectively. Here, we’ll walk through how multi-column indexes function, why they’re crucial, and how TiDB’s optimization transforms intricate query conditions into efficient access paths. The impact? Faster responses, minimized table scans, and streamlined performance—even at massive scale.

Without these optimizations, query performance in large TiDB databases can degrade quickly. Full table scans and inadequate filtering can turn milliseconds into minutes. Additionally, excessive memory use can lead to out-of-memory (OOM) errors, especially in constrained environments. TiDB’s targeted approach ensures only relevant data is accessed. This keeps latency low and memory usage efficient—even for the most complex queries. Let’s explore how TiDB makes it happen.

Background: Multi-Column Indexes

Indexes are a powerful tool for improving query performance by avoiding the need to scan all rows in a table. Let’s take an example of a rental listings table defined as follows. In this example, each listing contains a unique ID, city, number of bedrooms, rent price, and availability date:

CREATE TABLE listings (
    listing_id INT PRIMARY KEY AUTO_INCREMENT,
    city VARCHAR(100) NOT NULL,
    bedrooms INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    availability_date DATE NOT NULL
);

Suppose this table has 20 million listings across the United States. If we want to find all listings with a price under $2,000, we could add an index on the price column. This index would allow the optimizer to filter out rows, scanning only the range [ -inf, 2000.00 ). This helps reduce the search to about 14 million rows (assuming 70% of rentals are priced above $2,000). In the query execution plan, TiDB would perform an index range scan on price. This limits the need for a full table scan and improves efficiency.

"Q1" 
explain FORMAT=BRIEF
    SELECT * FROM listings WHERE price < 2000;
-----+------------------------------------------------------+--------------------------
| id  task                | access object                        | operator info   
+-----------------------------+---------+-----------+-----------------------------------
| IndexLookUp             | root                                 | 
| ├─IndexRangeScan(Build) | table:listings,index:price_idx(price)| range:[-inf,2000.00) 
| └─TableRowIDScan(Probe) | table:listings                       | 
+-----------------------------+---------+-----------+-----------------------------------

While this filter improves performance, it may still return a large number of rows. This isn’t ideal for a user looking for more specific listings. Adding filters, such as specifying the city, number of bedrooms, and a maximum price, narrows the results significantly. For example, a query to find two-bedroom listings in San Francisco under $2,000 is more useful, likely returning only a few dozen rows.

To optimize this query, we can create a multi-column index on city, bedrooms, and price:

CREATE INDEX idx_city_bedrooms_price ON listings (city, bedrooms, price);

Multi-column indexes in SQL are lexicographically ordered, meaning they are sorted first by city, then by bedrooms within each city, and finally by price within each city-bedroom combination. This ordering allows TiDB to efficiently access rows based on each condition:

  1. Filtering by city as the primary filter.
  2. Optionally filtering by bedrooms within that city.
  3. Optionally filtering by price within the city-bedroom grouping.

Sample Data

Let’s look at a sample dataset to illustrate how multi-column indexing refines search results:

CityBedroomsPrice
San Diego11000
San Diego11500
San Diego21000
San Diego22500
San Diego31000
San Diego32500
San Francisco11000
San Francisco11500
San Francisco21000
San Francisco21500
San Francisco32500
San Francisco33000

Optimized Query and Result

Using the multi-column index, TiDB can efficiently narrow the range to find listings in San Francisco with two bedrooms and a price under $2,000:

"Q2"

EXPLAIN FORMAT=BRIEF 
    SELECT * FROM listings 
    WHERE city = 'San Francisco' AND bedrooms = 2 AND price < 2000;
"Q2" plan. 
explain format=brief 
    select * from listings 
    where city = 'San Francisco' and bedrooms = 2 and price < 2000;
-----+------------------------------------------------------+--------------------------
| id  task                | access object                | operator info   
+-----------------------------+---------+-----------+-----------------------------------
| IndexLookUp             | root                         | 
| ├─IndexRangeScan(Build) | table:listings,              | range:
                          | index:idx_city_bedrooms_price| ["San Francisco" 2 -inf,
                          | (city, bedrooms, price)      | "San Francisco" 2 2000.00)
| └─TableRowIDScan(Probe) | table:listings               | 
+-----------------------------+---------+-----------+-----------------------------------

This query will return the following filtered results from the sample data:

CityBedroomsPrice
San Francisco21000
San Francisco21500

By using a multi-column index, TiDB avoids unnecessary row scanning and significantly boosts query performance.

Index Range Derivation

The TiDB optimizer includes a powerful range derivation component. It’s designed to take a query’s conditions and relevant index columns and generate efficient index ranges for table access. This derived range then feeds into TiDB’s table access component, which determines the most resource-efficient way to access the table.

For each table in a query, the table access component evaluates all applicable indexes to identify the optimal access method—whether through a full table scan or an index scan. It calculates the range for each relevant index, assesses the access cost, and selects the path with the lowest cost. This process combines range derivation with a cost assessment subsystem to find the most efficient way to retrieve data, balancing performance and resource usage.

The diagram below illustrates how the range derivation and cost assessment work together within TiDB’s table access logic to achieve optimal data retrieval.

How range derivation and cost assessment work together within TiDB’s table access logic to achieve optimal data retrieval in multi-column indexes.
Table Access Path Selection

Multi-column filters are often more complex than the basic examples discussed earlier. They may include AND conditions, OR conditions, or a combination of both. TiDB’s range derivation subsystem is designed to handle these cases efficiently, generating the most selective (and therefore, most effective) index ranges.

In general, the subsystem applies a UNION operation for ranges generated from OR conditions and an INTERSECT operation for ranges derived from AND conditions. This approach ensures that TiDB can filter data as precisely as possible, even with complex filtering logic.

Disjunctive Conditions (OR Conditions) in Multi-Column Indexes

When there are OR conditions in a query (known as “disjunctive predicates”), the optimizer handles each condition separately, creating a range for each part of the OR condition. If any of these ranges overlap, the optimizer merges them into one continuous range. If they don’t overlap, they remain as separate ranges, both of which can still be used for an index scan.

Example 1: Overlapping Ranges

Consider a query that looks for listings in New York with two bedrooms, where the price falls into one of two overlapping ranges:

  • Price between $1,000 and $2,000
  • Price between $1,500 and $2,500

In this case, the two ranges overlap, so the optimizer combines them into a single range from $1,000 to $2,500. Here’s the query and its execution plan:

"Q3"
EXPLAIN FORMAT=BRIEF 
    SELECT * FROM listings 
    WHERE (city = 'New York' AND bedrooms = 2 AND price >= 1000 AND price < 2000) 
       OR (city = 'New York' AND bedrooms = 2 AND price >= 1500 AND price < 2500);
 -----+------------------------------------------------------+--------------------------
| id  task                | access object                | operator info   
+-----------------------------+---------+-----------+-----------------------------------
| IndexLookUp             | root                         | 
| ├─IndexRangeScan(Build) | table:listings,              | range:
                          | index:idx_city_bedrooms_price| ["New York" 2 1000.00,
                          | (city, bedrooms, price)      |  "New York" 2 2500.00)
| └─TableRowIDScan(Probe) | table:listings               | 
+-----------------------------+---------+-----------+-----------------------------------

Example 2: Non-Overlapping Ranges

In a different scenario, imagine a query that looks for affordable single-bedroom listings in either San Francisco or San Diego. Here, the OR condition specifies two distinct ranges for different cities:

  • Listings in San Francisco, 1 bedroom, priced between $1,500 and $2,500
  • Listings in San Diego, 1 bedroom, priced between $1,000 and $1,500

Since there’s no overlap between these ranges, they remain separate in the execution plan, with each city having its own index range:

"Q4"

EXPLAIN FORMAT=BRIEF 
    SELECT * FROM listings 
    WHERE
        (city = 'San Francisco' AND bedrooms = 1 AND price >= 1500 AND price < 2500) 
     OR (city = 'San Diego' AND bedrooms = 1 AND price >= 1000 AND price < 1500);
 -----+------------------------------------------------------+--------------------------
| id  task                | access object                | operator info   
+-----------------------------+---------+-----------+-----------------------------------
| IndexLookUp             | root                         | 
| ├─IndexRangeScan(Build) | table:listings,              | range:
                          | index:idx_city_bedrooms_price| ["San Francisco" 1 1500.00,
                          | (city, bedrooms, price)      |  "San Francisco" 1 2500.00) 
                          |                              | ["San Diego" 1 1000.00,
                          |                              |   "San Diego" 1 1500.00)
| └─TableRowIDScan(Probe) | table:listings               | 
+-----------------------------+---------+-----------+-----------------------------------

By creating either merged or distinct ranges based on overlap, the optimizer can efficiently use indexes for OR conditions, avoiding unnecessary scans and improving query performance.

Conjunctive Conditions (AND Conditions) in Multi-Column Indexes

For queries with AND conditions (also known as conjunctive conditions), TiDB’s optimizer creates a range for each condition. It then finds the overlap (intersection) of these ranges to get a precise result for index access. If each condition has only one range, this is straightforward, but it becomes more complex if any condition contains multiple ranges. In such cases, TiDB combines these ranges to produce the most selective, efficient result.

Example Table Setup

Let’s consider a table t1 defined as follows:

CREATE TABLE t1 (
    a1 INT,
    b1 INT,
    c1 INT, 
    KEY iab (a1,b1)
);

Suppose we have a query with the following conditions:

  • (a1, b1) > (1, 10) AND (a1, b1) < (10, 20)

This query involves comparing multiple columns and requires two steps to process:

Step 1: Expression Translation:

TiDB’s optimizer breaks down these complex conditions into simpler parts.

  • (a1, b1) > (1, 10) translates to (a1 > 1) OR (a1 = 1 AND b1 > 10), meaning it includes all cases where a1 is greater than 1 or where a1 is exactly 1 and b1 is greater than 10.
  • (a1, b1) < (10, 20) translates to (a1 < 10) OR (a1 = 10 AND b1 < 20), covering cases where a1 is less than 10 or where a1 is exactly 10 and b1 is less than 20.

These expressions are then combined using AND:

((a1 > 1) OR (a1 = 1 AND b1 > 10)) AND ((a1 < 10) OR (a1 = 10 AND b1 < 20))

Step 2: Range Derivation and Combination:

After breaking down the conditions, TiDB’s optimizer calculates ranges for each part and combines them. For this example, it derives:

  • For (a1, b1) > (1, 10): it creates ranges like (1, +inf] for cases where a1 > 1 and (1, 10, 1, +inf] for cases where a1 = 1 and b1 > 10.
  • For (a1, b1) < (10, 20): it creates ranges [-inf, 10) for cases where a1 < 10 and [10, -inf, 10, 20) for cases where a1 = 10 and b1 < 20.

The final result combines these to get a refined range: (1, 10, 1, +inf] UNION (1, 10) UNION [10, -inf, 10, 20).

Query Plan Example

Here’s the query plan showing the derived ranges:

"Q5"
EXPLAIN FORMAT=BRIEF 
    SELECT * FROM t1 
    WHERE (a1, b1) > (1, 10) AND (a1, b1) < (10, 20);
 -----+------------------------------------------------------+--------------------------
| id  task                | access object                | operator info   
+-----------------------------+---------+-----------+-----------------------------------
| IndexLookUp             | root                       | 
| ├─IndexRangeScan(Build) | table:t1,index:iab(a1, b1) | range:(1 10,1 +inf],
                          |                            |       (1,10)
                          |                            |       [10 -inf,10 20)
| └─TableRowIDScan(Probe) | table:t1                   | 
+-----------------------------+---------+-----------+-----------------------------------

In this example, the table has about 500 million rows. However, this optimization allows TiDB to narrow down the access to only around 4,000 rows—just 0.0008% of the total data. This refinement drastically reduces query latency to a few milliseconds, as opposed to over two minutes without optimization.

Unlike MySQL, which would require a full table scan for such conditions, TiDB’s optimizer can handle complex row expressions efficiently by leveraging these derived ranges.

Conclusion

TiDB’s optimizer uses multi-column indexes and advanced range derivation to significantly lower data access costs for complex SQL queries. By effectively managing both conjunctive (AND) and disjunctive (OR) conditions, TiDB converts row-based expressions into optimal access paths, reducing query times and enhancing performance. Unlike MySQL, TiDB supports union and intersection operations on multi-column indexes, allowing efficient processing of intricate filters. In practical use, this optimization enables TiDB to complete queries in just a few milliseconds—compared to over two minutes without it—demonstrating a substantial reduction in latency.

Check out our comparision white paper to discover even more differences between MySQL and TiDB’s architecture—and why this matters for scalability, reliability, and hybrid transactional and analytical workloads.


Download Now


Experience modern data infrastructure firsthand.

Try TiDB Serverless

Have questions? Let us know how we can help.

Contact Us

TiDB Cloud Dedicated

A fully-managed cloud DBaaS for predictable workloads

TiDB Cloud Serverless

A fully-managed cloud DBaaS for auto-scaling workloads