
Managing schema changes in traditional databases often leads to downtime, blocking, and operational complexity. TiDB has long simplified this process with its online DDL capabilities, allowing developers to evolve their databases without disrupting applications.
As user bases and data volumes surged, however, index creation increasingly became a performance bottleneck. To address this, we first achieved a 10x speed improvement in index building. Then, by migrating index creation to a distributed framework, we further boosted indexing efficiency for large tables.
Today, with the rise of SaaS applications and the challenge of managing millions of tables within a single TiDB cluster, our DDL framework faces even greater demands. We’ve made scalability and DDL execution efficiency a top priority, ensuring that TiDB can handle massive DDL operations with high throughput while maintaining rock-solid stability under heavy concurrency and load. The following results showcased in this blog highlight these key advancements.
TiDB 8.2: 5x Average DDL QPS Increase
After extensive performance testing, TiDB 8.2 DDL shows a 5x QPS improvement over TiDB 8.1, from 7 to 38 (peaking at 80), demonstrating successful DDL optimization.

TiDB 8.3: Another 4x Average DDL QPS Increase
TiDB 8.3 delivered another significant DDL performance improvement: roughly 180 average and 200 peak QPS, more than 4x the TiDB 8.2 figures, demonstrating substantial DDL optimization and improved stability.

TiDB 8.5: 50x Faster Table Creation Times
TiDB 8.5 significantly improves DDL performance, especially for deployments with a large number of databases and tables. Enabling the Fast Create Table optimization doubles DDL QPS and overall throughput; a brief example of turning it on follows the hardware note below. Million-table tests show TiDB 8.5 creates tables 50x faster than TiDB 7.5. Testing on the cluster below generated the following results:
| Node type | Number | Specifications |
|-----------|--------|----------------|
| PD        | 1      | 8C16G          |
| TiDB      | 3      | 16C32G         |
| TiKV      | 3      | 8C32G          |
Note: A 16-core, 32GB TiDB node is used here due to the memory requirements of statistics during testing. Memory usage for statistics is being actively reduced. Adjust table counts based on your own node specifications.
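For reference, the Fast Create Table optimization mentioned above is toggled through a system variable. The snippet below is a minimal sketch of how a benchmark client might switch it on before a bulk table-creation run, assuming the tidb_enable_fast_create_table variable available in recent TiDB 8.x releases and the standard go-sql-driver/mysql driver; adapt the connection details to your own cluster.

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Connect to a TiDB node; the address and credentials are placeholders.
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Switch on the accelerated table-creation path before a bulk run.
    if _, err := db.Exec("SET GLOBAL tidb_enable_fast_create_table = ON"); err != nil {
        log.Fatal(err)
    }
    log.Println("fast create table enabled")
}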

TiDB DDL: An Overview of Underlying Principles
Understanding TiDB DDL execution is crucial before exploring optimization strategies. TiDB, a distributed SQL database with online DDL capabilities, allows schema modifications without disrupting transactions. This section outlines the online DDL process, from SQL parsing and job creation to background execution, setting the stage for discussing optimizations.
DDL Statement Task Running Process

Figure 1. A diagram depicting TiDB DDL execution flow.
TiDB processes a CREATE TABLE statement by parsing the SQL, creating a DDL task, scheduling and executing it with Job Workers (potentially in parallel for reorg tasks), tracking progress, and returning the result.
Introduction to Online Schema Changes

Figure 2. A diagram representing how online schema changes work in TiDB.
Job Workers execute TiDB’s online schema changes, applying one schema-state transition at a time. They also notify the Placement Driver (PD), which coordinates updates across all TiDB nodes (using etcd) and requires nodes to acquire Metadata Locks (MDLs). Finally, Job Workers ensure all nodes are synchronized before proceeding to the next step, maintaining the invariant that at most two adjacent schema states coexist in the cluster and preventing disruption to ongoing transactions.
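To make that invariant concrete, here is a minimal sketch of the intermediate states an online "add index" change walks through, advancing one state per step and waiting for all nodes to acknowledge before moving on. The names follow the classic online schema change model and are illustrative, not TiDB’s actual internals.

// schemaState models the intermediate states of an online "add index" change.
type schemaState int

const (
    stateNone schemaState = iota
    stateDeleteOnly
    stateWriteOnly
    stateWriteReorganization
    statePublic
)

// advance moves the schema forward a single state. waitAllNodesSynced stands in
// for the MDL/etcd-based synchronization described above: it blocks until every
// TiDB node has loaded the new schema version, so at any moment the cluster sees
// at most two adjacent states.
func advance(current schemaState, waitAllNodesSynced func(schemaState) error) (schemaState, error) {
    if current == statePublic {
        return current, nil // change already complete
    }
    next := current + 1
    if err := waitAllNodesSynced(next); err != nil {
        return current, err
    }
    return next, nil
}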
An Exploration of Engineering Best Practices

Figure 3. TiDB DDL milestones.
Guided by customer needs, an iterative approach, and a principle of minimal impact, TiDB’s DDL optimization roadmap breaks down complex tasks into independent, deliverable sub-tasks, enabling continuous improvement and rapid delivery of targeted solutions.
Optimization Approaches
Faced with large-scale table creation bottlenecks, TiDB strategically optimized DDL operations. It focused on “Quick Table Creation” through targeted improvements, continuous iteration (reducing million-table creation time from over 4 hours to 1.5-2 hours), code refactoring, and a future distributed DDL framework for greater performance.
The following table illustrates the optimization achieved in TiDB 8.1.

TiDB’s Optimization Journey: An In-Depth Look
Following TiDB 8.1’s table creation improvements, TiDB 8.2 and later versions focused on optimizing general DDL execution for large-scale deployments, targeting increased throughput, stability, and efficiency.
Identifying Performance Bottlenecks
Analysis of TiDB’s DDL execution, especially in large-scale, multi-tenant environments with millions of tables, revealed performance bottlenecks stemming from rapid iteration and the need for a more scalable model. These key bottlenecks were:
- Inefficient DDL Task Scheduling: DDL tasks processed sequentially, incurring unnecessary scheduling overhead.
- Slow Database/Table Existence Checks: Schema validation sometimes relied on slower fallback mechanisms.
- Underutilized Computing Resources: TiDB nodes were not fully leveraged for concurrent execution.
- Inefficient Broadcasting Mechanisms: Schema changes propagated across nodes inefficiently, causing delays.
Systematic improvements to these areas transformed TiDB’s DDL execution into a highly efficient, distributed process.
Optimizations Unpacked
TiDB’s initial scheduling strategy, processing granular state machine steps for each DDL statement, resulted in excessive scheduling overhead. To address this, several key optimizations were implemented:
- Scheduling Granularity Adjustment: Instead of individual steps, entire DDL tasks are now treated as single units, significantly reducing overhead and improving efficiency.
- Concurrency Enhancement: Independent DDL tasks now execute in parallel, maximizing resource utilization and shortening overall execution time.
- Execution Resource Expansion: The worker pool dedicated to general DDL tasks has been expanded, enabling simultaneous execution of multiple tasks and dramatically increasing throughput.
- Scheduling Logic Simplification: Optimized algorithms have streamlined the scheduling process, further enhancing efficiency.
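To make the new scheduling shape concrete, here is a minimal sketch of whole DDL jobs being dispatched to a pool of general workers. The names and structure are illustrative assumptions for this post, not TiDB’s internal implementation.

package ddlsketch

import (
    "log"
    "sync"
)

// ddlJob stands in for a complete DDL task treated as a single scheduling unit.
type ddlJob struct {
    ID  int64
    SQL string
}

// runJobs drains the job channel with a fixed pool of workers. Each worker owns
// a job end to end, so scheduling overhead is paid once per job rather than once
// per state-machine step, and independent jobs run in parallel.
func runJobs(jobs <-chan *ddlJob, workerCount int, runJob func(*ddlJob) error) {
    var wg sync.WaitGroup
    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                if err := runJob(job); err != nil {
                    log.Printf("ddl job %d failed: %v", job.ID, err)
                }
            }
        }()
    }
    wg.Wait()
}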
Optimizing Database and Table Existence Checks
Efficient table existence checks are crucial for TiDB’s DDL performance. Previously, TiDB used an in-memory schema cache with a fallback to TiKV, which caused delays under high concurrency. To optimize this, the TiKV fallback was removed, so checks rely solely on the schema cache. This is safe because the schema cache is reliably synchronized on the DDL owner node, pre-computed job dependencies prevent concurrent creation of the same table, sequential job execution ensures correct schema updates, and the schema is reloaded whenever a node becomes the DDL owner.
func checkTableNotExists(d *ddlCtx, t *meta.Meta, schemaID int64, tableName string) error {
    // Try to use the in-memory schema info to check first.
    currVer, err := t.GetSchemaVersion()
    if err != nil {
        return err
    }
    is := d.infoCache.GetLatest()
    if is.SchemaMetaVersion() == currVer {
        return checkTableNotExistsFromInfoSchema(is, schemaID, tableName)
    }
    // Otherwise, fall back to reading schema metadata from the store (TiKV).
    return checkTableNotExistsFromStore(t, schemaID, tableName)
}
By removing the TiKV lookup, execution speed and efficiency significantly improved, especially in large-scale deployments. This optimization enhances DDL scalability while maintaining consistency and correctness. Future enhancements include indexing for table/schema names and fault tolerance mechanisms for the schema cache.
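After the change, the function can answer from the in-memory schema alone. Below is a minimal sketch of the shape the check takes once the fallback is gone; it is simplified, and the actual TiDB implementation handles more details.

func checkTableNotExists(d *ddlCtx, schemaID int64, tableName string) error {
    // With the TiKV fallback removed, the latest in-memory schema cache on the
    // DDL owner is treated as authoritative for existence checks.
    is := d.infoCache.GetLatest()
    return checkTableNotExistsFromInfoSchema(is, schemaID, tableName)
}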
Improving Utilization of Computing Resources
TiDB’s schema synchronization mechanism initially relied on timed polling, where the system repeatedly checked for schema updates at fixed intervals. This approach introduced inefficiencies—frequent schema changes or slow schema loading led to excessive invalid checks, slowing down the system.
To resolve this, we adopted etcd’s Watch mechanism. Instead of polling periodically, TiDB now listens for changes in etcd in real time: when a schema version changes, etcd notifies TiDB immediately, allowing on-demand synchronization (a minimal sketch follows the list below). This enhancement:
- Eliminates unnecessary polling, reducing system overhead.
- Improves response time, ensuring faster propagation of schema changes.
- Enhances resource efficiency, allowing TiDB to focus computing power on execution rather than repeated checks.
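The sketch below shows the watch-based approach using the etcd clientv3 API. The endpoint and key name are placeholders for illustration; TiDB’s actual key and handling differ.

package main

import (
    "context"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"}, // PD/etcd endpoint (placeholder)
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // Watch the key that carries the global schema version (placeholder name).
    // Each time the DDL owner bumps the version, the event arrives immediately,
    // so followers reload the schema on demand instead of polling on a timer.
    watchCh := cli.Watch(context.Background(), "/tidb/ddl/global_schema_version")
    for resp := range watchCh {
        for _, ev := range resp.Events {
            log.Printf("schema version changed: %s", ev.Kv.Value)
            // A schema reload would be triggered here.
        }
    }
}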
Replacing the Broadcast Mechanism for DDL Completion
Optimizing TiDB’s DDL completion notification mechanism required replacing inefficient broadcast notifications with directional notifications. This ensures only the responsible thread processes a DDL completion event, preventing redundant processing, speeding up completion, and enhancing throughput for high-volume schema changes.
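Conceptually, the directional scheme lets each waiting session register interest in exactly one job, so a completion event wakes only that session. Here is a hedged sketch of the idea, with illustrative names rather than TiDB’s internals.

package ddlsketch

import "sync"

// notifier implements a directional completion signal: each waiting session
// registers a channel keyed by its DDL job ID, and only that channel is woken
// when the job finishes, instead of broadcasting to every waiter.
type notifier struct {
    mu      sync.Mutex
    waiters map[int64]chan struct{} // job ID -> completion channel
}

func newNotifier() *notifier {
    return &notifier{waiters: make(map[int64]chan struct{})}
}

// register is called by the session that issued the DDL before it starts waiting.
func (n *notifier) register(jobID int64) <-chan struct{} {
    n.mu.Lock()
    defer n.mu.Unlock()
    ch := make(chan struct{}, 1)
    n.waiters[jobID] = ch
    return ch
}

// notifyDone is called once when a job completes; it wakes only the owner.
func (n *notifier) notifyDone(jobID int64) {
    n.mu.Lock()
    defer n.mu.Unlock()
    if ch, ok := n.waiters[jobID]; ok {
        ch <- struct{}{}
        delete(n.waiters, jobID)
    }
}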
Refactoring the TiDB DDL Framework for Future Scalability
To ensure TiDB’s DDL framework remains adaptable and efficient, a comprehensive refactoring initiative was undertaken. The previous framework, burdened by technical debt, suffered from an aging design, poor code maintainability, insufficient testing, and slow iteration.
This refactoring focused on improving code quality through modularity and loose coupling, enhancing testing coverage, and optimizing the architecture for better scalability and fault tolerance. An incremental approach, with small validated changes, continuous integration, and code reviews, minimized risk.
The refactored framework delivers greater stability, higher development efficiency, and improved scalability. This, combined with previous optimizations in scheduling, concurrency, and schema validation, significantly enhances TiDB’s DDL performance, ensuring its ability to handle evolving demands and future growth.
By continuously refining TiDB’s DDL execution, we’re laying the groundwork for the next generation of distributed schema management.
Conclusion
Refactoring the TiDB DDL framework was a complex undertaking. However, the resulting benefits have been substantial. Improved stability, efficiency, and scalability provide a solid foundation for future growth.
If you have any questions about TiDB’s DDL execution capabilities, please feel free to connect with us on Twitter, LinkedIn, or through our Slack Channel.