Transactions—especially distributed transactions—are ubiquitous. Protocols around transactions are equally ubiquitous, even if we don’t immediately realize it.
Take, for example, a common marriage ceremony. It’s essentially a two-phase commit (2PC) protocol. The officiant is the transaction coordinator (TC), and the couple getting married are the active participants. In the first phase, the TC asks each participant if they are “ok to commit”. Only when both say “ok to commit” (i.e., “I do”) can the officiant write the details and a commit timestamp into the marriage register. This officially commits the transaction (i.e., Phase Two: pronouncing the couple married).
In the database world (as well as real world), the fundamental idea underpinning any transaction is the four “ACID” properties:
- Atomicity: Either all the operations within the transaction complete or none of them complete (i.e., either the couple marry or they don’t.)
- Consistency: The system goes from one consistent state to another consistent state. In our example, the participants go from being single to being married.
- Isolation: When transactions run concurrently, their execution behaves as if they were running in some sequential order. So, in our example, if the officiant performs a double wedding, they ensure the “I do’s” don’t mix and cause confusion. This post is primarily concerned with transaction isolation.
- Durability: Once a transaction completes (commits), its changes are permanent. Back to our example, if someone wakes up the next morning and claims it was all a bad dream, then the register has the proof of commitment. To divorce, they will have to start a new transaction.
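As a concrete (if less romantic) illustration of these properties, here is a minimal SQL sketch of atomicity and durability; the `marriages` table is hypothetical:

```sql
-- Hypothetical register table for the marriage analogy.
CREATE TABLE marriages (spouse_a VARCHAR(20), spouse_b VARCHAR(20), committed_at TIMESTAMP);

-- Atomicity: all statements in the transaction take effect together, or none do.
BEGIN;
INSERT INTO marriages VALUES ('Ann', 'Ben', NOW());
ROLLBACK;  -- the insert is undone; the register shows no marriage

-- Durability: once committed, the record is permanent.
BEGIN;
INSERT INTO marriages VALUES ('Ann', 'Ben', NOW());
COMMIT;    -- survives restarts and bad dreams alike
```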
In this post, we’ll explore transactions in greater depth through the lens of TiDB, a popular open-source distributed SQL database. Specifically, we’ll uncover how TiDB supports distributed transactions in the context of a distributed system.
Exploring TiDB’s Architecture
TiDB is a highly available distributed database with a disaggregated compute and storage architecture, which enables seamless horizontal scaling: storage and compute nodes within a TiDB cluster can be resized dynamically and independently. However, this elastic, distributed structure introduces considerations that a monolithic architecture does not have.
TiDB’s high availability (HA) architecture ensures reliable, continuous operation of the database. It is designed to provide fault tolerance and resilience in the face of hardware failures or other disruptions. The key features of TiDB’s HA solution include:
- PD (Placement Driver): PD is responsible for managing the metadata and the cluster’s topology.
- TiKV: The distributed transactional key-value store used by TiDB. It employs the Raft consensus algorithm for data replication, ensuring data consistency and high availability across multiple nodes.
- Automatic failover: In the event of a node failure, the Raft consensus algorithm facilitates automatic leader election and failover. This ensures that the system continues to function seamlessly without manual intervention.
- Load balancing: TiDB incorporates load balancing mechanisms to distribute the workload evenly across the cluster. This prevents resource bottlenecks and optimizes performance.
- Monitoring and alerting: TiDB comes with built-in monitoring and alerting tools to help administrators track the health and performance of the cluster. This facilitates proactive management and a quick response to potential issues.
What’s Inside a TiDB Cluster?
A TiDB cluster comprises various services that communicate over a network. The focal point, represented by the red box in the above diagram, is the TiDB cluster service endpoint. This endpoint handles the MySQL protocol and contains the SQL parser and optimizer. It also plays the role of the Transaction Coordinator (TC).
When we refer to the TiDB cluster in this discussion, we mean all of its services working in concert. For simplicity, we’ll denote the endpoint and SQL service as TiDB.
This disaggregated approach to compute and storage not only facilitates dynamic scaling but also enhances the overall robustness of the system, making it adept at handling various failure scenarios.
You can get more in-depth details from this documentation. For transactions, the two important services are the Placement Driver (PD) and TiKV.
How Does TiDB Support Distributed Transactions?
A TiDB cluster leverages the Raft consensus protocol as a foundational element to address network-related challenges in distributed storage systems. Raft provides the robust replication guarantees that the transaction model builds on.
In a TiDB cluster, every transaction carries a start timestamp, and committed transactions carry a commit timestamp. The commit timestamp is crucial for externalizing changes, making them visible to other transactions. The generated timestamps must satisfy two key constraints: global uniqueness and monotonic increase. To prevent a single point of failure, a cluster of PDs (Placement Drivers, the cluster’s metadata servers) handles timestamp generation, among other responsibilities.
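As a quick illustration, you can observe the PD-allocated start timestamp of the current transaction from a TiDB session; this is a minimal sketch, assuming the TiDB-specific session variable `tidb_current_ts` is available in your version:

```sql
-- Start an explicit transaction; TiDB fetches its start timestamp from PD's
-- timestamp oracle (TSO).
BEGIN;
-- tidb_current_ts exposes the current transaction's start timestamp.
SELECT @@tidb_current_ts;
ROLLBACK;
```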
Percolator Algorithm for Distributed Transactions
The Percolator algorithm, which originated at Google and was designed for massive scale with high throughput requirements, implements distributed transactions in TiKV. Originally developed for web crawling and incremental document indexing in Google’s search engine, Percolator uses Snapshot Isolation (SI) for enhanced parallelism. For an explanation of how SI works, see the detailed discussion below.
TiDB’s implementation of the Percolator algorithm optimizes for Online Transaction Processing (OLTP). TiDB transaction optimizations details the specific optimizations tailored for OLTP scenarios. Additionally, TiDB uses 1PC (single-phase commit) for the common case where all of a transaction’s keys can be committed in a single pre-write to TiKV.
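For reference, here is a sketch of how these fast paths are toggled, assuming TiDB v5.0 or later, where these system variables exist (and default to ON):

```sql
SET GLOBAL tidb_enable_1pc = ON;          -- one-phase commit fast path
SET GLOBAL tidb_enable_async_commit = ON; -- related async-commit optimization
```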
One notable feature of Percolator’s design is the statelessness of the transaction coordinator (the TiDB SQL node), which eliminates the need to persist any state locally. As an optimization, however, it may buffer uncommitted transactions locally; this helps in the 1PC case mentioned earlier. A cluster of TiKV nodes stores the locks and their states. Pre-write locks persist, but not for an extended period, so even in the rare case of a coordinator failure that leaves pre-write locks behind, the impact is minimal.
A single check of the transaction status, plus an optional resolve-lock RPC, efficiently handles read-write conflicts on the affected keys. Optimizations, such as caching transaction resolve states in the TiDB (TC) node and resolving all of a transaction’s locks within a region in a single operation, further reduce overhead. This lazy cleanup design has negligible impact on OLTP systems, even in scenarios with extremely low latency requirements. It also simplifies the complexity of maintaining transaction contexts.
Snapshot Isolation (SI)
Before diving into the detailed workings of Percolator, it’s beneficial to understand Snapshot Isolation, its guarantees, and anomalies. For a comprehensive explanation, refer to this document (and its references)[1]. Anyone interested in transactions and their isolation levels in databases will find this a valuable read.
A linked list with three nodes A, B, and C helps illustrate Snapshot Isolation.
Let’s say we have two transactions T1 and T2.
- T1 wants to update node C and T2 wants to update node A.
- T1 traverses the above list and reads the contents of node A=1 and B=2. It then sets the value of node C to 100.
- T2 changes the contents of node A to 100.
The changes by T1 and T2 create a read-write conflict between them, and normally one of the transactions would abort. Under Snapshot Isolation, however, T1 is not interested in the value of node A; it only reads it while traversing to reach C. We can order T2 before T1 or T1 before T2 and the end result is the same. One way to think about this is that T1 takes a snapshot of the linked list before it starts.
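Here is a minimal SQL sketch of this scenario; the `nodes` table and the session interleaving are illustrative, not from the original example:

```sql
-- Setup: a table standing in for the linked list A -> B -> C.
CREATE TABLE nodes (id CHAR(1) PRIMARY KEY, val INT);
INSERT INTO nodes VALUES ('A', 1), ('B', 2), ('C', 3);

-- Session 1 (T1): reads A and B while traversing, then updates C.
BEGIN;
SELECT id, val FROM nodes WHERE id IN ('A', 'B');  -- sees A=1, B=2 from its snapshot

-- Meanwhile, session 2 (T2) runs:
--   BEGIN; UPDATE nodes SET val = 100 WHERE id = 'A'; COMMIT;

UPDATE nodes SET val = 100 WHERE id = 'C';
COMMIT;  -- succeeds under SI: T1's and T2's write sets (C vs. A) are disjoint
```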
Serializable Isolation
In contrast, under a Serializable isolation level, T1 would take at least a shared lock on all the nodes it traverses (A and B), as per the two-phase locking (2PL) protocol. These read locks would block T2’s update of node A. PostgreSQL uses a variation called Serializable Snapshot Isolation (SSI)[2]. SSI takes shared locks, but these locks don’t block other transactions; instead, they are checked at commit time for serializability violations, and a violation causes a rollback at the commit stage. As a result, SSI can suffer serious performance problems when contention is high.
Serializability is stronger than Snapshot Isolation: it prevents write skew, which Snapshot Isolation does not. A write skew anomaly occurs when two concurrent transactions read different but related records, each updates the data it read, and both eventually commit. If a constraint ties those related records together such that they cannot be modified concurrently by multiple transactions, the end result violates the constraint.
Snapshot Isolation is more performant because it allows for more parallelism. As with MySQL/InnoDB under REPEATABLE READ, you can prevent write skew by using the SELECT ... FOR UPDATE syntax.
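For instance, here is a sketch of how SELECT ... FOR UPDATE closes the write-skew window; the `oncall` table and its invariant are a hypothetical illustration:

```sql
-- Invariant: at least one doctor must stay on duty.
-- Without FOR UPDATE, two sessions could each see two doctors on duty,
-- each take one off duty, and both commit, leaving nobody on duty (write skew).
BEGIN;
SELECT COUNT(*) FROM oncall WHERE on_duty = TRUE FOR UPDATE;  -- locks the rows it reads
UPDATE oncall SET on_duty = FALSE WHERE doctor = 'Alice';
COMMIT;
```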
As long as parallel transactions don’t modify the same piece of data, there is no conflict between the transactions. We can summarize the rules as follows:
- When transactions start they each take a snapshot of the database.
- Disjoint updates can continue in parallel.
- If another transaction (T2) running in parallel changes and commits values that a transaction (T1) snapshotted and also intends to modify, then T1 cannot commit its changes (the first committer wins).
Percolator Algorithm in Action
The Percolator algorithm provides the transaction ACID guarantees that we need using locks and timestamps under the hood. The global start timestamp (T2.start) recorded on the objects—e.g., on nodes A, B, and C in our example list—is what constitutes T2’s snapshot.
To be compatible with MySQL/InnoDB, TiDB advertises Snapshot Isolation as Repeatable Read. This introduces a subtle difference to keep in mind.
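You can see this from any TiDB session; a small check, noting that the exact variable name can vary by version:

```sql
-- TiDB reports its Snapshot Isolation level using MySQL's naming.
SELECT @@transaction_isolation;  -- returns 'REPEATABLE-READ'
```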
Let’s go through an example to better understand the Percolator algorithm. In our example, Bob wants to transfer $10 from his account to Alice’s account.
Step 1: Set Up the Accounts
First, we create the table and put some money in both accounts.
```sql
CREATE TABLE T (ac VARCHAR(20) PRIMARY KEY NOT NULL, balance DECIMAL(8,2));
BEGIN;
INSERT INTO T VALUES("Bob", 110);
INSERT INTO T VALUES("Alice", 90);
COMMIT;
```
The numbers 7 and 6 below are timestamps: 6 is the transaction’s start timestamp, and 7 is its commit timestamp. “7: Data@6” in the Write column means that it’s a “pointer” to the key’s Data column value at timestamp 6. Another way of reading this could be:
Bob’s balance := Bob.Data.find(6); // := 110
| Key | Data | Lock | Write |
| --- | --- | --- | --- |
| Bob | 7:<br>6: 110 | 7:<br>6: | 7: Data@6<br>6: |
| Alice | 7:<br>6: 90 | 7:<br>6: | 7: Data@6<br>6: |
Step 2: Transfer Money Into Account
Next, we transfer money from Bob’s account into Alice’s account.
```sql
BEGIN;
UPDATE T SET balance = balance - 10 WHERE ac = 'Bob';
UPDATE T SET balance = balance + 10 WHERE ac = 'Alice';
COMMIT;
```
The transaction goes through two phases: first the pre-write phase, where locks are acquired, and then the commit phase, where locks are released and the value is externalized. The pre-write phase does a temporary write to Bob’s account. We can see this in the Write column: the last committed pointer still points to the old value of 110. Percolator has a concept of a primary lock and secondary locks; each secondary lock points to the primary lock using the same pointer syntax mentioned above.
The “primary” lock is a randomly selected row from the transaction’s write set; it is the key that holds the lock all secondary locks point to, and it has no other significance.
| Key | Data | Lock | Write |
| --- | --- | --- | --- |
| Bob | 8: 100<br>7:<br>6: 110 | 8: primary<br>7:<br>6: | 8:<br>7: Data@6<br>6: |
| Alice | 7:<br>6: 90 | 7:<br>6: | 7: Data@6<br>6: |
Step 3: Update the Account
Next, we update Alice’s account, which is also a temporary write. This time we acquire a secondary lock. This lock “points” to the primary lock held by Bob at the same timestamp (8).
| Key | Data | Lock | Write |
| --- | --- | --- | --- |
| Bob | 8: 100<br>7:<br>6: 110 | 8: primary<br>7:<br>6: | 8:<br>7: Data@6<br>6: |
| Alice | 8: 100<br>7:<br>6: 90 | 8: primary@Bob<br>7:<br>6: | 8:<br>7: Data@6<br>6: |
Now we follow through with the second phase, in which we commit the data, release the locks, and externalize the new values. The new state of both balances after the commit and lock release follows the pointers in the Write column:
Bob’s balance := Bob.Data.find(8); // := 100
Alice’s balance := Alice.Data.find(8); // := 100
| Key | Data | Lock | Write |
| --- | --- | --- | --- |
| Bob | 9:<br>8: 100<br>7:<br>6: 110 | 9:<br>8:<br>7:<br>6: | 9: Data@8<br>8:<br>7: Data@6<br>6: |
| Alice | 9:<br>8: 100<br>7:<br>6: 90 | 9:<br>8:<br>7:<br>6: | 9: Data@8<br>8:<br>7: Data@6<br>6: |
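Reading the accounts back now reflects the committed transfer; a quick check against the table from Step 1:

```sql
SELECT ac, balance FROM T ORDER BY ac;
-- Alice | 100.00
-- Bob   | 100.00
```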
Handling Conflicts with Distributed Transactions
Conflicts are identified by checking the Lock column. A row can have many versions of its data, but it can have at most one lock at any given time. During a write operation, Percolator attempts to lock every affected row in the pre-write phase; if any lock acquisition fails, the transaction is rolled back. Because Percolator uses an optimistic locking algorithm, its transactional writes may encounter performance regressions in scenarios where conflicts occur frequently.
TiDB supports both the optimistic and the pessimistic transaction model. The pessimistic model is the one you will be used to if you are coming from a MySQL background. The optimistic model is used for single-statement auto-commit transactions.
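A sketch of switching between the two models in TiDB, using the `tidb_txn_mode` system variable and TiDB’s BEGIN extensions:

```sql
-- Set the default transaction mode for the session.
SET SESSION tidb_txn_mode = 'pessimistic';  -- or 'optimistic'

-- Or pick the mode per transaction.
BEGIN PESSIMISTIC;
UPDATE T SET balance = balance - 10 WHERE ac = 'Bob';
COMMIT;

BEGIN OPTIMISTIC;
UPDATE T SET balance = balance + 10 WHERE ac = 'Alice';
COMMIT;
```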
Comparing TiDB with Vitess
Vitess and TiDB are both distributed databases that support the MySQL wire protocol. They’re designed to provide scalability and high availability. However, they have very different architectures and use cases. Vitess acts more like a proxy over a cluster (or shards) of MySQL instances.
TiDB’s Distributed Transactions Remain ACID Compliant Across Multiple Shards
Vitess provides REPEATABLE READ for single-shard transactions only. The application developer has to be aware of the data location within the cluster, which is implicit in the key sharding strategy. If the application developer wants multi-shard transactions, READ COMMITTED is the only available option. Also, the transaction isolation setting is scoped only to the session level.
There are other issues with cross-shard transactions in Vitess. Since Vitess acts as a proxy, it has to issue independent commits on all the shards that participate in a cross-shard transaction. This means Vitess commits are not atomic in a multi-shard transaction: if any participating shard fails to commit, the burden falls on the application developer to explicitly roll back the partially committed changes on the shards that committed successfully before the failure. In contrast, TiDB’s distributed transactions are fully ACID compliant in all use cases.
The choice between Vitess and TiDB depends on the specific requirements of your application. If your focus is on simple key-value operations, optimized for single-shard transactions and high throughput, Vitess may be a suitable choice. However, if you need a distributed database with strong consistency, fully ACID-compliant transactions, support for complex SQL queries, and versatility across different workloads, TiDB is a robust option.
Conclusion
This post explores the efficiency and reliability of distributed transactions through the lens of TiDB. By leveraging the Percolator protocol, TiDB achieves ACID guarantees in a distributed environment.
Percolator uses the Snapshot Isolation model. This allows transactions to take snapshots of the database when they start, proceed with parallel updates, and maintain consistency across distributed nodes.
Comparisons with stronger isolation levels, such as Serializable, shed light on the nuanced trade-offs in performance and constraint enforcement. We also highlighted Snapshot Isolation’s compatibility with MySQL/InnoDB’s REPEATABLE READ.
Percolator’s role in providing robust ACID guarantees in distributed systems offers a compelling solution for balancing performance and transaction integrity.
TiDB excels in independently scaling both compute and storage, supporting global distribution, and optimizing operations for diverse data environments. Its modern distributed database design, complete with robust support for transactional ACID guarantees, enables seamless handling of evolving workloads and scaling requirements. This ensures consistent and stable performance in demanding scenarios.
Check out the following resources for more information:
- eBook: Modern Distributed Database Fundamentals
- On-demand: Introduction to TiDB
- Blog: Deep Dive into TiKV Transactions
References