Introduction to concurrency control
Serializability is the classical concurrency scheme. It ensures that a schedule for executing concurrent transactions is equivalent to one that executes the transactions serially in some order. Though serializablity is a great concept, it is hard to implement efficiently. A classical solution is a variant of Two-Phase Locking, aka 2PL. Using 2PL, the database management system (DBMS) maintains read and write locks to ensure that conflicting transactions are executed in a well-defined order, or in serializable execution schedules. Locking, however, has several drawbacks. First, readers and writers block each other. Second, most transactions are read-only and are therefore harmless from a transaction-ordering perspective. Under a locking-based isolation mechanism, no update transaction is allowed on a data object that is being read by a potentially long-running read transaction. Thus the update has to wait until the read finishes. This severely limits the degree of concurrency in the system.
Multi-Version Concurrency Control (MVCC) is an elegant solution for this problem, where each update creates a new version of the data object instead of updating it in-place, so that concurrent readers can still see the old version while the update transaction proceeds. Such a strategy can prevent read-only transactions from waiting. In fact, locking is not required at all. This is an extremely desirable property and the reason why many database systems like PostgreSQL, Oracle, and Microsoft SQL Server implement MVCC.
In this post, we will explore the complexities of implementation of MVCC in TiKV.
MVCC in TiKV
Let’s dive into TiKV
‘s MVCC implementation, which is located at src/storage.
Timestamp Oracle(TSO)
Since TiKV
is a distributed storage system, it needs a globally unique time service, called Timestamp Oracle
(TSO), to allocate a monotonic increasing timestamp. Similar to the TrueTime API from Google’s Spanner[1], this service is implemented in Placement Driver (PD) in TiKV. Every TS
represents a monotonic increasing timestamp.
Storage
To dive into the transaction part in TiKV
, src/storage is a good starting point. Storage
is a struct that actually receives the Get/Scan commands.
pub struct Storage {
engine: Box<Engine>,
sendch: SendCh<Msg>,
handle: Arc<Mutex<StorageHandle>>,
}
impl Storage {
pub fn start(&mut self, config: &Config) -> Result<()> {
let mut handle = self.handle.lock().unwrap();
if handle.handle.is_some() {
return Err(box_err!("scheduler is already running"));
}
let engine = self.engine.clone();
let builder = thread::Builder::new().name(thd_name!("storage-scheduler"));
let mut el = handle.event_loop.take().unwrap();
let sched_concurrency = config.sched_concurrency;
let sched_worker_pool_size = config.sched_worker_pool_size;
let sched_too_busy_threshold = config.sched_too_busy_threshold;
let ch = self.sendch.clone();
let h = try!(builder.spawn(move || {
let mut sched = Scheduler::new(engine,
ch,
sched_concurrency,
sched_worker_pool_size,
sched_too_busy_threshold);
if let Err(e) = el.run(&mut sched) {
panic!("scheduler run err:{:?}", e);
}
info!("scheduler stopped");
}));
handle.handle = Some(h);
Ok(())
}
}
This start
function in the example above explains how the storage struct runs.
Engine
Engine is the trait which describes the actual database used in the storage system. It is implemented in raftkv and rocksdb_engine.
StorageHandle
StorageHandle
is the struct that handles commands received from sendch
. The I/O is processed by mio
.
Then functions like async_get
and async_batch_get
in the storage
struct will send the corresponding commands to the channel, which can be obtained by the scheduler to execute asynchronously.
The MVCC layer is called in Scheduler.
The storage receives commands from clients and sends commands as messages to the scheduler. Then the scheduler will process the command or call corresponding asynchronous function. There are two types of operations – read and write. Read is implemented in MvccReader, which is easy to understand so we will not elaborate on it. Let’s focus on write, which is the core of MVCC implementation.
MVCC
Column family
Compared with Percolator where the information such as Lock is stored by adding an extra column to a specific row, TiKV uses a column family (CF) in RocksDB to handle all the information related to Lock. To be specific, TiKV stores the Key-Values, Locks and Writes information in CF_DEFAULT
, CF_LOCK
, and CF_WRITE
.
All the values of the CF are encoded as following:
| | Default | Lock | Write |
| — | — | — | — |
| Key | z{encoded_key}{start_ts(desc)} | z{encoded_key} | z{encoded_key}{commit_ts(desc)} |
| Value | {value} | {flag}{primary_key}{start_ts(varint)} | {flag}{start_ts(varint)} |
More details can be found here.
Transaction model
Here comes the core of the transaction model for TiKV, which is MVCC powered by 2-phase commit. There are two stages in one transaction:
Prewrite
- The transaction starts. The client obtains the current timestamp (
startTS
) from TSO. - Select one row as the primary row, the others as the secondary rows.
- Check whether there is another lock on this row or whether there are any commits located after
startTS
. These two situations will lead to conflicts. If either happens, the commit fails and rollback will be called. - Lock the primary row.
- Repeat the steps above on secondary rows.
- The transaction starts. The client obtains the current timestamp (
Commit
- Obtain the commit timestamp
commit_ts
from TSO. - Check whether the lock on the primary row still exists. Proceed if the lock is still there. Roll back if not.
- Write to column
CF_WRITE
withcommit_ts
. - Clear the corresponding primary lock.
- Repeat the steps above on secondary rows.
- Obtain the commit timestamp
Garbage collector
It is easy to predict that there will be more and more MVCC versions if there is no Garbage Collector to remove the invalid versions. But we cannot simply remove all the versions before a safe point, for there may be only one version for a key, which must be kept. In TiKV, if there is any Put
or Delete
records before the safe point, then all the latter writes can be deleted; otherwise only Delete
, Rollback
and Lock
will be deleted.
References:
Experience modern data infrastructure firsthand.
TiDB Cloud Dedicated
A fully-managed cloud DBaaS for predictable workloads
TiDB Cloud Serverless
A fully-managed cloud DBaaS for auto-scaling workloads