Architecture

Resources

Transcript

In this video, we are going to provide a brief introduction to how RocksDB works. Or even more specifically, how an LSM tree works, which is the basic data structure of RocksDB, which differs from InnoDB that uses a B+tree. We will also look at the filesystem to explain what you are seeing in TiKV's data directory.

Let's start with the basic operation of inserting new rows into RocksDB:

First, it's important to state that the “core data structure” of the LSM is around key-value pairs. So I will not be using the term rows, but key-value pairs. If your table just has a primary key, then that will be the key, and the row data will be the value. If you have additional indexes, then those will also have additional key-values, where the value will be the primary key. It is also important to state that despite being key-value, it is also transactional, so we can guarantee that multiple key-value pairs are updated atomically without any inconsistencies.

OK. So let's go through the example of inserting a new key-value pair in RocksDB:

Similar to InnoDB, the first stage of any modification is to log it. RocksDB has a write-ahead-log which records history of the change for recovery purposes only. In InnoDB, these are sometimes called transaction logs or redo logs, but the process is very similar. Step #1 is to make sure that if there was a crash, the log can be replayed and durability is ensured.

Once our key-value pair is in the write-ahead-log, it is inserted into an in-memory structure called a memtable. You can think of the memtable as like a buffer that is being filled up with changes. The goal is to be able to collect a batch of key-value pairs and then write them out in sorted order into an SST file.

SST stands for static-stored-table. This is the core file structure of an LSM tree like RocksDB. As the name suggests, SST files contain a set of key-value pairs that are sorted in order. They are also only ever written once, and never updated in place.

So if you look inside db/ of the RocksDB data directory, you will see a set of .sst files and a set of LOG files. The log files being the WAL that we mentioned are just prior.

So as we keep batch inserting new rows, the memtable will fill and keep writing out new SST files. If we change that operation, and say that is now an update operation: well, because we said that SST files are never modified in place, the operation is very similar. We will write out a new SST file with the new key-value pair that takes precedence over the old value.

So over time what might happen is that older SST files contain a lot of keys which have since been modified. It is now time to talk about one of the essential components of an LSM which is compaction. Compaction is the process of merging SST files and creating new SST files to replace them.

So in this example, we can do a very efficient merge-sort to combine these two SST files into a new SST. Because the SST files are read-only, this operation can be done in the background. When the operation is complete, we now have a new larger SST file to replace the earlier files, which can now be deleted.

So now that we've introduced merging of files, let's introduce levels. The initial SST files written out from the memtable can be considered Level zero. We then merge these files to create level 1, and those files will over time be merged to create level two, and so on. Compaction can be seen as a tradeoff, and the more frequently (and aggressively) it runs, the more efficient the LSM tree will be for read operations. Compaction can also help free disk space, and you might see the free space of your TiKV servers fluctuates slightly as it is happening.

RocksDB also allows different compression algorithms to be used per level, and in the default configuration, you will find that L0-L1 use no compression, L2-L4 use lz4 and L5-L6 use zstd, which while still fast, is much better for compression.

So we've now provided a simplified description of how RocksDB works, there is only one key detail which is pertinent and that is the bloom filter. You can think of a bloom filter as like a special type of index. For a given key, it can accurately determine if a given key may exist or definitely does not exist in an SST file. While the may exist might sound strange since it is not a hard guarantee, the bloom filter is small to keep in memory and helps ensure good performance of point-lookup queries, since all SST files do not need to be opened.