Google's F1 DDL Algorithm



So in our previous video, I asserted that DDL in TiDB is both online, and despite being a distributed system, is able to externalize schema changes everywhere at the same time. So how does it work?

The short version is that TiDB uses the methodology described in Google's Online, Asynchronous Schema Change in F1 paper. Let me try and summarize the methodology using an analogy:

  • You work for an international company with employees distributed across the world.
  • Your main communication channel with other employees is email.
  • The company has decided that it wants to switch all communication channels to Slack.
  • They want to both make the switch, and ensure that everyone makes the change at the same time.

So what is the best way to achieve this with everyone distributed?

Well, an easy way to do this is to announce the change well in advance. This way, everyone knows that at Noon UTC, it's time to switch to Slack and stop using email. And that is a part-solution to solving the DDL synchronization problem.

Let's complicate the scenario a little:

  • Your company uses Email.
  • It decided to switch to HipChat (posting notice in email).
  • It then decided to switch to Slack (posting notice in HipChat).

Let's also assume that as soon as it switched to Slack, it also shut down HipChat. In our naive scenario, we could have users on email that did not check-in quickly enough and now are not able to receive instruction in HipChat that they should now be using Slack.

So an improvement on our algorithm of “posting advance notice of schema changes” is to agree to check back in for schema changes at defined intervals. This is getting close to how DDL works in TiDB, where there will be a lease for a version of schema, with a set time to check back in for updates.

The other lesson that we can learn from this analogy is that when the company switched to HipChat, it should have forced all employees to be on it before switching to Slack. By agreeing to a lease time, there should only be a very brief window when both technologies are deployed which is the result of clock skew. Tolerating a small window allows the system to be online and not have a freeze where communication is not sent to email or HipChat.

The last lesson we can learn is that an individual change (such as adding an index) can be broken down into smaller incremental operations that are safe to transit from one step to the next. An example of this might be, if I add an index to a column, the first step should be to have a version of the schema that all TiDB servers upgrade to which starts maintaining the index. When it is ensured that all TiDB servers are maintaining the indexes, the next step is to sweep through and reading rows one-by-one to create the index. If we were not able to ensure all servers had a minimum version of the schema, there could be a rogue TiDB server which updated a row but didn't update the index. Thus our index would be corrupt. This staged deployment also helps introduce schema changes to existing queries that are in-flight, such as a long running update.

So that is the gist of the F1 paper, in my own words. What is also key to enforce here, is that the TiDB servers are both stateless, but also without a concept of membership. Having membership would make the scaling process more complicated, so schema change works on a concept of a lease instead. Our analogy of a distributed set of employees using email is useful here, in that in an alternative universe, employees might all be in the same office and you can just broadcast a schema change over the intercom system. To add a little more system detail: The meta data itself is stored in TiKV, but it is the PD server which is responsible for notifying TiDB instances that the schema version has changed.

If you would like to read in more detail, I recommend reading the F1 paper directly. It is included in the resources for this video. In the next step, we'll be performing DDL on our example database and verifying that it is indeed online.