Introduction to Distributed Databases


In introducing TiDB, I want to start by taking a quick tour of databases, answering what TiDB is and why it exists.

Our story begins in the 1960s with the first generation of databases. They were not yet relational, and the query language used to access records was closely tied to the storage format. This meant there was very little in the way of vendor independence, and changing the storage format (perhaps to add an optimization) necessitated changing how you accessed the data.

In the early 1970s, we saw the invention of the relational model and SQL. This invention addressed many of the issues of the earlier systems and presented an abstraction that was more friendly to developers. Queries were now possible in an independent, higher-level language, and not tied to the physical implementation. While it is important to mention that SQL is a living language and has had many updates to the standard, the origin of many of the relational databases we use today can be traced back to this point.

From the 1970s to the early 2000s, SQL-based relational databases dominated. While SQL was “standard”, many of the vendors also offered proprietary stored procedure languages that were highly optimized, but reduced the portability between platforms. While I would be hesitant to call any of these systems slow, many of them were inherently designed to be run on a single server. And since they trace back to the 1970s, this should not be surprising. At the time it was uncommon for a company to have very many computers, let alone dedicated servers just for databases.

But the single-server design really started to become problematic with the rise of the Internet. The canonical answer to it, at least in the early 2000s, was to use sharding plus read replicas, often with MySQL as the database. And while I have fond memories of these times personally, I think it is important to clarify that this was not really a feature of the database, but a feature of the application code, which knew the correct database server to connect to for a particular query. It also relied on the developer carefully planning their table design, because cross-shard queries were either extremely difficult or impossible. So in some respects the limits of the 1960s had unfortunately returned. Often a cache such as memcached was also used, which helped alleviate load on the database. But the cache itself was a lookaside cache, and pushed more logic into the application.

Around the mid 2000s we saw the introduction of NoSQL databases. And while I want to clarify that NoSQL as a term really covers a broad family of technologies, a common thread between them was addressing the scalability issues of traditional relational databases. By (in many cases) changing the data model slightly, and reducing some of the data consistency guarantees of the system, they were able to scale horizontally by just adding nodes.

NoSQL is now typically used to mean “not only SQL”, since some of these systems do offer query languages that closely resemble SQL anyway. But what I think is important to mention here is that, much like sharding, where complexity needs to be managed in application code, some of the guarantees that NoSQL databases took away also led to additional application code.

I think as an industry we often incorrectly cite banks or money transfers as the only systems that require transactions and durability. But consider a hypothetical application that has either user permissions or privacy settings:

If I, as a user, make a change to my settings and then see a web page come back saying that the changes have been applied, I do not expect that this database activity will be lost because the database crashed two seconds later. Similarly, in ecommerce, a user might place an order and receive an email confirmation saying that the order was successful, only for the order to be lost because the system crashed.

So in both cases the scalability needs pushed responsibility back onto those who developed and managed the system. Developers now had a choice of which set of trade-offs they were more willing to accept.

The next generation of systems that we are now seeing emerge are commonly referred to as NewSQL: in essence, SQL-based relational databases with the scalability characteristics of NoSQL. Many of these systems draw inspiration from Google Spanner, which was really amongst the first of the horizontally scalable systems with strong consistency guarantees, a departure from NoSQL.

This is the family of systems that TiDB belongs to. TiDB uses a Spanner-inspired architecture and speaks the MySQL network protocol, so applications can connect to it using an existing MySQL connector. While there are some semantic differences in TiDB to make it distributed, for application developers using it, it is much closer to the traditional relational databases spanning back to the 1970s than it is to NoSQL databases.

One such difference is that TiDB uses optimistic locking rather than the more traditional pessimistic locking used by InnoDB. Let me show you a quick example of a behavior difference that might surprise you:

create table t1 (c1 int);
create table t2 (c2 int);
create table t3 (c3 int);

insert into t1 values(50);
insert into t2 values(0);
insert into t3 values(0);

set autocommit=0;

-- transaction 1
  update t1 set c1 = c1 - 50 where c1 >= 50;
  -- Success, affected rows is 1
  update t2 set c2 = c2 + 50;

-- transaction 2
  update t1 set c1 = c1 - 10 where c1 >= 10;
  -- Reports success with affected rows = 1, but the real affected rows is 0.
  -- If the client uses affected rows as the signal that the update took
  -- effect, it will be misled, because TiDB will retry this transaction.
  update t3 set c3 = c3 + 10;

Due to optimistic locking, the affected rows may differ between MySQL and TiDB, and your application should also expect that COMMIT may result in an error.
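Because conflicts under optimistic locking only surface at COMMIT time, applications are usually expected to re-run the whole transaction when the commit fails. Here is a minimal sketch of that retry pattern in Python; note that `RetryableCommitError` and `commit_with_retry` are hypothetical names used for illustration, not part of any TiDB client API, and the simulated transaction stands in for real database work:

```python
class RetryableCommitError(Exception):
    """Stand-in for a commit-time write-conflict error from the database."""

def commit_with_retry(run_txn, max_retries=3):
    # Under optimistic locking, a conflict is only detected at COMMIT,
    # so the safe response is to re-run the entire transaction body.
    for attempt in range(1, max_retries + 1):
        try:
            return run_txn()
        except RetryableCommitError:
            if attempt == max_retries:
                raise  # give up after max_retries attempts

# Hypothetical transaction body that conflicts twice before succeeding,
# simulating two commit-time write conflicts.
attempts = {"n": 0}

def txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableCommitError()
    return "committed"

print(commit_with_retry(txn))  # prints "committed" on the third attempt
```

The key design point is that the retry wraps the whole transaction body, not just the COMMIT statement: any values read inside the transaction may be stale after a conflict, so they must be re-read.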

We will have a dedicated section that looks at the compatibility between TiDB and MySQL, but for now let’s continue the introduction to the TiDB Platform.