Character Set and Collation

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

Currently, TiDB supports the following character sets:

SHOW CHARACTER SET;
+---------|---------------|-------------------|--------+
| Charset | Description   | Default collation | Maxlen |
+---------|---------------|-------------------|--------+
| utf8    | UTF-8 Unicode | utf8_bin          |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin       |      4 |
| ascii   | US ASCII      | ascii_bin         |      1 |
| latin1  | Latin1        | latin1_bin        |      1 |
| binary  | binary        | binary            |      1 |
+---------|---------------|-------------------|--------+
5 rows in set (0.00 sec)

Note:

Each character set might correspond to multiple collations, but by default each character set corresponds to only one collation.

You can use the following statement to view the collations (under the new framework for collations) that corresponds to the character set.

SHOW COLLATION WHERE Charset = 'utf8mb4';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_bin        | utf8mb4 |   46 | Yes     | Yes      |       1 |
| utf8mb4_general_ci | utf8mb4 |   45 |         | Yes      |       1 |
+--------------------+---------+------+---------+----------+---------+
2 rows in set (0.00 sec)

Cluster character set and collation

Not supported yet.

Database character set and collation

Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:

CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

DATABASE can be replaced with SCHEMA here.

Different databases can use different character sets and collations. Use the character_set_database and collation_database to see the character set and collation of the current database:

create schema test1 character set utf8mb4 COLLATE uft8mb4_general_ci;
Query OK, 0 rows affected (0.09 sec)
use test1;
Database changed
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| utf8mb4                  | uft8mb4_general_ci   |
+--------------------------|----------------------+
1 row in set (0.00 sec)
create schema test2 character set latin1 COLLATE latin1_bin;
Query OK, 0 rows affected (0.09 sec)
use test2;
Database changed
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| latin1                   | latin1_bin           |
+--------------------------|----------------------+
1 row in set (0.00 sec)

You can also see the two values in INFORMATION_SCHEMA:

SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';

Table character set and collation

You can use the following statement to specify the character set and collation for tables:

CREATE TABLE tbl_name (column_list)
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]]

ALTER TABLE tbl_name
    [[DEFAULT] CHARACTER SET charset_name]
    [COLLATE collation_name]

For example:

CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.08 sec)

If the table character set and collation are not specified, the database character set and collation are used as their default values.

Column character set and collation

You can use the following statement to specify the character set and collation for columns:

col_name {CHAR | VARCHAR | TEXT} (col_length)
    [CHARACTER SET charset_name]
    [COLLATE collation_name]

col_name {ENUM | SET} (val_list)
    [CHARACTER SET charset_name]
    [COLLATE collation_name]

If the column character set and collation are not specified, the table character set and collation are used as their default values.

String character sets and collation

Each string corresponds to a character set and a collation. When you use a string, this option is available:

[_charset_name]'string' [COLLATE collation_name]

Example:

SELECT 'string';
SELECT _utf8mb4'string';
SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;

Rules:

  • Rule 1: If you specify CHARACTER SET charset_name and COLLATE collation_name, then the charset_name character set and the collation_name collation are used directly.
  • Rule 2: If you specify CHARACTER SET charset_name but do not specify COLLATE collation_name, the charset_name character set and the default collation of charset_name are used.
  • Rule 3: If you specify neither CHARACTER SET charset_name nor COLLATE collation_name, the character set and collation given by the system variables character_set_connection and collation_connection are used.

Client connection character set and collation

  • The server character set and collation are the values of the character_set_server and collation_server system variables.

  • The character set and collation of the default database are the values of the character_set_database and collation_database system variables.

You can use character_set_connection and collation_connection to specify the character set and collation for each connection. The character_set_client variable is to set the client character set.

Before returning the result, the character_set_results system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.

You can use the following statement to set the character set and collation that is related to the client:

  • SET NAMES 'charset_name' [COLLATE 'collation_name']

    SET NAMES indicates what character set the client will use to send SQL statements to the server. SET NAMES utf8mb4 indicates that all the requests from the client use utf8mb4, as well as the results from the server.

    The SET NAMES 'charset_name' statement is equivalent to the following statement combination:

    SET character_set_client = charset_name;
    SET character_set_results = charset_name;
    SET character_set_connection = charset_name;
    

    COLLATE is optional, if absent, the default collation of the charset_name is used.

  • SET CHARACTER SET 'charset_name'

    Similar to SET NAMES, the SET NAMES 'charset_name' statement is equivalent to the following statement combination:

    SET character_set_client = charset_name;
    SET character_set_results = charset_name;
    SET collation_connection = @@collation_database;
    

Optimization levels of character sets and collations

String > Column > Table > Database > Server > Cluster

General rules on selecting character sets and collation

  • Rule 1: If you specify CHARACTER SET charset_name and COLLATE collation_name, then the charset_name character set and the collation_name collation are used directly.
  • Rule 2: If you specify CHARACTER SET charset_name and do not specify COLLATE collation_name, then the charset_name character set and the default collation of charset_name are used.
  • Rule 3: If you specify neither CHARACTER SET charset_name nor COLLATE collation_name, the character set and collation with higher optimization levels are used.

Validity check of characters

If the specified character set is utf8 or utf8mb4, TiDB only supports the valid utf8 characters. For invalid characters, TiDB reports the incorrect utf8 value error. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.

To disable this error reporting, use set @@tidb_skip_utf8_check=1; to skip the character check.

Collation support framework

The syntax support and semantic support for the collation are influenced by the new_collations_enabled_on_first_bootstrap configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings.

Before v4.0, TiDB provides only the old framework for collations. In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations.

Since v4.0, TiDB supports a new framework for collations. In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings.

Old framework for collations

Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB deletes the space at the end of the character according to the PADDING attribute of the collation before comparing characters, which causes the following behavior differences:

create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key);
Query OK, 0 rows affected
insert into t values ('A');
Query OK, 1 row affected
insert into t values ('a');
Query OK, 1 row affected # In MySQL, because utf8mb4_general_ci is case-insensitive, the `Duplicate entry 'a'` error is reported.
insert into t1 values ('a ');
Query OK, 1 row affected # In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned.

New framework for collations

In TiDB 4.0, a complete framework for collations is introduced. This new framework supports semantically parsing collations and introduces the new_collations_enabled_on_first_bootstrap configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the new_collation_enabled variable in the mysql.tidb table:

select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| True           |
+----------------+
1 row in set (0.00 sec)

Under the new framework, TiDB support the utf8_general_ci and utf8mb4_general_ci collations which are compatible with MySQL.

When utf8_general_ci or utf8mb4_general_ci is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's PADDING behavior:

create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key);
Query OK, 0 rows affected (0.00 sec)
insert into t values ('A');
Query OK, 1 row affected (0.00 sec)
insert into t values ('a');
ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY'
insert into t values ('a ');
ERROR 1062 (23000): Duplicate entry 'a ' for key 'PRIMARY'

Note:

The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of 'a' < 'a\t' in TiDB is 1, but in MySQL, 'a' < 'a\t' is equivalent to 'a ' < 'a\t', and the result is 0.

Coercibility values of collations in expressions

If an expression involves multiple clauses of different collations, you need to infer the collation used in the calculation. The rules are as follows:

  • The coercibility value of the explicit COLLATE clause is 0.
  • If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is 1. Currently, all implemented collations are compatible with each other.
  • The column's collation has a coercibility value of 2.
  • The system constant (the string returned by USER () or VERSION ()) has a coercibility value of 3.
  • The coercibility value of constants is 4.
  • The coercibility value of numbers or intermediate variables is 5.
  • NULL or expressions derived from NULL has a coercibility value of 6.

When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority:

binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin

If the collations of two clauses are different and the coercibility value of both clauses is 0, TiDB cannot infer the collation and reports an error.

COLLATE clause

TiDB supports using the COLLATE clause to specify the collation of an expression. The coercibility value of this expression is 0, which has the highest priority. See the following example:

select 'a' = 'A' collate utf8mb4_general_ci;
+--------------------------------------+
| 'a' = 'A' collate utf8mb4_general_ci |
+--------------------------------------+
|                                    1 |
+--------------------------------------+
1 row in set (0.00 sec)

For more details, see Connection Character Sets and Collations.