MySQL 8.0 Reference Manual（讀書筆記38節-- 字元編碼(5)）

我們討論面試中各大廠的SQL演算法面試題，往往核心考點就在於視窗函數，所以掌握好了視窗函數，面對SQL演算法面試往往事半功倍。 ...

1.Supported Character Sets and Collations

To list the available character sets and their default collations, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.

mysql> SHOW CHARACTER SET;

In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

2.Unicode Character Sets

2.1 Unicode Collation Algorithm (UCA) Versions

MySQL implements【ˈɪmplɪments 實施;執行;貫徹;使生效;】 the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The xxx_unicode_ci collations have only partial【ˈpɑːrʃl 部分的，不完全的;<數>偏的;偏向一方的，偏袒的，不公平的;偏愛的，癖好的;】 support for the Unicode Collation Algorithm. Some characters are not supported, and combining【kəmˈbaɪnɪŋ (使)結合，組合，聯合，混合;使融合(或並存);兼有;兼備;同時做(兩件或以上的事);兼做;兼辦;】 marks are not fully supported. This affects languages such as Vietnamese, Yoruba, and Navajo. A combined character is considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the CHAR_LENGTH() function or in result set metadata).

Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Examples:

• utf8mb4_unicode_520_ci is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/ UCA/5.2.0/allkeys.txt),

• utf8mb4_0900_ai_ci is based on UCA 9.0.0 weight keys (http://www.unicode.org/Public/UCA/9.0.0/ allkeys.txt).

The LOWER() and UPPER() functions perform case folding【ˈfoʊldɪŋ 可摺疊的;摺疊式的;】 according to the collation of their argument. A character that has uppercase and lowercase versions only in a Unicode version higher than 4.0.0 is converted by these functions only if the argument collation uses a high enough UCA version.

2.2 Collation Pad Attributes

Collations based on UCA 9.0.0 and higher are faster than collations based on UCA versions prior to 9.0.0. They also have a pad attribute of NO PAD, in contrast to PAD SPACE as used in collations based on UCA versions prior to 9.0.0. For comparison of nonbinary strings, NO PAD collations treat spaces at the end of strings like any other characte.

To determine the pad attribute for a collation, use the INFORMATION_SCHEMA COLLATIONS table, which has a PAD_ATTRIBUTE column. For example:

mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE
 FROM INFORMATION_SCHEMA.COLLATIONS
 WHERE CHARACTER_SET_NAME = 'utf8mb4';
+----------------------------+---------------+
| COLLATION_NAME             | PAD_ATTRIBUTE |
+----------------------------+---------------+
| utf8mb4_general_ci         | PAD SPACE     |
| utf8mb4_bin                | PAD SPACE     | 
| utf8mb4_unicode_ci         | PAD SPACE     |
| utf8mb4_icelandic_ci       | PAD SPACE     |
...
| utf8mb4_0900_ai_ci         | NO PAD        |
| utf8mb4_de_pb_0900_ai_ci   | NO PAD        |
| utf8mb4_is_0900_ai_ci      | NO PAD        |
...

Comparison of nonbinary string values (CHAR, VARCHAR, and TEXT) that have a NO PAD collation differ from other collations with respect【rɪˈspekt 尊敬，尊重，敬重;（事物的）方面;關係，關聯;註重，重視;敬意，問候;細節;著眼點;】 to trailing spaces. For example, 'a' and 'a ' compare as different strings, not the same string. This can be seen using the binary collations for utf8mb4. The pad attribute for utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations:

mysql> CREATE TABLE t1 (c CHAR(10) COLLATE utf8mb4_bin);
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO t1 VALUES('a');
Query OK, 1 row affected (0.01 sec)
mysql> SELECT * FROM t1 WHERE c = 'a ';
+------+
| c    |
+------+
| a |
+------+
1 row in set (0.00 sec)
mysql> ALTER TABLE t1 MODIFY c CHAR(10) COLLATE utf8mb4_0900_bin;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM t1 WHERE c = 'a ';
Empty set (0.00 sec)

2.3 Language-Specific Collations

MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring【ˈteɪlərɪŋ 專門製作;訂做;】 rules.For questions about particular language orderings, unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.

For example, the nonlanguage-specific utf8mb4_0900_ai_ci and language-specific utf8mb4_LOCALE_0900_ai_ci Unicode collations each have these characteristics:

• The collation is based on UCA 9.0.0 and CLDR v30, is accent-insensitive and case-insensitive. These characteristics are indicated by _0900, _ai, and _ci in the collation name. Exception: utf8mb4_la_0900_ai_ci is not based on CLDR because Classical Latin is not defined in CLDR.

• The collation works for all characters in the range [U+0, U+10FFFF].

• If the collation is not language specific, it sorts all characters, including supplementary characters, in default order (described following). If the collation is language specific, it sorts characters of the language correctly according to language-specific rules, and characters not in the language in default order.

• By default, the collation sorts characters having a code point listed in the DUCET table (Default Unicode Collation Element Table) according to the weight value assigned in the table. The collation sorts characters not having a code point listed in the DUCET table using their implicit weight value, which is constructed according to the UCA.

• For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order.

2.4 _general_ci Versus _unicode_ci Collations

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8mb4_general_ci collation are faster, but slightly less correct, than comparisons for utf8mb4_unicode_ci. The reason is that utf8mb4_unicode_ci supports mappings such as expansions【ɪkˈspænʃənz 膨脹;擴展;擴張;擴大;】; that is, when one character compares as equal to combinations of other characters. For example, ß is equal to ss in German and some other languages. utf8mb4_unicode_ci also supports contractions【kənˈtrækʃənz 收縮;(肌肉的)收縮，攣縮;縮小;】 and ignorable characters. utf8mb4_general_ci is a legacy【ˈleɡəsi 遺產;遺留;後遺症;遺贈財物;】 collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

MySQL implements language-specific Unicode collations if the ordering with utf8mb4_unicode_ci does not work well for a language. For example, utf8mb4_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8mb4 collations. utf8mb4_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not to ss. If this is acceptable for your application, you should use utf8mb4_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8mb4_unicode_ci because it is more accurate.

2.5 Character Collating Weights

A character's collating weight is determined as follows:

• For all Unicode collations except the _bin (binary) collations, MySQL performs a table lookup to find a character's collating weight.

• For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.

• For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.

Collating weights can be displayed using the WEIGHT_STRING() function. If a collation uses a weight lookup table, but a character is not in the table (for example, because it is a “new” character), collating weight determination becomes more complex:

• For BMP characters in general collations (xxx_general_ci), the weight is the code point.

• For BMP characters in UCA collations (for example, xxx_unicode_ci and language-specific collations), the following algorithm applies:

if (code >= 0x3400 && code <= 0x4DB5)
 base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0x4E00 && code <= 0x9FA5)
 base= 0xFB40; /* CJK Ideograph */
else
 base= 0xFBC0; /* All other characters */
aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;

• For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA 4.0.0 collations, their collating weight is 0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

• For supplementary characters based on UCA versions higher than 4.0.0 (for example, xxx_unicode_520_ci), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCA allkeys.txt file. Others have weights calculated from this algorithm:

aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;

There is a difference between “ordering by the character's code value” and “ordering by the character's binary representation,” a difference that appears only with utf16_bin, because of surrogates.

3. West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

4. Central European Character Sets

MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).

5.South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.

6.Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.

7.Cyrillic Character Sets

The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.

8.Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters.

• big5 (Big5 Traditional Chinese) collations:

big5_bin
big5_chinese_ci (default)

The gb18030 Character Set

In MySQL, the gb18030 character set corresponds to the “Chinese National Standard GB 18030-2005: Information technology—Chinese coded character set”, which is the official character set of the People's Republic of China (PRC).

Characteristics of the MySQL gb18030 Character Set

• Supports all code points defined by the GB 18030-2005 standard. Unassigned code points in the ranges (GB+8431A439, GB+90308130) and (GB+E3329A36, GB+EF39EF39) are treated as '?' (0x3F). Conversion of unassigned code points return '?'.

• Supports UPPER and LOWER conversion for all GB18030 code points. Case folding defined by Unicode is also supported (based on CaseFolding-6.3.0.txt).

• Supports Conversion of data to and from other character sets.

• Supports SQL statements such as SET NAMES.

• Supports comparison between gb18030 strings, and between gb18030 strings and strings of other character sets. There is a conversion if strings have different character sets. Comparisons that include or ignore trailing spaces are also supported.

• The private use area (U+E000, U+F8FF) in Unicode is mapped to gb18030.

• There is no mapping between (U+D800, U+DFFF) and GB18030. Attempted conversion of code points in this range returns '?'.

• If an incoming sequence is illegal, an error or warning is returned. If an illegal sequence is used in CONVERT(), an error is returned. Otherwise, a warning is returned.

• For consistency with utf8mb3 and utf8mb4, UPPER is not supported for ligatures.

• Searches for ligatures also match uppercase ligatures when using the gb18030_unicode_520_ci collation.

• If a character has more than one uppercase character, the chosen uppercase character is the one whose lowercase is the character itself.

• The minimum multibyte length is 1 and the maximum is 4. The character set determines the length of a sequence using the first 1 or 2 bytes.

Supported Collations

• gb18030_bin: A binary collation.

• gb18030_chinese_ci: The default collation, which supports Pinyin. Sorting of non-Chinese characters is based on the order of the original sort key. The original sort key is GB(UPPER(ch)) if UPPER(ch) exists. Otherwise, the original sort key is GB(ch). Chinese characters are sorted according to the Pinyin collation defined in the Unicode Common Locale Data Repository (CLDR 24). Non-Chinese characters are sorted before Chinese characters with the exception of GB+FE39FE39, which is the code point maximum.

• gb18030_unicode_520_ci: A Unicode collation. Use this collation if you need to ensure that ligatures are sorted correctly.

9.The Binary Character Set

The binary character set is the character set for binary strings, which are sequences of bytes. The binary character set has one collation, also named binary. Comparison and sorting are based on numeric byte values, rather than on numeric character code values (which for multibyte characters differ from numeric byte values). For information about the differences between the binary collation of the binary character set and the _bin collations of nonbinary character sets.

For the binary character set, the concepts of lettercase and accent equivalence【ɪˈkwɪvələns (用途、功能、尺寸、價值等)相等，對等，相同;】 do not apply:

• For single-byte characters stored as binary strings, character and byte boundaries【ˈbaʊndəriz 邊界;界限;分界線;使球越過邊界線的擊球(得加分);】 are the same, so lettercase and accent differences are significant【sɪɡˈnɪfɪkənt 重要的, 有重大意義的;顯著的, 值得註意的;<統>顯著的, 有效的;（詞綴等）有意義的;不可忽略的, 值得註意的;相當數量的;別有含義的, 意味深長的;（語言上）區別性的;】 in comparisons. That is, the binary collation is case-sensitive and accent-sensitive.

mysql> SET NAMES 'binary';
mysql> SELECT CHARSET('abc'), COLLATION('abc');
+----------------+------------------+
| CHARSET('abc') | COLLATION('abc') |
+----------------+------------------+
| binary         | binary           |
+----------------+------------------+
mysql> SELECT 'abc' = 'ABC', 'a' = 'ä';
+---------------+------------+
| 'abc' = 'ABC' | 'a' = 'ä'  |
+---------------+------------+
| 0             | 0          |
+---------------+------------+

• For multibyte characters stored as binary strings, character and byte boundaries differ. Character boundaries are lost, so comparisons that depend on them are not meaningful.

To perform lettercase conversion of a binary string, first convert it to a nonbinary string using a character set appropriate for the data stored in the string:

mysql> SET @str = BINARY 'New York';
mysql> SELECT LOWER(@str), LOWER(CONVERT(@str USING utf8mb4));
+-------------+------------------------------------+
| LOWER(@str) | LOWER(CONVERT(@str USING utf8mb4)) |
+-------------+------------------------------------+
| New York    | new york                           |
+-------------+------------------------------------+

To convert a string expression to a binary string, these constructs are equivalent:

BINARY expr
CAST(expr AS BINARY)
CONVERT(expr USING BINARY)

9.MySQL Server Locale Support

The locale indicated by the lc_time_names system variable controls the language used to display day and month names and abbreviations. This variable affects the output from the DATE_FORMAT(), DAYNAME(), and MONTHNAME() functions.

lc_time_names does not affect the STR_TO_DATE() or GET_FORMAT() function.

The lc_time_names value does not affect the result from FORMAT(), but this function takes an optional third parameter that enables a locale to be specified to be used for the result number's decimal point, thousands separator, and grouping between separators. Permissible locale values are the same as the legal values for the lc_time_names system variable

Locale names have language and region subtags listed by IANA (http://www.iana.org/assignments/ language-subtag-registry) such as 'ja_JP' or 'pt_BR'. The default value is 'en_US' regardless of your system's locale setting, but you can set the value at server startup, or set the GLOBAL value at runtime if you have privileges sufficient to set global system variables.

Any client can examine the value of lc_time_names or set its SESSION value to affect the locale for its own connection.

10.Character Set Configuration

The MySQL server has a compiled-in default character set and collation. To change these defaults, use the --character-set-server and --collation-server options when you start the server.The collation must be a legal【ˈliːɡl 合法的;法律的;法律要求的;法律允許的;與法律有關的;】 collation for the default character set. To determine which collations are available for each character set, use the SHOW COLLATION statement or query the INFORMATION_SCHEMA COLLATIONS table.

If you try to use a character set that is not compiled into your binary, you might run into the following problems:

• If your program uses an incorrect path to determine where the character sets are stored (which is typically the share/mysql/charsets or share/charsets directory under the MySQL installation directory), this can be fixed by using the --character-sets-dir option when you run the program.

• If the character set is a complex character set that cannot be loaded dynamically, you must recompile the program with support for the character set.

• If the character set is a dynamic character set, but you do not have a configuration file for it, you should install the configuration file for the character set from a new MySQL distribution. • If your character set index file (Index.xml) does not contain the name for the character set, your program displays an error message:

Character set 'charset_name' is not a compiled character set and is not
specified in the '/usr/share/mysql/charsets/Index.xml' file

To solve this problem, you should either get a new index file or manually add the name of any missing character sets to the current file.

You can force client programs to use specific character set as follows:

[client]
default-character-set=charset_name

This is normally unnecessary. However, when character_set_system differs from character_set_server or character_set_client, and you input characters manually (as database object identifiers, column values, or both), these may be displayed incorrectly in output from the client or the output itself may be formatted incorrectly. In such cases, starting the mysql client with --default-character-set=system_character_set—that is, setting the client character set to match the system character set—should fix the problem.

11. Restrictions on Character Sets

• Identifiers are stored in mysql database tables (user, db, and so forth) using utf8mb3, but identifiers can contain only characters in the Basic Multilingual Plane (BMP). Supplementary characters are not permitted in identifiers.

• The ucs2, utf16, utf16le, and utf32 character sets have the following restrictions:

None of them can be used as the client character set.
It is currently not possible to use LOAD DATA to load data files that use these character sets.
FULLTEXT indexes cannot be created on a column that uses any of these character sets. However, you can perform IN BOOLEAN MODE searches on the column without an index.

• The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

12.Setting the Error Message Language

By default, mysqld produces error messages in English, but they can be displayed instead in any of several other languages: Czech, Danish, Dutch, Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or Swedish. This applies to messages the server writes to the error log and sends to clients.

The server searches for the error message file using these rules:

• It looks for the file in a directory constructed from two system variable values, lc_messages_dir and lc_messages, with the latter converted to a language name. Suppose that you start the server using this command:

mysqld --lc_messages_dir=/usr/share/mysql --lc_messages=fr_FR

In this case, mysqld maps the locale fr_FR to the language french and looks for the error file in the / usr/share/mysql/french directory.

By default, the language files are located in the share/mysql/LANGUAGE directory under the MySQL base directory.

• If the message file cannot be found in the directory constructed as just described, the server ignores the lc_messages value and uses only the lc_messages_dir value as the location in which to look.

• If the server cannot find the configured message file, it writes a message to the error log to indicate the problem and defaults to built-in English messages.

The lc_messages_dir system variable can be set only at server startup and has only a global read-only value at runtime. lc_messages can be set at server startup and has global and session values that can be modified at runtime. Thus, the error message language can be changed while the server is running, and each client can have its own error message language by setting its session lc_messages value to the desired locale name. For example, if the server is using the fr_FR locale for error messages, a client can execute this statement to receive error messages in English:

SET lc_messages = 'en_US';