<! Title:記一次MySQL主從同步延遲處理 Date:2019.09.25 Key:MySQL,gtid,relay 在MySQL官方版本中,為了保證其的高可用性,一般情況我們會採用主從複製的方式來解決。當然,方法很多。而我們今天所要處理的是採用GTID方式並且開了多線程複製後,仍然延遲的情 ...
在MySQL官方版本中,為了保證其的高可用性,一般情況我們會採用主從複製的方式來解決。當然,方法很多。而我們今天所要處理的是採用GTID方式並且開了多線程複製後,仍然延遲的情況,糟糕的是,延遲還在不斷擴大!
序號 | 清單 | 說明 |
1 | 系統 | Redhat 6.x(4c,32g) |
2 | 資料庫 | MySQL-5.7.25 |
3 | 同步方式 | 基於GTID主從同步 |
1)已經配置的重要參數:
relay for slave
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 6
master_info_repository = TABLE
relay_log_info_repository = TABLE
relay_log_recovery = on
sync_relay_log = 10000
註:此時沒有設置
slave_preserve_commit_order
參數。
2)從庫延遲狀態查詢
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: xxx.xxx.xxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.008978
Read_Master_Log_Pos: 696914605
Relay_Log_File: DB41-relay-bin.001259
Relay_Log_Pos: 207377582
Relay_Master_Log_File: mysql-bin.008970
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: neteagle3
Replicate_Ignore_DB: mysql
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1068770059
Relay_Log_Space: 8425484286
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 187358
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 42
Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276
Executed_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
簡單介紹一下幾個指標信息:
- Master_Log_File
- Read_Master_Log_Pos
- Seconds_Behind_Master
- Relay_Log_File
- Relay_Log_Pos
- Relay_Master_Log_File
- Exec_Master_Log_Pos
Master_Log_File, Read_Master_Log_Pos:
這兩個參數是成對的,表示的是從庫IO thread傳輸主庫的binlog日誌號及具體位置。
Relay_Log_File, Relay_Log_Pos:
這兩個參數也是成對的,表示的是從庫sql thread應用中繼日誌(relay log)號及具體位置。
Relay_Master_Log_File, Relay_Log_Pos:
這兩個參數也是成對的,表示的是上一項中的中繼日誌對應的主庫binlog日誌及具體位置(有點繞)。
Seconds_Behind_Master:
此參數可簡單理解為主從延遲時間,單位為秒。
從上面這段MySQL從狀態信息中,我們可以看到,Seconds_Behind_Master: 187358
這是從庫sql應用延遲主庫的時間為187358秒,轉換成天,大概兩天多。這說明,我們從庫複製的數據是兩天前的。
3)驗證並行複製
mysql> show full processlist;
---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- -----------------------
| Id | User | Host | db | Command | Time | State | Info |
---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- -----------------------
| 1 | system user | | NULL | Connect | 18204 | Waiting for master to send event | NULL |
| 2 | system user | | NULL | Connect | 0 | Waiting for dependent transaction to commit | NULL |
| 3 | system user | | NULL | Connect | 154914 | System lock | NULL |
| 4 | system user | | NULL | Connect | 154914 | Waiting for an event from Coordinator | NULL |
| 5 | system user | | NULL | Connect | 154918 | Waiting for an event from Coordinator | NULL |
| 6 | system user | | NULL | Connect | 155525 | Waiting for an event from Coordinator | NULL |
| 7 | system user | | NULL | Connect | 180427 | Waiting for an event from Coordinator | NULL |
| 8 | system user | | NULL | Connect | 18204 | Waiting for an event from Coordinator | NULL |
| 10 | root | localhost | neteagle3 | Query | 0 | starting | show full processlist |
| 11 | repl | DBSlave:9683 | NULL | Binlog Dump GTID | 18156 | Master has sent all binlog to slave; waiting for more updates | NULL |
| 13 | root | localhost | neteagle3 | Sleep | 4962 | | NULL |
---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- -----------------------
mysql> select * from performance_schema.replication_applier_status_by_worker ;
-------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ----------------------
| CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
-------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ----------------------
| | 1 | 51 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240805 | 0 | | 0000-00-00 00:00:00 |
| | 2 | 52 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240210 | 0 | | 0000-00-00 00:00:00 |
| | 3 | 53 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80235089 | 0 | | 0000-00-00 00:00:00 |
| | 4 | 54 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80191268 | 0 | | 0000-00-00 00:00:00 |
| | 5 | 55 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:75296683 | 0 | | 0000-00-00 00:00:00 |
| | 6 | 56 | ON | | 0 | | 0000-00-00 00:00:00 |
-------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ----------------------
6 rows in set (0.00 sec)
通過本條查詢,可以看到開啟了6個並行進行複製。
通過以上來看,所有一切似乎都正常,並行複製開了,並且查看到CPU,IO,記憶體均沒有達到瓶頸地步。主庫寫binlog日誌大概為2MB/s,這樣的日質量並不是非常高。
從庫也同樣查看了是否存在鎖的情況,也沒有發現。
因此反覆的在查看slave狀態,看能否發現一些細節,結果還真看到了一些異常現象。Relay_Log_Pos
這個參數在頻繁的刷slave狀態時,發現時常會卡著不動(此時已確認沒有看到鎖)。或許問題真正的原因正在這裡!
分析binlog或者relay log日誌,看有啥線索:
[mysql@xxx data]$ mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS relay-bin.001384 --start-position=420090430|more
/!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1/;
/!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0/;
DELIMITER /!/;
# at 420090430
190923 9:24:28 server id 42 end_log_pos 420090282 CRC32 0xd9097eaf GTID last_committed=57148 sequence_number=57149 rbr_only=yes
/!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED//!/;
SET @@SESSION.GTID_NEXT= 'eab7fcac-3cda-11e6-ada8-fa163e648db2:69415610'/!/;
# at 420090495
190923 9:24:28 server id 42 end_log_pos 420090364 CRC32 0x82b57dfd Query thread_id=95 exec_time=0 error_code=0
SET TIMESTAMP=1569201868/!/;
SET @@session.pseudo_thread_id=95/!/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/!/;
SET @@session.sql_mode=1075838976/!/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/!/;
/!\C gbk //!/;
SET @@session.character_set_client=28,@@session.collation_connection=28,@@session.collation_server=8/!/;
SET @@session.lc_time_names=0/!/;
SET @@session.collation_database=DEFAULT/!/;
BEGIN
/!/;
# at 420090577
190923 9:24:28 server id 42 end_log_pos 420090585 CRC32 0x752e27cf Table_map: net
.f_event
mapped to number 108
# at 420090798
190923 9:24:28 server id 42 end_log_pos 420090812 CRC32 0x72b8e10d Table_map: net
.f_eventstorage
mapped to number 245
# at 420091025
190923 9:24:28 server id 42 end_log_pos 420091039 CRC32 0x1797f9d8 Table_map: net
.f_eventstorage
mapped to number 245
# at 420091252
190923 9:24:28 server id 42 end_log_pos 420091106 CRC32 0x8af14ad2 Table_map: net
.f_eventdetail
mapped to number 243
# at 420091319
190923 9:24:28 server id 42 end_log_pos 420091177 CRC32 0xf1ce87c8 Table_map: net
.f_eventoperation
mapped to number 244
# at 420091390
190923 9:24:28 server id 42 end_log_pos 420091244 CRC32 0x586c0b9d Table_map: net
.f_eventaudit
mapped to number 242
# at 420091457
190923 9:24:28 server id 42 end_log_pos 420093382 CRC32 0x505e5408 Update_rows: table id 108
# at 420093595
190923 9:24:28 server id 42 end_log_pos 420098858 CRC32 0x0f404509 Update_rows: table id 245
# at 420099071
190923 9:24:28 server id 42 end_log_pos 420098910 CRC32 0xb8d9ed15 Write_rows: table id 243
# at 420099123
190923 9:24:28 server id 42 end_log_pos 420098966 CRC32 0x3c489a7f Write_rows: table id 244 flags: STMT_END_F
我們查看了中繼日誌relay-bin.001384
卡住的位置號420090430,為設置GTID_NEXT,信息沒什麼用。
繼續看在卡住時刻,資料庫open的是什麼表?
mysql> show open tables where In_use=1;
----------- --------------------- -------- -------------
| Database | Table | In_use | Name_locked |
----------- --------------------- -------- -------------
| net | f_currentxxx | 1 | 0 |
----------- --------------------- -------- -------------
這張表有什麼特殊的麽?查看其表結構
mysql> show create table net.f_currentxxx\G
*************************** 1. row ***************************
Table: f_currentxxx
Create Table: CREATE TABLE f_currentxxx
(
serial
int(20) NOT NULL COMMENT 'xxx',
audittime
bigint(20) NOT NULL COMMENT 'xxx',
type
int(11) DEFAULT NULL COMMENT 'xxx',
severity
int(11) DEFAULT NULL COMMENT 'xxx,
KEY audittime
(audittime
)
) ENGINE=MEMORY DEFAULT CHARSET=gbk COMMENT='xxx'
1 row in set (0.00 sec)
有沒有看到什麼與眾不同?
沒錯,就是表的存儲引擎ENGINE=MEMORY
。MEMORY的表進行主從複製,首先來說如果從庫不做查詢,一點意義沒有,另外對Memory表做複製,性能是非常堪憂的。如果必須進行同步,考慮將表的存儲引擎改為InnoDB
mysql> select table_name from information_schema.tables where TABLE_SCHEMA='net' and ENGINE='memory';
----------------------
| table_name |
----------------------
| f_currentxxx |
----------------------
1 row in set, 6 warnings (0.01 sec)
出於嚴謹,我們將要複製的資料庫進行徹底排查,確實只有這一張表是Memory存儲引擎。
停止複製進程,將選項中添加replicate-ignore-table=net.f_currentxxx
,重新啟動複製進程,觀察slave狀態。
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: xxx.xxx.xxx.xxx
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.009194
Read_Master_Log_Pos: 939698255
Relay_Log_File: relay-bin.001964
Relay_Log_Pos: 444060572
Relay_Master_Log_File: mysql-bin.009027
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: net
Replicate_Ignore_DB: mysql
Replicate_Do_Table:
Replicate_Ignore_Table: net.f_currentxxx
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 444060359
Relay_Log_Space: 180287882098
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 179221
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 42
Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:69497322-107886661
Executed_Gtid_Set: 1264a536-da12-11e9-81ea-005056856ba5:1,
eab7fcac-3cda-11e6-ada8-fa163e648db2:1-71980857
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
我們可以看到net.f_currentxxx表已經被忽略複製。持續觀察一段時間後,Seconds_Behind_Master在逐漸縮小。
中繼日誌應用速度大約5分鐘一個(每個中繼日誌為1GB大小),而主庫binlog日誌大約為10分鐘一個(每個binlog日誌為1GB大小)。
在梳理了整個處理過程後,其實難度不高,主要是要細心,細心去排查每一個想到的點。 在非輕量級的資料庫中問題發生的概率也會隨著量級的增加而增多。而這恰恰是能夠磨練個人的成長。
同時,知識的儲備也要充足,這是進階高手的必要前提!